
R notes: How to get genetic distances from a tree

This note is about extracting the patristic distances from a Newick tree for a set of OTUs. Note: a patristic distance is basically the sum of the branch lengths linking two nodes in a tree.

A number of fantastic packages in R are available to work with phylogeny and sequences. Similarly, a number of folks have been kind enough to share their notes on use of R and phylogenetics — this note derives from their work and is presented here to assist me with teaching students. In no particular order, references used are:

We will use the “ape” package along with a general package called “spaa”, which helps reshape the output. The result is a comma-separated (CSV) text file with one row per pair of OTUs: a row number, the two OTU names, and their distance.

R must be installed on your computer before you begin. Click here for instructions to install R on your Mac computer or here for installation on a Windows 10 computer.

About the example data

In addition to a working version of R on your computer, you need to have saved your tree in Newick format (and remember where the file is on your computer :-). The example tree is:

(Mouse:0.0604463,((Alligator:0.0407394,Chicken:0.038893):0.0554883,Xenopus:0.216882):0.100985,Human:0.0320104);

Accession numbers: starting from NP_001521, blastp retrieved NP_001300848, NP_989628, XP_019349624, and NP_001080449.

The sequences were aligned with ClustalW (default settings), and the tree was built from Jones-Taylor-Thornton distances with the Phylip neighbor-joining method within the Unipro UGENE workbench.
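As a quick check on what a patristic distance means, a few of the sums can be worked out by hand from the branch lengths in the example tree. The sketch below uses base R only. The variable names are my own labels for the branches, not anything defined by ape.

```r
# Branch lengths copied from the example Newick string.
# The variable names are illustrative labels, not part of any package.
mouse     <- 0.0604463
human     <- 0.0320104
alligator <- 0.0407394
chicken   <- 0.038893
xenopus   <- 0.216882
stem_alligator_chicken <- 0.0554883  # internal branch above (Alligator, Chicken)
stem_clade             <- 0.100985   # internal branch above ((Alligator, Chicken), Xenopus)

# A patristic distance is the sum of the branches on the path between two tips
d_alligator_chicken <- alligator + chicken
d_mouse_human       <- mouse + human
d_mouse_xenopus     <- mouse + stem_clade + xenopus
d_mouse_alligator   <- mouse + stem_clade + stem_alligator_chicken + alligator

print(c(Alligator_Chicken = d_alligator_chicken,
        Mouse_Human       = d_mouse_human,
        Mouse_Xenopus     = d_mouse_xenopus,
        Mouse_Alligator   = d_mouse_alligator))
```

These four sums should agree with the corresponding entries of the distance matrix that R reports for this tree later in this note.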

The R script

Start your R application software.

If you have not downloaded and installed ape and spaa, do so now. Click here for instructions for Macs and here for Windows.
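If you prefer to check from the R console, here is a minimal sketch that reports which of the two packages are not yet installed. The helper name missing_packages is hypothetical (it is not part of ape or spaa). The install step is left commented out because it needs an internet connection.

```r
# Hypothetical helper: return the names of packages that cannot be loaded
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("ape", "spaa"))
to_install  # an empty result means both packages are already available

# Uncomment the next line to install whatever is missing from CRAN:
# if (length(to_install) > 0) install.packages(to_install)
```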

Here’s the script in R. The “#” marks a comment, which R does not interpret. Don’t type the “>”; that’s the R prompt. Type everything after the “>” exactly as written (although you can change the object names).

#Get patristic distances. First, load the ape library

> library(ape)
#Load your phylogenetic tree, Newick format. This example is based on Clustal Omega-aligned HIF1A sequences obtained by blastp. Note that you would need to change the text pointing to the folder location
# this command finds the working directory
> getwd()
#use this command to change to your BI308L folder (note: this is just an example, yours will be different!)
> setwd("/my BI308L folder/Trees")
#because I set the working directory with setwd, I have access to all files in that folder. Here, I load my newick file
> mytree = read.tree("HIF1A.nwk")
#Check that the tree file loaded correctly by plotting it (see below for the image)
> plot(mytree, type="phylogram", edge.width = 2)
#Compute the pairwise distances; a patristic distance is the sum of the lengths of the branches that link two nodes in a tree
> PatristicDistMatrix = cophenetic.phylo(mytree)
#Print the pairwise distances from the tree; the result is a square matrix
> PatristicDistMatrix

              Mouse Alligator   Chicken   Xenopus     Human
Mouse     0.0000000 0.2576590 0.2558126 0.3783133 0.0924567
Alligator 0.2576590 0.0000000 0.0796324 0.3131097 0.2292231
Chicken   0.2558126 0.0796324 0.0000000 0.3112633 0.2273767
Xenopus   0.3783133 0.3131097 0.3112633 0.0000000 0.3498774
Human     0.0924567 0.2292231 0.2273767 0.3498774 0.0000000
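If you do not have a Newick file at hand, the same matrix can be reproduced by passing the example tree as a text string, since read.tree() in ape accepts a text argument. This is a sketch for checking your setup. The object names newick, example_tree and example_dist are my own.

```r
library(ape)

# The example tree from this note, supplied as text instead of a file
newick <- "(Mouse:0.0604463,((Alligator:0.0407394,Chicken:0.038893):0.0554883,Xenopus:0.216882):0.100985,Human:0.0320104);"
example_tree <- read.tree(text = newick)

# Same call as in the script; rows and columns are named by tip label
example_dist <- cophenetic.phylo(example_tree)

# Look up single pairs by name rather than by position
example_dist["Alligator", "Chicken"]
example_dist["Mouse", "Human"]
```

Indexing by OTU name avoids mistakes if the tips appear in a different order in your own tree.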

Now, I could get impatient and then grab (copy/paste) the distances from the matrix and place them into my Excel file. I’d then have to edit the file to get the distances into the correct pair-wise format. A messy step, not recommended.

Read on for a better solution.

Install and load the spaa library

> library(spaa)
> disMatrix <- as.dist(PatristicDistMatrix) #tell R that we are working with a distance object
> outfile <- dist2list(disMatrix)
> outfile #if all goes well, you will see three columns with 25 rows of data like below
         col       row     value
1      Mouse     Mouse 0.0000000
2  Alligator     Mouse 0.2576590
...
25     Human     Human 0.0000000
> write.csv(outfile, "outfile.csv") # this command writes a text-only file called outfile.csv to your working directory. You can then import it into Excel or another spreadsheet application. The columns are separated by commas (hence the .csv extension)
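Note that dist2list() lists every ordered pair, including each OTU against itself, which is why there are 25 rows for 5 OTUs. If you only want the 10 unique pairs, one option is base R's lower.tri(). The following is a minimal sketch, not part of spaa. The matrix is retyped from the printed output above so the block runs on its own, and the names otus, otu1, otu2, distance and unique_pairs are my own.

```r
# Distance matrix retyped from the printed output (5 OTUs)
otus <- c("Mouse", "Alligator", "Chicken", "Xenopus", "Human")
PatristicDistMatrix <- matrix(
  c(0.0000000, 0.2576590, 0.2558126, 0.3783133, 0.0924567,
    0.2576590, 0.0000000, 0.0796324, 0.3131097, 0.2292231,
    0.2558126, 0.0796324, 0.0000000, 0.3112633, 0.2273767,
    0.3783133, 0.3131097, 0.3112633, 0.0000000, 0.3498774,
    0.0924567, 0.2292231, 0.2273767, 0.3498774, 0.0000000),
  nrow = 5, byrow = TRUE, dimnames = list(otus, otus)
)

# Row/column positions of the cells below the diagonal (each pair once)
idx <- which(lower.tri(PatristicDistMatrix), arr.ind = TRUE)

unique_pairs <- data.frame(
  otu1     = rownames(PatristicDistMatrix)[idx[, "row"]],
  otu2     = colnames(PatristicDistMatrix)[idx[, "col"]],
  distance = PatristicDistMatrix[idx]
)

nrow(unique_pairs)  # 5 OTUs give choose(5, 2) = 10 unique pairs
unique_pairs
```

To save this shorter table, write.csv(unique_pairs, "unique_pairs.csv", row.names = FALSE) would write it to the working directory without the row-number column.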

Here’s the plot of “mytree”, unrooted, from R

NJ gene tree (HIF1A), unrooted

Notes on Cambrian Explosion and origins of genes

One of the benefits of teaching at a university is that you need to stay current in your field and also be able to explain new findings in the context of larger issues. One can only do so much, but anyone who teaches about evolution clearly needs to keep up with the big historical concepts like transitional fossils and the Cambrian Explosion. This is not my field; I’m more comfortable with evolutionary genetics. But in preparing for a lecture I did a little work on these areas, and I list here some of the references I found useful.

Erwin et al (2011) The Cambrian Conundrum: Early Divergence and Later Ecological Success in the Early History of Animals. Science 334(6059):1091-1097. Puts new fossil findings into context with evidence for environmental changes. Suggests role for acquisition of new forms of regulation of development as key.

Fossils come in to land — Covers a point-of-view debate over whether fossils deemed to be early marine organisms found in rocks of the Ediacaran period in South Australia were instead evidence for fossilized soils. This would suggest that these Ediacaran organisms lived on land. If true, that’s an invasion of land much earlier than the dates in textbooks.

Origins of new genes http://www.nature.com/nrg/journal/v4/n11/abs/nrg1204.html

Lyson et al (2010) Transitional fossils and the origins of turtles. Biology Letters 6(6):830-833. Discusses the role of new fossils applied to phylogeny reconstruction and adds to the debate over whether turtles form a clade with Diapsids or Archosaurs or fall outside these relationships. Discusses differences in results and implications between morphological and molecular datasets.