This note is about extracting the patristic distances from a Newick tree for a set of OTUs. Note: a patristic distance is basically the sum of the branch lengths linking two nodes in a tree.
A number of fantastic packages in R are available to work with phylogeny and sequences. Similarly, a number of folks have been kind enough to share their notes on use of R and phylogenetics — this note derives from their work and is presented here to assist me with teaching students. In no particular order, references used are:
- The Molecular Ecologist
- Liam Revell at http://blog.phytools.org/
- Paradis, E. (2011) Analysis of Phylogenetics and Evolution with R. Springer (Amazon)
We will use the “ape” package along with a general package called “spaa” which helps manipulate output. The result is a CSV text file with three columns (number, pairs of OTU, distances), columns separated with commas.
About the example data
In addition to a working version of R on your computer, you need to have saved your tree in Newick format (and recall where the file is on your computer :-). The example tree
accession numbers: NP_001521, then blastp retrieved NP_001300848, NP_989628, XP_019349624, NP_001080449
aligned sequences by ClustalW (default settings), tree built on distances (Jones-Taylor-Thornton) and Phylip neighbor joining method within Unipro UGENE workbench.
The R script
Start your R application software.
If you have not downloaded and installed ape and spaa, do so now. Click here for instructions for Macs and here for Windows.
Here’s the script in R (the “#” indicates comments and are not interpreted by R — I’ve added blue color to comments). Don’t type the “>”, that’s the R prompt. Type everything after the “>” exactly as written (yes, you can change the object names).
#Get patristic distances. First, load the ape library
#Load your phylogenetic tree, Newick format. This example is based on Clustal Omega-aligned HIF1A sequences obtained by blastp. Note that you would need to change the text pointing to the folder location
# this command finds the working directory
#use this command to change to your BI308L folder — note, this is just an example, yours will be different!
> setwd(“/my BI308L folder/Trees”)
#because I set the working directory with setwd, I have access to all files in that folder. Here, I load my newick file
> mytree = read.tree(“HIF1A.nwk”)
#Check that the tree file loaded correctly by plotting it (see below for the image)
> plot(mytree, type=”phylogram”, edge.width = 2)
#Add the pairwise distances; A patristic distance is the sum of the lengths of the branches that link two nodes in a tree
> PatristicDistMatrix = cophenetic.phylo(mytree)
#Display the pairwise distances from the tree. A square matrix results. Print the distance matrix.
Mouse Alligator Chicken Xenopus Human
Mouse 0.0000000 0.2576590 0.2558126 0.3783133 0.0924567
Alligator 0.2576590 0.0000000 0.0796324 0.3131097 0.2292231
Chicken 0.2558126 0.0796324 0.0000000 0.3112633 0.2273767
Xenopus 0.3783133 0.3131097 0.3112633 0.0000000 0.3498774
Human 0.0924567 0.2292231 0.2273767 0.3498774 0.0000000
Now, I could get impatient and then grab (copy/paste) the distances from the matrix and place into my Excel file. I’d then have to edit the file to get the distances into the correct pair-wise format. A messy step, not recommended.
Continue to read for better solution
Install and load the spaa library
>disMatrix <-as.dist(PatristicDistMatrix) #tell R that we are working with a distance object
>outfile <- dist2list(disMatrix)
>outfile #if all go’s well, you will see three columns with 25 rows of data like below
col row value
1 Mouse Mouse 0.0000000
2 Alligator Mouse 0.2576590
25 Human Human 0.0000000
>write.csv(outfile,”outfile.csv”, col.names=NA) # this command will write a text only file called outfile.csv to your working directory. You can then import it to Excel or other spreadsheet application. The columns are separated by commas (hence the ?csv)
Here’s the plot of “mytree”, unrooted, from R