Category Archives: Bioinformatics

LD50 & dose response calculations with R

In toxicology, the dose of a pathogen, radiation, or toxin required to kill half the members of a tested population of animals or cells is called the lethal dose, 50%, or LD50. This measure is also known as the lethal concentration, LC50, or properly after a specified test duration, the LCt50 indicating the lethal concentration and time of exposure. LD50 figures are frequently used as a general indicator of a substance’s acute toxicity. A lower LD50 is indicative of increased toxicity.

More generally, the point at which 50% response of studied organisms to range of doses of a substance (e.g., agonist, antagonist, inhibitor, etc.) to any response, from change in behavior or life history characteristics up to and including death can be described by the methods described in this chapter. The procedures outlined below assume that there is but one inflection point, i.e., an “s-shaped” curve, either up or down; if there are more than one inflection points, then the logistic equations described will not fit the data well and other choices need to be made (see Di Veroli et al 2015). We will use the drc package (Ritz et al 2015).

Example

After starting R, load the drc library.

library(drc)

Consider some hypothetical 24-hour survival data for yeast exposed to salt solutions. Let resp equal the variable for frequency of survival (e.g., estimated from OD600 readings) and NaCl equal the millimolar (mm) salt concentrations or doses.

At the R prompt type

resp=c(1,1,1,.9,.7,.3,.4,.2,0,0,0)
NaCl=seq(0,1000,100)
#To check to make sure that the sequence has been correctly created; alternatively, just enter the values.
NaCl [1] 0 100 200 300 400 500 600 700 800 900 1000
#Make a plot
plot(NaCl,resp,pch=19,cex=1.2,col=”blue”,xlab=”NaCl [mm]”,ylab=”Survival frequency”)

And here is the plot of the simulated data (Figure 1).

Figure 1. Plot of yeast survival in different amounts of salt, simulated data

Note the sigmoidal shape — we’ll need an logistic equation to describe the relationship between survival of yeast and NaCl doses.

The equation for the four parameter logistic curve, also called the Hill-Slope model, is (Figure 2)

Figure 2. Equation of the four parameter logistic curve.

where c is the parameter for the lower limit of the response, d is the parameter for the upper limit of the response, e is the relative EC50, the dose fitted half-way between the limits c and d, and b is the relative slope around the EC50. The slope, b, is also known as the Hill slope. Because this experiment included a dose of zero, a three parameter logistic curve would be appropriate. The equation simplifies to (Figure 3)

Figure 3. Equation of the three parameter logistic curve

EC50 from 4 parameter model

First, make a data frame

dose = data.frame(NaCl,resp)

Next, call up a function, drm, from the drc library and specify the model as the four parameter logistic equation, specified as LL.4(). We follow with a call to the summary command to retrieve output from the drm function. Note that we are using the four-parameter logistic equation (Figure 2)

model.dose1 = drm(dose,fct=LL.4())
summary(model.dose1)

And here is the output from R.

Model fitted: Log-logistic (ED50 as parameter) (4 parms)

Parameter estimates:

                   Estimate  Std. Error    t-value  p-value
b:(Intercept)      3.753415    1.074050   3.494636   0.0101
c:(Intercept)     -0.084487    0.127962  -0.660251   0.5302
d:(Intercept)      1.017592    0.052460  19.397441   0.0000
e:(Intercept)    492.645128   47.679765  10.332373   0.0000

Residual standard error:

0.0845254 (7 degrees of freedom)

The EC50, technically, because the data were for survival, the LD50 is e = 492.65 mM NaCl, where e, again, is the dose fitted half-way between the limits c and d.

You should always plot the predicted line from your model against the real data and inspect the fit.

At the R prompt type

lines(dose,predict(model.dose1, data.frame(x=dose)),col=”red”)

As long as the plot you made in earlier steps is still available, R will add the line specified in the lines command. Here is the plot with the predicted logistic line displayed (Figure 4).

Figure 4. Plot with the predicted logistic line displayed

While there are additional steps we can take to decide is the fit of the logistic curve was good to the data, visual inspection suggests that indeed the curve fits the data reasonably well.

More work to do

Because the EC50 calculations are an estimate, we should also obtain confidence intervals. The drc library provides a function called ED which will accomplish this. We can also ask what the survival was at 10% and 90% in addition to 50%, along with the confidence intervals for each.

At the R prompt type

ED(model.dose1,c(10,50,90), interval=”delta”)

And the output is shown below.

Estimated effective doses
(Delta method-based confidence interval(s))

Estimate Std. Error   Lower   Upper
1:10  274.348     38.291 183.803  364.89
1:50  492.645     47.680 379.900  605.39
1:90  884.642    208.171 392.395 1376.89

Thus, the 95% confidence interval for the EC50 calculated from the four parameter logistic curve was between the lower limit of 379.9 and an upper limit of 605.39 mm NaCl.

EC50 from 3 parameter model

Looking at the summary output from the four parameter logistic function we see that the value for c was -0.085 and p-value was 0.53, which suggests that the lower limit was not statistically different from zero. We would expect this given that the experiment had included a control of zero mm added salt. Thus, we can explore by how much the EC50 estimate changes when the additional parameter c is no longer estimated by calculating a three parameter model with LL.3(). At the R prompt type

model.dose2 = drm(dose,fct=LL.3())
summary(model.dose2)

And here is the output.

Model fitted: Log-logistic (ED50 as parameter) with lower limit at 0 (3 parms)

Parameter estimates:

Estimate Std. Error   t-value p-value
b:(Intercept)   4.46194    0.76880   5.80378   4e-04
d:(Intercept)   1.00982    0.04866  20.75272   0e+00
e:(Intercept) 467.87842   25.24633  18.53253   0e+00

Residual standard error:

0.08267671 (8 degrees of freedom)

The EC50 is the value of e: 467.88 mM NaCl.

How do the four and three parameter models compare? We can rephrase this as as statistical test of fit; which model fits the data better, a three parameter or a four parameter model?

At the R prompt type

anova(model.dose1, model.dose2)

The output is below

1st model
fct:      LL.3()
2nd model
fct:      LL.4()

ANOVA table

          ModelDf      RSS Df F value p value
2nd model       8 0.054684
1st model       7 0.050012  1  0.6539  0.4453

Because the p-value is much greater than 5% we may conclude that the fit of the four parameter model was not significantly better than the fit of the three parameter model. Thus, based on your criteria of model fit (e.g., select a more complicated model if it demonstrates an improvement over a model with fewer predictors), we would conclude that the three parameter model is the preferred model.

The plot below (Figure 5) now includes the fit of the four parameter model (red line) and the three parameter model (green line) to the data.

Figure 5. Plot now includes the fit of the four parameter model (red line) and the three parameter model (green line) to the data.

The R command to make this addition to our active plot was

lines(dose,predict(model.dose2, data.frame(x=dose)),col=”green”)

We continue with our analysis of the three parameter model and produce the confidence intervals for the EC50 (modify the ED() statement above for model.dose2 in place of model.dose1).

Estimated effective doses
(Delta method-based confidence interval(s))

     Estimate Std. Error   Lower  Upper
1:10  285.937     33.154 209.483 362.39
1:50  467.878     25.246 409.660 526.10
1:90  765.589     63.026 620.251 910.93

Thus, the 95% confidence interval for the EC50 calculated from the three parameter logistic curve was between the lower limit of 409.7 and an upper limit of 526.1 mm NaCl. The difference between upper and lower limits was 116.4 mm NaCl, a smaller difference than the interval calculated for the 95% confidence intervals from the four parameter model (225.5 mm NaCl). This demonstrates the estimation trade-off: more parameters to estimate reduces the confidence in any one parameter estimate.

Additional notes of EC50 calculations

Care must be taken that the model fits the data well. What if we did not have observations throughout the range of the sigmoidal shape? We can explore this by taking a subset of the data. At the R prompt type

dd = dose[1:6,]

Here, all values greater than dose 500 were dropped (see below for a more general approach to subset).

> dd
resp dose
1  1.0    0
2  1.0  100
3  1.0  200
4  0.9  300
5  0.7  400
6  0.3  500

and the plot does not show an obvious sigmoidal shape (Figure 6)

Figure 6. Plot of subset of data. No longer “sigmoidal” curve.

We run the three parameter model again, this time on the subset of the data.

model.dosedd = drm(dd,fct=LL.3())
summary(model.dosedd)

Output from the results are

Model fitted: Log-logistic (ED50 as parameter) with lower limit at 0 (3 parms)

Parameter estimates:

                Estimate Std. Error    t-value p-value
b:(Intercept)   6.989842   0.760112   9.195801  0.0027
d:(Intercept)   0.993391   0.014793  67.153883  0.0000
e:(Intercept) 446.882542   5.905728  75.669344  0.0000

Residual standard error:

0.02574154 (3 degrees of freedom)

Conclusion? The estimate is different, but only just so, 447 vs. 468 mm NaCl. Thus, within reason, the drc function performs well for the calculation of EC50. Not all tools available to the student will do as well.

Use the subset function instead:

dd = subset(dose, NaCl <= 500)

References

Di Veroli G. Y., Fornari C., Goldlust I., Mills G., Koh S. B., Bramhall J. L., Richards, F. M., Jodrell D. I. (2015) An automated fitting procedure and software for dose-response curves with multiphasic features. Scientific Reports 5: 14701. (doi: 10.1038/srep14701)

Ritz, C., Baty, F., Streibig, J. C., Gerhard, D. (2015) Dose-Response Analysis Using R. PLOS ONE, 10(12),
e0146021 (doi: 10.1371/journal.pone.0146021

How to install R on Windows 10 PC

  1. Go to https://cran.r-project.org/
  2. Select Download R for Windows
  3. Select “base” and then click on “Download R 3.4.0 for Windows” to get the latest version. As of June 2017 that would be

R-3.4.0

  1. Download the file; once completed, click on the file to begin installation. Accept defaults.
  2. After R has been installed, start the application to work with the R statistical software.

Next: How to install Rcmdr and other R packages

How to install R on your Mac

You must install XQuartz, an X windowing system, before you install R.

Install XQuartz on your Mac

  1. Got to https://www.xquartz.org/
  2. Select the latest version. As of June 2017 that would be

XQuartz-2.7.11.dmg

  1. Download the file; once completed, click on the file to begin installation. Accept defaults.
  2. If your computer replies with a warning message about installing from unknown sources, you’ve run into “Gatekeeper.” Click here for help with Gatekeeper.
  3. After XQuartz has been installed, it is good practice to restart your computer before proceeding.
  4. Now you can install R.

Note: After updating the operating system, it is recommended that you reinstall XQuartz.

Here’s what Apple has to say about why you need to install XQuartz.

Install R on your Mac

  1. Go to https://cran.r-project.org/
  2. Select Download R for (Mac) OS X
  3. Select the latest version. As of June 2017 that would be

R-3.4.0.pkg

  1. Download the file; once completed, click on the file to begin installation. Accept defaults.
  2. If your computer replies with a warning message about installing from unknown sources, you’ve run into “Gatekeeper.” Click here for help with Gatekeeper.
  3. After R has been installed, start the application to work with the R statistical software.

Next: How to install Rcmdr and other R packages

 

R notes: How to get genetic distances from a tree

This note is about extracting the patristic distances from a Newick tree for a set of OTUs. Note: a patristic distance is basically the sum of the branch lengths linking two nodes in a tree.

A number of fantastic packages in R are available to work with phylogeny and sequences. Similarly, a number of folks have been kind enough to share their notes on use of R and phylogenetics — this note derives from their work and is presented here to assist me with teaching students. In no particular order, references used are:

We will use the “ape” package along with a general package called “spaa” which helps manipulate output. The result is a CSV text file with three columns (number, pairs of OTU, distances), columns separated with commas.

In order to use R it must be installed on your computer.  Click here for instructions to  install R on your Mac computer or here for installation on a Windows 10 computer.

About the example data

In addition to a working version of R on your computer, you need to have saved your tree in Newick format (and recall where the file is on your computer :-). The example tree

(Mouse:0.0604463,((Alligator:0.0407394,Chicken:0.038893):0.0554883,Xenopus:0.216882):0.100985,Human:0.0320104);

accession numbers: NP_001521,  then blastp retrieved NP_001300848, NP_989628, XP_019349624, NP_001080449

aligned sequences by ClustalW (default settings), tree built on distances (Jones-Taylor-Thornton) and Phylip neighbor joining method within Unipro UGENE workbench.

The R script

Start your R application software.

If you have not downloaded and installed ape and spaa, do so now. Click here for instructions for Macs and here for Windows.

Here’s the script in R (the “#” indicates comments and are not interpreted by R — I’ve added blue color to comments). Don’t type the “>”, that’s the R prompt. Type everything after the “>” exactly as written (yes, you can change the object names).

#Get patristic distances. First, load the ape library

> library(ape)
#Load your phylogenetic tree, Newick format. This example is based on Clustal Omega-aligned HIF1A sequences obtained by blastp. Note that you would need to change the text pointing to the folder location
# this command finds the working directory
> getwd()
#use this command to change to your BI308L folder — note, this is just an example, yours will be different!
> setwd(“/my BI308L folder/Trees”)
#because I set the working directory with setwd, I have access to all files in that folder. Here, I load my newick file
> mytree = read.tree(“HIF1A.nwk”)
#Check that the tree file loaded correctly by plotting it (see below for the image)
> plot(mytree, type=”phylogram”, edge.width = 2)
#Add the pairwise distances; A patristic distance is the sum of the lengths of the branches that link two nodes in a tree
> PatristicDistMatrix = cophenetic.phylo(mytree)
#Display the pairwise distances from the tree. A square matrix results. Print the distance matrix.
> PatristicDistMatrix

              Mouse Alligator   Chicken   Xenopus     Human
Mouse     0.0000000 0.2576590 0.2558126 0.3783133 0.0924567
Alligator 0.2576590 0.0000000 0.0796324 0.3131097 0.2292231
Chicken   0.2558126 0.0796324 0.0000000 0.3112633 0.2273767
Xenopus   0.3783133 0.3131097 0.3112633 0.0000000 0.3498774
Human     0.0924567 0.2292231 0.2273767 0.3498774 0.0000000

Now, I could get impatient and then grab (copy/paste) the distances from the matrix and place into my Excel file. I’d then have to edit the file to get the distances into the correct pair-wise format. A messy step, not recommended.

Continue to read for better solution

Install and load the spaa library

library(spaa)
>disMatrix <-as.dist(PatristicDistMatrix) #tell R that we are working with a distance object
>outfile <- dist2list(disMatrix)
>outfile #if all go’s well, you will see three columns with 25 rows of data like below
col         row     value

1 Mouse     Mouse 0.0000000
2 Alligator Mouse 0.2576590

25 Human    Human 0.0000000
>write.csv(outfile,”outfile.csv”, col.names=NA) # this command will write a text only file called outfile.csv to your working directory. You can then import it to Excel or other spreadsheet application. The columns are  separated by commas (hence the ?csv)

Here’s the plot of “mytree”, unrooted, from R

NJ gene tree (HIF1A), unrooted