Bayesian Dating on the cluster


Jeff Thorne has written a program (multidivtime) that allows for the estimation of divergence times given a phylogeny and some sequence data (or AA data). For theoretical background on the methodology refer to the references below. Frank Rutschmann has provided a friendly manual on how to use multidivtime, estbranches, and PAML to estimate divergence times (Rutschmann 2004). You should read the manual if this is new to you.

The two scripts described below, prepjobs.multidiv and runjobs.multidiv simplify the implementation of this methodology on the Beowulf cluster at the University of Missouri Saint Louis.

I. The first script, prepjobs.multidiv, estimates the parameters for the model of evolution (program baseml) and uses these parameters to estimate the maximum likelihood of the branch lengths for the tree, and the variance-covariance matrix (estbranches). This script lumps steps 3-6 from the manual.

To use this script:

  1. You need to place the following files into an empty folder (use the files provided here as they contain the lines required to run this script).
    • baseml.ctl You may want to change some parameters in this file although there is only one model of evolution (F84) that currently works with the dating software. See the manual for details. DO NOT CHANGE THE LINES REFERRING TO FILE NAMES.
    • hmmcntrl.dat
    • inseed
    • your data file named gene1.seq - if you are using multiple partitions create a file for each partition and name them gene1.seq, gene2.seq, gene3.seq, etc. Follow this example for file format.
    • your tree files labeled gene1.tree, gene2.tree, etc. (If you are using the same tree for all partitions make copies of the file for each partition). Follow this example for file format (note that outgroup taxa are placed at the end and the number of taxa in the outgroup is indicated by the second number in the first line).
    • your master tree file named finaltree (this is the tree you want to get dates for, if it is the same as above just make a copy and name it finaltree).
    • Tip: In your tree files include at least two outgroup taxa but list only one in the head line, as the software will remove the outgroup in the dating step. This will allow you to get dates for your ingroup root node or to use this node as a calibration point.
  2. To run the first script type prepjobs.multidiv and follow the screen instructions.
  3. When the job has ended you will receive an email message. Several files and a folder for each gene (gene1, gene2, etc) should be now in your working folder. Refer to the output file prep_multidiv.out for log output of the analyses included in the first script.
  4. Read the file lnL_perform that was created in your directory and contains the likelihood scores obtained with both baseml and estbranches that the script uses. If the likelihood scores are very different, you may need to either change parameters in the baseml.ctl file or rerun the first script (see step 6 in Rutschmann 2004).

II. The second script, called runjobs.multidiv, runs multidivtime software in the cluster, which performs a Bayes MCMC analysis to obtain the posterior distributions of substitution rates and the divergence times (step 7 in Rutschmann 2004).

To use this script:

  1. Have the following files in a folder:
    • oest.Gene1, oest.Gene2 These files were produced by the first script and should be in your directory already. It contains the tree with estimated branch lengths and variance-covariance matrix; the program estbranches should have removed the outgroup.
    • finaltree (see above)
    • inseed
    • multicntrl.dat This is the control file that specifies the MCMC parameters and constrains.

  2. Edit the multicntrl.dat file. The following parameters must be specified for your analysis. You may use the defaults for the other parameters (but see Rutschmann 2004 for details)
    • a) Specify the number of genes and the respective oest.gene# file names;
    • b) Specify the priors for the age of the clade and determine the age unit (rttm and rttmsd). rttm is "the mean of the prior distribution for the time separating the ingroup root from the present". rttmsd is the standard deviation of this prior distribution. Choice of the values for rttm and rttmsd depends on what is known by the user regarding the sequences and organisms being studied. (...) You are free to set the time units to be what you want. Of course, you want to then adjust the units for the rates to correspond to the units for the times. For example, let's say that rttm should be about 20 million years and rtrate should be about 0.1 changes per 10 million years. You could make rttm equal to 2.0 where 1.0 time unit is 10 million years. You could then make rtrate equal to 0.1. " Rttm should stay "between 0.1 and 10 time units. This is the range where the MCMC proposal parameters seem to be best for achieving convergence of the Markov chain" (Rutschmann 2004).
    • c) Specify the parameter bigtime to the time units used (it should be higher than rttm).
    • c) Specify the priors for the rate of molecular evolution at the root node (rtrate and rtratesd). "Choosing a reasonable value of rtrate is difficult. My usual strategy is not statistically rigorous but it seems to work reasonably well. First, I use the estbranches program to estimate amounts of evolution from the ingroup root to the ingroup tips [tree with branch lengths within the oest.gene# files outputed by the prepjobs.multidiv script]. These estimated amounts of evolution from the ingroup root to the ingroup tips will differ depending on the tips. I usually pick an amount that is close to the median of the amount of evolution for the different tips. I'll refer to this amount as X. Remembering that the amount of evolution is a rate multiplied by a time, I set rtrate to X divided by rttm. For the value of rtratesd, a big standard deviation (e.g., setting rtratesd to equal rtrate) may be reasonable when there is little knowledge about evolutionary rates" (Rutschmann 2004). This value for rtrate can be calculated using this script with the software R (http://cran.r-project.org/; requires package Ape) and the trees in the oest.gene# files.
    • d) Specify your age constraints. The first script has produced a file called nodes_numbers that contains your tree with the nodes numbers (note that in this tree the outgroup was removed). You need to specify the number of constraints and the date for each constrained node. Use L for a minimum age for the node and U for a maximum age. Remember to use the same time units as specified above (in b).

    Example:

    3 ... number of constraints on node times

    L 5 1.0 ... constrain node 5 to be at least 1 time unit of age

    U 7 8.0 ... constrain node 7 to be no more than 8 time units of age

  3. Run the second script by typing runjobs.multidiv and follow instructions. You will be asked to specify an output file. You should consider running this script at least twice to check for consistency. It may also be important to try different parameters in the multicntrl.dat file.

Links:

Frank Rutschmann's manual.

Multidivtime website.

References:

Kishino, H., Thorne, J.L., Bruno, W.J., 2001. Performance of a divergence time estimation method under a probabilistic model of rate evolution. Mol. Biol. Evol. 18, 352–361.

Thorne, J.L., Kishino, H., Painter, I.S., 1998. Estimating the rate of evolution of the rate of molecular evolution. Mol. Biol. Evol. 15: 1647–1657.

Thorne, J.L., Kishino, H., 2002. Divergence time and evolutionary rate estimation with multilocus data. Syst. Biol. 51, 689–702.