Running PAUP* on the Beowulf Cluster


PAUP* is not currently a parallel application: a single PAUP* analysis cannot be sped up by performing it in parallel. Researchers using PAUP* on the cluster nevertheless realize speedups, because they often need to run multiple independent analyses, each using a separate instance of PAUP* with different input. Being independent, all of these analyses can run simultaneously. The methods described below let you start multiple analyses at the same time, effectively in parallel. The procedure assumes that you are already familiar with working with PAUP* (on your personal workstation, for example).

1. Organize Your Input Data

  • Overview:
    1. Log into the cluster.
    2. Create a project directory.
    3. Transfer your data files (ending in ".nexus") to the project directory.
    4. Change into the project directory.
    5. If necessary, create a file listing the order of the datasets.
    6. Run prepjobs.paup
  • Login to the cluster and create a project directory to hold the files for this project. For example:
    • mkdir project_abc
  • Transfer all of your data files (in NEXUS file format) to this project directory:
    • NOTE: Make sure files end in ".nexus" (e.g., mydata.nexus)
    • Use sftp or scp to securely transfer files to the cluster.
  • Change into your project directory:
    • cd project_abc
  • If you need to collect the output in a particular order, create a file called "dataset_order" listing one data file per line, in the order you need. For example:
    • Create a file "dataset_order" containing:
      mydata_abc.nexus
      mydata_xyz.nexus
      mydata_qrs.nexus
  • Use prepjobs.paup to set up the proper environment for runjobs.paup (used in Step 2):
    • prepjobs.paup
    • Or, if you created a dataset_order file above (a complete example session follows this list):
      prepjobs.paup dataset_order
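
Putting the list above together, a complete Step 1 session might look like the following. The username jsmith and the hostname cluster.example.edu are placeholders; substitute your own, and adjust the scp command to match how you normally connect:

  # On the cluster: create the project directory
  mkdir project_abc

  # On your workstation: copy the NEXUS files into the project directory
  scp mydata_abc.nexus mydata_xyz.nexus mydata_qrs.nexus jsmith@cluster.example.edu:project_abc/

  # Back on the cluster: change into the project directory, create the
  # dataset_order file (with any text editor) if you need one, and prepare the jobs
  cd project_abc
  prepjobs.paup dataset_order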

2. Start Your Analyses


The runjobs.paup script is used to start your PAUP* jobs. All you need to decide is the range of datasets you would like to analyze: runjobs.paup must be given the number of the starting dataset and the number of the ending dataset. For example, if you have 10 datasets and want to run analyses for datasets 2-5 only, you would type:

runjobs.paup 2 5

Similarly, to run all 10 datasets, you would type:

runjobs.paup 1 10
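
Because the analyses are independent, it should also be possible to cover a non-contiguous selection by invoking runjobs.paup once per range (this assumes each invocation simply submits the jobs for the range it is given). For example, to run only datasets 2 and 7 through 9:

runjobs.paup 2 2
runjobs.paup 7 9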

As it submits each analysis to the batch scheduler, the runjobs.paup script will print a series of "Job IDs", one per job (each analysis is a "job"; you do not need to keep track of these IDs). To see a listing of all of the jobs that are currently active on the cluster, use the qstat command. That is, type:

qstat

When your jobs are finished, they will no longer appear in this list.
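
The exact format of this listing depends on which batch scheduler the cluster runs; under a PBS-style scheduler, for instance, it might look something like the following (the job IDs and names here are purely illustrative):

  Job id           Name         User     Time Use S  Queue
  ---------------  -----------  -------  -------- -  -----
  101.frontend     ds2.paup     jsmith   00:02:13 R  batch
  102.frontend     ds4.paup     jsmith   00:02:10 R  batch

Most schedulers can also restrict the listing to your own jobs (e.g., qstat -u jsmith under PBS).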

3. Collect Your Output Data


When your jobs no longer appear in the qstat listing, you can take a look in the directories for the finished jobs. For example, if the analyses for datasets 2, 4 and 9 appear to be finished, you can look for output in the ds2, ds4 and ds9 directories, respectively, all within the project_abc project directory created above.
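
For example, from within project_abc (the ds1, ds2, ... directories are presumably created for you by prepjobs.paup; the output files inside them depend on the commands in your PAUP* input):

  ls -l ds2 ds4 ds9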

If you can analyze each set of output files separately, you can begin transferring them back to your workstation at any time. If you need to combine the output files into one large output file in order to process them properly, you will need to wait until all of the jobs have finished. At that point, if you need help combining the output files, please contact us for assistance.
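
As with the transfer in Step 1, you can use scp or sftp to copy results back. For example, from your workstation (again substituting your own username and the cluster's hostname for the placeholders):

  scp -r jsmith@cluster.example.edu:project_abc/ds2 .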
