README file for puzzleboot version 1.03

What is it for?

puzzleboot is a UNIX shell-script program that allows the distance matrix option of PUZZLE 4.0 to be used in a bootstrap analysis with PHYLIP programs (something PUZZLE was not originally designed to do).

When should you use it?

If you wish to calculate distance matrices for a bootstrapped dataset, you may want to use puzzleboot. One of the options of PUZZLE 4.0 is to calculate a maximum likelihood distance matrix from a nucleotide or amino acid dataset using complex (and probably realistic) models of nucleotide or protein evolution. PUZZLE 4.0 is particularly useful for protein distance analysis since it offers several amino acid substitution models (Dayhoff, JTT, BLOSUM 62 and mtREV) and can correct for among-site rate variation (using a gamma-distribution model, a gamma+invariable sites model or a variable+invariable sites model).

Although PUZZLE 4.0 behaves like a PHYLIP program in many respects, it does not allow the analysis of multiple datasets (the 'M' option of many PHYLIP programs) that is necessary for bootstrap analysis. puzzleboot was therefore created to trick PUZZLE into analysing multiple datasets. Although it was originally created for protein distance bootstrap analysis, it can equally be used for the DNA distance analyses that PUZZLE 4.0 can perform (see the PUZZLE manual for more details).

What you need:

You need a computer running the UNIX operating system with PUZZLE 4.0 installed. It is also helpful if this computer has the PHYLIP programs installed, since puzzleboot is used after SEQBOOT and before programs like FITCH or NEIGHBOR. Having PHYLIP running on some platform is essential for the use of puzzleboot, although it need not be in the same UNIX environment.

Before you use puzzleboot - IMPORTANT:

Puzzleboot writes lots of temporary files and then automatically deletes them after using them.
Specifically, at the end of its run, puzzleboot will delete any file called outdist, outfile or outtree that is present in the directory. Therefore we STRONGLY RECOMMEND that you create a separate directory to run puzzleboot in, so that files generated by previous PUZZLE runs are not deleted.

Installation note:

There is a line in puzzleboot which must be edited to reflect the location of your version of PUZZLE 4.0. This line now reads:

    PUZZLE=/du/opt/puzzle-4.0.2/bin/puzzle

Before running the program, replace /du/opt/puzzle-4.0.2/bin/puzzle with the absolute path of PUZZLE 4.0 on your machine (if you do not know this, ask your system administrator).

How to use it:

Generally, you will want to run SEQBOOT on your dataset to generate N bootstrap resampled datasets. The file containing the bootstrapped datasets (the 'outfile' of SEQBOOT) is then used directly with puzzleboot. When puzzleboot is complete, it will have generated a file containing N concatenated distance matrices corresponding to the N datasets in the original SEQBOOT outfile. You then use one of the distance-based tree-building programs from PHYLIP (such as NEIGHBOR or FITCH) on this output file.

Before you run puzzleboot, you must write a file called puzzle.cmds in your directory that contains the settings you would like PUZZLE to run with. Please refer to the PUZZLE 4.0 manual for all of the options available in running the program. What follows is a brief description of how to write a puzzle.cmds file.

PUZZLE runs like a PHYLIP program by using single letters to toggle between various options on the menu of the program. Typically, you will want to change PUZZLE to only make a distance matrix and you will want to change the models of evolution. The file called puzzle.cmds simply contains each single-letter command (or value) you would normally type to reach the correct settings, separated by carriage returns, that is, one entry per line.
A typical puzzle.cmds file would look like:

    k
    k
    w
    w
    w
    i
    0.1
    a
    0.7
    y

This example puzzle.cmds file sets PUZZLE to 'distance matrix analysis only' mode (the 2 k's), with a gamma+invariable sites model of rate heterogeneity (the 3 w's), a user-defined proportion of invariable sites (i) of 0.1 and a gamma shape parameter (a) of 0.7. To make puzzleboot run, you MUST finish the puzzle.cmds file with a 'y' (this tells PUZZLE to go ahead with the settings given).

To familiarize yourself with the options that are possible and/or desirable to put in your puzzle.cmds file, you should launch the PUZZLE program and play around with the various menu options.

Once you have created your puzzle.cmds file, run puzzleboot (if puzzleboot is in your directory or in your path) by typing:

    puzzleboot filename

where filename is the name of the file with the bootstrapped datasets generated by SEQBOOT.

IMPORTANT NOTE: Before running puzzleboot you should change the name of the SEQBOOT output file from 'outfile' to something else, since puzzleboot is programmed to delete files called outfile during its operation. Also, when it runs, it temporarily generates a large number of files in your directory. Do not delete these - when the program is finished, it will clean up after itself.

Puzzleboot output:

The output of puzzleboot is a file called:

    filename.outdist

where filename is the original filename that you fed the program. This file contains your N distance matrices back-to-back. To make trees from these distance matrices you can use the PHYLIP programs FITCH or NEIGHBOR with the multiple datasets option (as you normally would in any PHYLIP distance analysis). After you have done this, use CONSENSE to build your bootstrap majority-rule consensus tree.

Final notes:

You do not need to tell the program how many bootstrap analyses you have done - it calculates this automatically. The program also has restart capability.
So if your run is interrupted by a computer crash, simply type 'puzzleboot filename' again the next time you log in, and the program will continue where it left off.

If you have questions please ask Andrew Roger (roger@is.dal.ca) or Mike Holder (holder@uh.edu).

Good luck!

Acknowledgement:

This work was originally conceived and developed while both of us were in the employ of Mitch Sogin at the Marine Biological Laboratory, Woods Hole, MA 02543 USA.
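
Example session (sketch):

As a recap of the steps described under 'How to use it', the following is a minimal sketch of a puzzleboot session, not part of the puzzleboot distribution. The directory name pboot_run and the input name boot.seqs are hypothetical placeholders, and the sketch assumes that puzzleboot is in your path and that you have already copied your renamed SEQBOOT outfile into the run directory. If either is missing, the script only writes puzzle.cmds and stops.

```shell
#!/bin/sh
# Sketch only: pboot_run and boot.seqs are placeholder names, not part of puzzleboot.
set -eu

# Work in a separate directory, because puzzleboot deletes any
# outdist, outfile or outtree it finds in the current directory.
workdir=pboot_run
mkdir -p "$workdir"
cd "$workdir"

# One menu keystroke (or value) per line; the final 'y' accepts the settings.
# These are the example settings from this README.
printf '%s\n' k k w w w i 0.1 a 0.7 y > puzzle.cmds

# Refuse to continue if the command file does not end with 'y'.
last=$(tail -n 1 puzzle.cmds)
if [ "$last" != "y" ]; then
    echo "puzzle.cmds must end with y" >&2
    exit 1
fi

# The SEQBOOT output must be renamed from 'outfile' before the run.
infile=boot.seqs   # hypothetical name for the renamed SEQBOOT outfile

if command -v puzzleboot >/dev/null 2>&1 && [ -f "$infile" ]; then
    puzzleboot "$infile"
    echo "distance matrices should now be in $infile.outdist"
else
    echo "puzzleboot or $infile not found; only puzzle.cmds was written"
fi
```

The resulting boot.seqs.outdist would then be given to NEIGHBOR or FITCH with the multiple datasets option, followed by CONSENSE, as described above.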