*|***************************************************************************|*
*|                                                                           |*
*|  Program: phred                                                           |*
*|  Version: 0.961028                                                        |*
*|                                                                           |*
*|  Copyright (C) 1993-1996 by Phil Green and Brent Ewing.                   |*
*|  All rights reserved.                                                     |*
*|                                                                           |*
*|  This software is a beta-test version of the phred package.               |*
*|  It should not be redistributed or used for any commercial                |*
*|  purpose, including commercially funded sequencing, without               |*
*|  written permission from the author and the University of                 |*
*|  Washington.                                                              |*
*|                                                                           |*
*|  This software is provided ``AS IS'' and any express or                   |*
*|  implied warranties, including, but not limited to, the                   |*
*|  implied warranties of merchantability and fitness for a                  |*
*|  particular purpose, are disclaimed. In no event shall                    |*
*|  the authors or the University of Washington be liable for                |*
*|  any direct, indirect, incidental, special, exemplary, or                 |*
*|  consequential damages (including, but not limited to,                    |*
*|  procurement of substitute goods or services; loss of use,                |*
*|  data, or profits; or business interruption) however caused               |*
*|  and on any theory of liability, whether in contract, strict              |*
*|  liability, or tort (including negligence or otherwise)                   |*
*|  arising in any way out of the use of this software, even                 |*
*|  if advised of the possibility of such damage.                            |*
*|                                                                           |*
*|  Portions of the code benefit from ideas due to Dave Ficenec,             |*
*|  LaDeana Hillier, Mike Wendl, and Tim Gleeson. These are                  |*
*|  indicated in the relevant source files.                                  |*
*|                                                                           |*
*|***************************************************************************|*

PHRED Documentation
-------------------

1. Introduction.

  Phred reads DNA sequencer trace data, calls bases, assigns quality
  values to the bases, and writes the base calls and quality values to
  output files. Phred can read trace data from SCF files and ABI model
  373 and 377 DNA sequencer chromat files, automatically detecting the
  file format. It automatically uncompresses compressed data files too.
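
  As an illustration of the automatic format detection described above,
  a file's type can often be guessed from its leading "magic" bytes.
  The Python sketch below is not phred's own code, and the function
  names sniff_format and sniff_file are hypothetical. It assumes only
  the widely documented signatures: gzip files start with the bytes
  0x1f 0x8b, "compress" files with 0x1f 0x9d, SCF files with the ASCII
  text ".scf", and ABI files with "ABIF". It checks offset 0 only.

```python
def sniff_format(header: bytes) -> str:
    """Guess a trace file's type from its first bytes (hypothetical helper)."""
    if header.startswith(b"\x1f\x8b"):
        return "gzip"        # gzip magic number
    if header.startswith(b"\x1f\x9d"):
        return "compress"    # compress (.Z) magic number
    if header.startswith(b".scf"):
        return "scf"         # SCF files begin with the ASCII bytes ".scf"
    if header.startswith(b"ABIF"):
        return "abi"         # ABI chromat files begin with "ABIF"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first four bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(4))


if __name__ == "__main__":
    # Classify a few in-memory headers instead of real files.
    for sample in (b".scf\x00\x00", b"ABIF\x00\x65", b"\x1f\x8b\x08", b"junk"):
        print(sample[:4], sniff_format(sample))
```

  A compressed file would first be uncompressed and then sniffed again
  to learn whether it holds SCF or ABI data.
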
  After calling bases, phred writes the sequences to files in FASTA
  format, the format suitable for XBAP, PHD format, or SCF format.
  Quality values for the bases are written to FASTA format files or PHD
  files, which the phrap sequence assembly program can use to increase
  the accuracy of the assembled sequence.

2. Acknowledgements.

  Phred benefits from ideas developed by LaDeana Hillier, Mike Wendl,
  Dave Ficenec, Tim Gleeson, and Alan Blanchard.

3. Algorithms.

  Phred uses simple Fourier methods to examine the four base traces in
  the region surrounding each point in the data set in order to predict
  a series of evenly spaced peak locations. That is, it determines
  where the peaks would be centered if there were no compressions,
  dropouts, or other factors shifting the peaks from their "true"
  locations.

  Next, phred examines each trace to find the centers of the actual, or
  observed, peaks and the areas of these peaks relative to their
  neighbors. The peaks are detected independently along each of the
  four traces, so many peaks overlap. A dynamic programming algorithm
  is used to match the observed peaks detected in the second step with
  the predicted peak locations found in the first step.

4. Building and installing.

  The INSTALL file describes the steps for building and installing
  phred.

5. Running phred.

  Phred uses command line options to control input, processing, and
  output. Each command line option begins with a dash, "-". For
  example, let us say you want to process a group of chromat files in
  the directory "chromat_dir". For each chromat file you want phred to
  call the bases, append the base calls to a file named "seqs_fasta",
  and append the quality values to another file named "qual_fasta".
  You would use the command

    $ phred -id chromat_dir -sa seqs_fasta -qa qual_fasta

  Compressed chromat files: phred checks whether or not the file was
  compressed by either "gzip" or "compress".
  If the file was not compressed, phred reads and processes it
  immediately. If it was compressed, phred creates a symbolic link to
  the compressed file in the temporary directory and uncompresses the
  file into the temporary directory without deleting the compressed
  version. Phred then reads, processes, and deletes the uncompressed
  file.

  The command line options are

  Input Options
  -------------

    -id <directory name>
        Read and process files in <directory name>.

    -if <file name>
        Read and process files listed in the file <file name>. Each
        line in <file name> must specify a valid path to a single
        input file.

    -zd <directory name>
        Location of compression program. If -zd is omitted, phred uses
        the current path to find the compression program.

    -zt <directory name>
        Directory where chromat is uncompressed. If -zt is omitted,
        phred uses /usr/tmp. When phred processes a compressed file,
        it first creates a symbolic link to the compressed file in
        this temporary directory before it uncompresses the file and
        reads it. It subsequently deletes the symbolic link and the
        uncompressed file in the temporary directory.

  Processing Options
  ------------------

    -nocall
        Disable phred base calling and set the current sequence to
        the ABI base calls that are read from the input file. By
        default, the current sequence is set to the phred base calls.
        This affects the base trimming and output options.

    -trim <enzyme sequence>
        Perform sequence trimming on the current sequence. Bases are
        trimmed from the start and end of the sequence on the basis of
        trace quality. In addition, <enzyme sequence> specifies a base
        sequence that is used to trim bases off the start of the
        current sequence. You can specify a NULL enzyme sequence using
        empty double quotes, "".

  Output Options
  --------------

    -st fasta
        Set the output sequence file format to FASTA. (Default.)

    -st xbap
        Set the output sequence file format to XBAP.

    -s
        Write sequence output files with the names obtained by
        appending ".seq" to the names of the input files, and store
        them in the directory where phred is running.

    -s <file name>
        Write a sequence output file with the name <file name>.
        This option is valid for a single input file only.

    -sd <directory name>
        Write sequence output files with the names obtained by
        appending ".seq" to the names of the input files, and write
        them in the directory <directory name>.

    -sa <file name>
        Write a sequence output file in FASTA format with the name
        <file name>. The file contains the base calls of all the
        reads processed in this run of phred.

    -qt fasta
        Set the output quality file format to FASTA. Quality values
        of trimmed-off bases are set to zero. (Default.)

    -qt xbap
        Set the output quality file format to XBAP. Quality values of
        trimmed-off bases are omitted.

    -qt mix
        Set the output quality file format to FASTA. Quality values
        are written for all bases, including trimmed-off bases.

    -q
        Write quality output files with the names obtained by
        appending ".qual" to the names of the input files, and store
        them in the directory where phred is running. This option is
        valid for FASTA format output files only.

    -q <file name>
        Write a quality output file with the name <file name>. This
        option is valid for a single input file and a FASTA format
        output file only.

    -qd <directory name>
        Write quality output files with the names obtained by
        appending ".qual" to the names of the input files, and store
        them in the directory <directory name>.

    -qa <file name>
        Write a quality output file in FASTA format with the name
        <file name>. The file contains the quality values of all the
        reads processed in this run of phred.

    -qr
        Write a histogram of the number of high quality bases per
        read. This is meaningful when phred processes more than one
        read.

    -c
        Write SCF files with the trace data, the base calls of the
        current sequences, and the positions of the base calls. The
        SCF files have the names of the input files. Phred refuses to
        write an SCF file into the directory in which the input file
        resides.

    -c <file name>
        Write an SCF file with the trace data, the base calls of the
        current sequence, and the positions of the base calls. The
        SCF file has the name <file name>. This option is valid for a
        single input file only.

    -cd <directory name>
        Write SCF files with the trace data, the base calls of the
        current sequences, and the positions of the base calls. The
        SCF files are written in the directory <directory name> and
        have the same names as the input files.

    -cp
        Store SCF trace data as 1 or 2 byte values. Defaults to 1
        when the maximum trace value is less than 256, or to 2 when
        the maximum trace value is greater than or equal to 256.

    -p
        Write a PHD file, which is used by the consed editor to
        display bases. A PHD file contains a set of comments used by
        consed for maintaining consistency between the chromat file,
        the .ace file, and the PHD file, and it contains base data as
        triples consisting of the base call, quality, and position.
        Phred always writes the first version of the PHD file for a
        read, which has the name <name>.phd.1. When a read is edited
        using consed, consed writes a new version of the PHD file;
        for example, the second version has the name <name>.phd.2.
        With the -p option, <name> is the name of the input file.

    -p <file name>
        Write a PHD file with the name <file name>.phd.1. This option
        is valid for processing a single input file.

    -pd <directory name>
        Write PHD files in directory <directory name>. The PHD files
        have the names <name>.phd.1 where <name> is the name of the
        input file.

    -d
        Write a data file that is used for detecting polymorphic
        bases. The file has the name <name>.poly where <name> is the
        name of the input file. The first line of the file consists
        of the sequence name, the smallest amplitude normalization
        factor, and the amplitude normalization factors for the A, C,
        G, and T traces. One line for each called base follows the
        header line. The information on each line consists of the
        called base, the position of the called base, the area of the
        called peak, the relative area of the called peak, the
        uncalled base, the position of the uncalled base, the area of
        the uncalled base, the relative area of the uncalled base,
        and the amplitudes of the four traces at the position of the
        called base.

    -dd <directory name>
        Write polymorphism data files in directory <directory name>.
        The files have the names <name>.poly where <name> is the name
        of the input file.

    -raw <sequence name>
        Write <sequence name> in the header of the sequence output
        file and the quality output file. By default, the name of the
        input file is written in the headers of these files. This
        option is valid for a single input file only.

    -log
        Make phred append a log entry describing the processing run
        to the file "phred.log".

  Miscellaneous
  -------------

    -h, -help
        Display a command line option summary.

    -V
        Display phred version.

  Examples
  --------

  If you plan to use phred base calls and base quality information as
  input to the phrap assembly program, run phred as follows. Let us say
  that you want to process the chromat files in subdirectory
  "chromat_dir". You want phred to write the base calls to a FASTA file
  named "seqs_fasta" and the base quality values to "seqs_fasta.qual".
  In this case you run phred with the options

    $ phred -id chromat_dir -sa seqs_fasta -qa seqs_fasta.qual

  Phred reads and processes each file in the "chromat_dir" directory,
  writing the sequences to "seqs_fasta" and the quality values to
  "seqs_fasta.qual".

  We recommend that you not use the trim option. Inaccurate bases
  called near the ends of the traces will not interfere with proper
  assembly.

  Subsequently you should screen out the vector in the sequences in
  "seqs_fasta" using cross_match:

    $ cross_match seqs_fasta vector.seq -minmatch 12 -minscore 20 -screen > screen.out

  This generates the screened sequence file "seqs_fasta.screen". Then
  move "seqs_fasta.qual" to "seqs_fasta.screen.qual" using the command

    $ mv seqs_fasta.qual seqs_fasta.screen.qual

  Run phrap to perform the sequence assembly as follows:

    $ phrap seqs_fasta.screen -ace > phrap.out

  Phrap writes the assembled contigs to the file
  "seqs_fasta.screen.contigs", and creates a .ace file that can be used
  for importing the assembly to xbap, CONSED, or ace-mbly for editing.

  Refer to the file "phrap.doc", which is part of the phrap
  distribution, for information on cross_match and phrap.

End: PHRED.DOC