/***************************************************************************** # Copyright (C) 1993-1996 by Phil Green. # All rights reserved. # # This software is part of a beta-test version of the swat/cross_match/phrap # package. It should not be redistributed or # used for any commercial purpose, including commercially funded # sequencing, without written permission from the author and the # University of Washington. # # This software is provided ``AS IS'' and any express or implied # warranties, including, but not limited to, the implied warranties of # merchantability and fitness for a particular purpose, are disclaimed. # In no event shall the authors or the University of Washington be # liable for any direct, indirect, incidental, special, exemplary, or # consequential damages (including, but not limited to, procurement of # substitute goods or services; loss of use, data, or profits; or # business interruption) however caused and on any theory of liability, # whether in contract, strict liability, or tort (including negligence # or otherwise) arising in any way out of the use of this software, even # if advised of the possibility of such damage. # #*****************************************************************************/ SWAT, CROSS_MATCH Swat and cross_match are programs for rapid protein and nucleic acid sequence comparison and database search, based on an efficient implementation of the Smith-Waterman-Gotoh algorithm (Smith & Waterman 1981, Gotoh 1982). By revising the recursion relations slightly and making efficient use of word-packing, we have been able to significantly reduce the number of machine instructions executed per Smith-Waterman matrix cell, resulting in a speed enhancement of 2-3 fold over the previous best implementation (Pearson and Miller, 1992). Swat is about 1 / 10 as fast as BLAST, making it quite adequate for performing database searches on a workstation. Statistical significance of the hits is evaluated based on the empirical score distribution for the search, taking into account dependence of the score variance and mean on the length of the database sequence. Cross_match uses the same algorithm as swat, but in addition allows the comparison of a pair of sequences to be constrained to bands of the Smith-Waterman matrix that surround one or more matching words in the sequences. This substantially increases speed for large-scale nucleotide sequence comparisons without significantly compromising sensitivity. We have found cross_match useful for the following tasks: comparing a set of reads to a set of vector sequences to produce vector-masked versions of the reads; comparing contig sequences found by two alternative assembly procedures (e.g. phrap and xbap) to each other; and comparing assembled contigs to the final edited cosmid sequence, in retrospective studies of the accuracy of fully automated assembly. We also now routinely use cross_match to find repeats in finished cosmid sequences and mask them out prior to database searches. Swat and cross_match can also be used to search a profile against a database. PHRAP Phrap is a program for shotgun sequence assembly. Key features include: --Use of data quality information, both direct (from phred trace analysis) and indirect (from pairwise read comparisons), to delineate the likely accurate base calls in each read; this helps discriminate repeats, permits use of the full reads in assembly, and allows a highly accurate consensus sequence to be generated. A probability of error is computed for each consensus sequence position, which can be used to focus human editing on particular regions, to automate decision-making about where additional data are needed, and to provide users of the final sequence with information about local variations in quality. --The full length of each read (or more precisely, as much as is accurate enough to align against other reads) is used in the assembly. Other programs typically require that reads be trimmed fairly conservatively to remove all low-quality data, since the algorithms are such that inaccurate data would tend to cause correct overlaps to be rejected; this necessitates excluding a significant amount of useful (if slightly less accurate) sequence from the initial assembly, which then must be brought in manually later to make additional contig joins and help resolve ambiguities. Our approach minimizes or eliminates the need to make manual joins, and perhaps more importantly, by using all available data it increases the chance of correctly distinguishing repeats during assembly. Use of full reads does complicate analysis somewhat due to the high error rate towards the ends of the reads and the fact that read chimerism is more likely to occur there; thus it is essential to use methods that allow for this. In particular, we use local (Smith-Waterman) alignments for the pairwise read comparisons, rather than global (Needleman-Wunsch). -- There is no masking of repeats from known families, or other use of biological assumptions about the sequence. -- Aberrant data, including likely deletion and chimeric reads, and unremoved sequencing vector, are automatically identified, and treated specially to avoid causing problems with assembly. -- The contig sequence is constructed as a mosaic of the highest quality parts of the reads, rather than as a ``consensus''. This avoids both the complex algorithmic issues associated with multiple alignment methods, and problems that sometimes occur with these methods causing the consensus to be less accurate than individual reads at some positions. The sequence produced by phrap is in fact quite accurate, with less than 1 error per 10 kb in typical datasets. -- Information to provide indications of accuracy and uniqueness of assembly is generated. Possible sites where misassembly may have occurred, or where additional data may be required are flagged. -- Large datasets can be assembled efficiently. On a Dec Alpha 3000/900 workstation (comparable in speed to a 200 Mhz Pentium PC), cosmid datasets typically require an average of 2 minutes, and human BACs 10 min. to several hours. A 36,000 read bacterial genome (assembled at Genome Therapeutics, Inc.) required < 2 hrs. The main factor limiting speed of assembly is the number of repeats in the genomic target sequence, rather than the total number of reads, since repeats disproportionately increase the number of Smith-Waterman comparisons that need to be performed. PHRAPVIEW Phrapview is a graphical tool that provides a "global" view of the phrap assembly, complementary to the "local" view provided by the sequence editor CONSED. It is written in perl/Tk; to run it you will need to have installed on your system a recent version of perl (perl5 or later) that includes the Tk library. This is available for free from a number of web sites. Phrapview is invoked with the command phrapview filename where filename is the name of a ".view" file that was created by running phrap with the -view option. N.B. The format of the .view file is still under development. To ensure that phrapview will read correctly the .view file produced by phrap, make sure that both programs (phrap and phrapview) were part of the same release. The following information can be displayed: (i) Basic information concerning the numbers of reads, singletons, contigs, chimeras, etc. Each contig, and each chimeric read, is represented by a horizontal black line proportional in length to the contig or read size. Note that although they are shown separately in this display, chimeric reads in fact generally are incorporated by phrap into one of the contigs, in a region that corresponds to one of the two chimeric pieces; the location of this is indicated by the black "chimera match" line connecting the chimera to a contig (see below). The following information requires pressing an appropriate "button" to display: (ii) Depth of coverage. Graphs (in yellow) above and below the contig line indicate the depth of coverage (i.e. number of reads aligned against the contig) on each strand at each position; the horizontal orange lines correspond to a depth of 10. In addition to the ordinary depth of coverage (which counts all reads in the contig, and is mainly useful to indicate regions where an abnormal deficit or excess occurs), the "reduced depth" can be displayed. This counts only reads that are non-chimeric, have a positive LLR score against the contig, and have no positive LLR scoring match to a read elsewhere in the assembly -- i.e. the "confidently placed" reads. (See phrap.doc for a description of LLR scores). Misjoins usually occur in places where the reduced depth is 0 on both strands. Regions of the contig where the depth is 0 on one or both strands are indicated in red. (iii) "Matches" and "links". "Matches" are pairwise read matches between reads in different parts of the assembly, and are indicated by a single (curved or straight) line connecting the midpoints of the two matching regions. Contig matches (between two non-chimeric reads in the assembled contigs) and chimera matches (between a chimeric read and itself or another read) can be displayed separately. "Links" connect the startpoints of two reads that derive from the same subclone: forward-reverse links correspond to reads from opposite ends of a subclone insert, while same-strand links correspond to walking or duplicated reads in the same direction on the insert. Links will only be indicated properly if the read naming convention assumed by phrap is used (see phrap documentation). Each match or link is considered to be in one of three classes, indicated by color: "problems" (red), which indicate serious discrepancies or a significant possibility of incorrect, incomplete, or non-unique assembly; "ok" (black), which tend to confirm the assembly or are otherwise consistent with it; and "grayzone" (blue or green), which may in some cases indicate problems but are probably ok. Links are considered "ok" if the two read starts meet the expected spacing and orientation constraints, otherwise they are marked as "problems" (in many cases however these reflect subclone tracking errors rather than assembly errors). Matches are considered "ok", and are not shown, if they are between reads assembled in the same location, or have LLR scores below the cutoff. Matches are considered "problems" if they have LLR scores above the cutoff, are non-local, and involve regions where the reduced depth is 0. "Problem" matches between or within large contigs often indicate the presence of near-perfect repeats, which may or may not have been misassembled; or (in some cases) a join that was missed. "Problem" matches between a small (e.g. singleton) contig and a large one typically represent cases where a read could not be assembled into its proper location by phrap, due generally to some anomaly involving it (e.g. an internal region of low quality data). Non-local matches with LLR scores above the cutoff which do not lie in regions of reduced depth 0 are considered to be "grayzone". Usually these involve isolated pairs of reads each of which extends part way into a repeat, such that the overlapping region is too small to contain high-quality discrepancies. The fact that the reduced depth is non-zero means that there are other reads in the same region which do not have any positive LLR scores to the other region, so it is unlikely that there is a significant misassembly in such cases, although it is possible that one of the two matching reads may have been incorrectly placed. Chimeras generally have one "ok" match (to the site where the read was assembled) and one or more "problem" matches to the region from which the other part of the chimera derives (sometimes these matches all have negative LLR scores and are thus not shown with the default LLR cutoff). Some chimeras actually represent subclones with internal deletions; these tend to be obvious since the two pieces map to nearby locations in some contig and are in the same orientation there. Double-headed arrows on matches indicate that the matching regions are on opposite strands; double-headed arrows on links indicate that the orientation of the two reads is inconsistent with the read names (i.e. forward-reverse pairs that point in the same direction, or same strand pairs that point in opposite directions; these are colored as "problems" if the reads are in the same contig). (iv) Low quality contig bases (bases having phrap qualities that are lower than a specified threshold), and high quality discrepancies (positions where there is an aligned read discrepant with the contig base and having a phrap quality that exceeds the threshold). A horizontal green line above the contig indicates the quality threshhold value ("qual cutoff"). Low quality contig bases are indicated by a vertical blue line drawn from a height corresponding to the contig quality value, up to the threshhold line; high quality discrepancies are indicated by a vertical blue line drawn from a height corresponding to the discrepant read's quality, down to the threshhold line. Thus in each case longer lines are more "serious" (reflect a larger deviation from the threshhold, in the wrong direction). Passing the mouse pointer (slowly!) over an item of types ii-iv above results in its being highlighted (converted to yellow); it remains highlighted until the pointer is moved over another item, or a contig line. Textual information about the highlighted item is shown in a box in the top right region of the display. Several parameters can be changed by entering new values in appropriate boxes at the bottom of the display. Four different magnifications can be changed: horizontal magnification; the "spacing magnification" which determines the vertical spacing between contigs; and the depth and quality magnifications, which affect the scales used for coverage depth and quality. Values of each of these are specified as percents of the original (startup) magnification. The role of LLR cutoff, and qual cutoff are described above. The unalign parameter refers to the size of the largest segment involved in the pairwise read match that is unaligned against the contig sequence. If large, such segments may indicate a misplaced read or (occasionally) an otherwise undetected chimera. Min fwd-rev and max fwd-rev refer to the minimum and maximum allowed separation (in bp) between (the starts of) forward-reverse read pairs; i.e. these represent the minimum and maximum allowed subclone insert size. Separations outside this range (or in the wrong orientation) are marked as problems. Min ss and max ss are the minimum and maximum spacing between same-strand reads (i.e. size of a walking step). The display is updated to reflect changes made in the above parameters only after a relevant button (Show Contig Matches, etc.) is pressed. Colors can be changed by altering the variable definitions that occur near the beginning of the phrapview source file, and re-running the program.