Phrap package over view

/*****************************************************************************
#   Copyright (C) 1993-1996 by Phil Green.                          
#   All rights reserved.                           
#                                                                           
#   This software is part of a beta-test version of the swat/cross_match/phrap 
#   package.  It should not be redistributed or
#   used for any commercial purpose, including commercially funded
#   sequencing, without written permission from the author and the
#   University of Washington.
#   
#   This software is provided ``AS IS'' and any express or implied
#   warranties, including, but not limited to, the implied warranties of
#   merchantability and fitness for a particular purpose, are disclaimed.
#   In no event shall the authors or the University of Washington be
#   liable for any direct, indirect, incidental, special, exemplary, or
#   consequential damages (including, but not limited to, procurement of
#   substitute goods or services; loss of use, data, or profits; or
#   business interruption) however caused and on any theory of liability,
#   whether in contract, strict liability, or tort (including negligence
#   or otherwise) arising in any way out of the use of this software, even
#   if advised of the possibility of such damage.
#                                                                         
#*****************************************************************************/

SWAT, CROSS_MATCH
 Swat and cross_match are programs for rapid protein and nucleic acid
sequence comparison and database search, based on an efficient
implementation of the Smith-Waterman-Gotoh algorithm (Smith & Waterman
1981, Gotoh 1982).  By revising the recursion relations slightly and
making efficient use of word-packing, we have been able to
significantly reduce the number of machine instructions executed per
Smith-Waterman matrix cell, resulting in a speed enhancement of 2-3
fold over the previous best implementation (Pearson and Miller, 1992).
Swat is about 1 / 10 as fast as BLAST, making it quite adequate for
performing database searches on a workstation.  Statistical
significance of the hits is evaluated based on the empirical score
distribution for the search, taking into account dependence of the
score variance and mean on the length of the database sequence.

 Cross_match uses the same algorithm as swat, but in addition allows
the comparison of a pair of sequences to be constrained to bands of
the Smith-Waterman matrix that surround one or more matching words in
the sequences. This substantially increases speed for large-scale
nucleotide sequence comparisons without significantly compromising
sensitivity.  We have found cross_match useful for the following
tasks: comparing a set of reads to a set of vector sequences to
produce vector-masked versions of the reads; comparing contig
sequences found by two alternative assembly procedures (e.g.  phrap
and xbap) to each other; and comparing assembled contigs to the final
edited cosmid sequence, in retrospective studies of the accuracy of
fully automated assembly.  We also now routinely use cross_match to
find repeats in finished cosmid sequences and mask them out prior to
database searches.

 Swat and cross_match can also be used to search a profile against a
database.

PHRAP
Phrap is a program for shotgun sequence assembly.  Key features
include:
--Use of data quality information, both direct (from phred trace
analysis) and indirect (from pairwise read comparisons), to delineate
the likely accurate base calls in each read; this helps discriminate
repeats, permits use of the full reads in assembly, and allows a
highly accurate consensus sequence to be generated.  A probability of
error is computed for each consensus sequence position, which can be
used to focus human editing on particular regions, to automate
decision-making about where additional data are needed, and to provide
users of the final sequence with information about local variations in
quality.

--The full length of each read (or more precisely, as much as is
accurate enough to align against other reads) is used in the assembly.
Other programs typically require that reads be trimmed fairly
conservatively to remove all low-quality data, since the algorithms
are such that inaccurate data would tend to cause correct overlaps to
be rejected; this necessitates excluding a significant amount of
useful (if slightly less accurate) sequence from the initial assembly,
which then must be brought in manually later to make additional contig
joins and help resolve ambiguities. Our approach minimizes or
eliminates the need to make manual joins, and perhaps more
importantly, by using all available data it increases the chance of
correctly distinguishing repeats during assembly.  Use of full reads
does complicate analysis somewhat due to the high error rate towards
the ends of the reads and the fact that read chimerism is more likely
to occur there; thus it is essential to use methods that allow for
this. In particular, we use local (Smith-Waterman) alignments for the
pairwise read comparisons, rather than global (Needleman-Wunsch).

-- There is no masking of repeats from known families, or other use of 
biological assumptions about the sequence.

-- Aberrant data, including likely deletion and chimeric reads, and
unremoved sequencing vector, are automatically identified, and treated
specially to avoid causing problems with assembly.

-- The contig sequence is constructed as a mosaic of the highest
quality parts of the reads, rather than as a ``consensus''.  This
avoids both the complex algorithmic issues associated with multiple
alignment methods, and problems that sometimes occur with these methods
causing the consensus to be less accurate than individual reads
at some positions. The sequence produced by phrap is in fact quite
accurate, with less than 1 error per 10 kb in typical datasets.

-- Information to provide indications of accuracy and uniqueness of
assembly is generated. Possible sites where misassembly may have
occurred, or where additional data may be required are flagged.

-- Large datasets can be assembled efficiently. On a Dec Alpha
3000/900 workstation (comparable in speed to a 200 Mhz Pentium PC),
cosmid datasets typically require an average of 2 minutes, and human
BACs 10 min. to several hours.  A 36,000 read bacterial genome
(assembled at Genome Therapeutics, Inc.) required < 2 hrs.  The main
factor limiting speed of assembly is the number of repeats in the
genomic target sequence, rather than the total number of reads, since
repeats disproportionately increase the number of Smith-Waterman
comparisons that need to be performed.



PHRAPVIEW 

Phrapview is a graphical tool that provides a "global" view of the
phrap assembly, complementary to the "local" view provided by the
sequence editor CONSED. It is written in perl/Tk; to run it you will
need to have installed on your system a recent version of perl (perl5
or later) that includes the Tk library. This is available for free
from a number of web sites. 

 Phrapview is invoked with the command

  phrapview filename 

where filename is the name of a ".view" file that was created by
running phrap with the -view option. 	

N.B. The format of the .view file is still under development. To
ensure that phrapview will read correctly the .view file produced by
phrap, make sure that both programs (phrap and phrapview) were part of
the same release.

The following information can be displayed:

(i) Basic information concerning the numbers of reads, singletons,
contigs, chimeras, etc. Each contig, and each chimeric read, is
represented by a horizontal black line proportional in length to the
contig or read size.  Note that although they are shown separately in this
display, chimeric reads in fact generally are incorporated by phrap
into one of the contigs, in a region that corresponds to one of the
two chimeric pieces; the location of this is indicated by the black
"chimera match" line connecting the chimera to a contig (see below).

The following information requires pressing an appropriate "button" to
display:

(ii) Depth of coverage.  Graphs (in yellow) above and below the contig
line indicate the depth of coverage (i.e. number of reads aligned
against the contig) on each strand at each position; the horizontal
orange lines correspond to a depth of 10. In addition to the ordinary
depth of coverage (which counts all reads in the contig, and is mainly
useful to indicate regions where an abnormal deficit or excess
occurs), the "reduced depth" can be displayed. This counts only reads
that are non-chimeric, have a positive LLR score against the contig,
and have no positive LLR scoring match to a read elsewhere in the
assembly -- i.e. the "confidently placed" reads. (See phrap.doc for a
description of LLR scores). Misjoins usually occur in places where the
reduced depth is 0 on both strands.
 Regions of the contig where the depth is 0 on one or both strands are
indicated in red.

(iii) "Matches" and "links". "Matches" are pairwise read matches
between reads in different parts of the assembly, and are indicated by
a single (curved or straight) line connecting the midpoints of the
two matching regions.  Contig matches (between two non-chimeric reads
in the assembled contigs) and chimera matches (between a chimeric read
and itself or another read) can be displayed separately. "Links"
connect the startpoints of two reads that derive from the same
subclone: forward-reverse links correspond to reads from opposite ends
of a subclone insert, while same-strand links correspond to walking or
duplicated reads in the same direction on the insert.  Links will only
be indicated properly if the read naming convention assumed by phrap
is used (see phrap documentation).
 Each match or link is considered to be in one of three classes,
indicated by color: "problems" (red), which indicate serious
discrepancies or a significant possibility of incorrect, incomplete,
or non-unique assembly; "ok" (black), which tend to confirm the
assembly or are otherwise consistent with it; and "grayzone" (blue or
green), which may in some cases indicate problems but are probably ok.
Links are considered "ok" if the two read starts meet the expected
spacing and orientation constraints, otherwise they are marked as
"problems" (in many cases however these reflect subclone tracking
errors rather than assembly errors).  Matches are considered "ok", and
are not shown, if they are between reads assembled in the same
location, or have LLR scores below the cutoff. Matches are considered
"problems" if they have LLR scores above the cutoff, are non-local,
and involve regions where the reduced depth is 0.  "Problem" matches
between or within large contigs often indicate the presence of
near-perfect repeats, which may or may not have been misassembled; or
(in some cases) a join that was missed.  "Problem" matches between a
small (e.g. singleton) contig and a large one typically represent
cases where a read could not be assembled into its proper location by
phrap, due generally to some anomaly involving it (e.g. an internal
region of low quality data).  Non-local matches with LLR scores above
the cutoff which do not lie in regions of reduced depth 0 are
considered to be "grayzone". Usually these involve isolated pairs of
reads each of which extends part way into a repeat, such that the
overlapping region is too small to contain high-quality discrepancies.
The fact that the reduced depth is non-zero means that there are other
reads in the same region which do not have any positive LLR scores to
the other region, so it is unlikely that there is a significant
misassembly in such cases, although it is possible that one of the two
matching reads may have been incorrectly placed.
 Chimeras generally have one "ok" match (to the site where the read
was assembled) and one or more "problem" matches to the region from
which the other part of the chimera derives (sometimes these matches
all have negative LLR scores and are thus not shown with the default
LLR cutoff). Some chimeras actually represent subclones with internal
deletions; these tend to be obvious since the two pieces map to nearby
locations in some contig and are in the same orientation there.
 Double-headed arrows on matches indicate that the matching regions
are on opposite strands; double-headed arrows on links indicate that
the orientation of the two reads is inconsistent with the read names
(i.e. forward-reverse pairs that point in the same direction, or same
strand pairs that point in opposite directions; these are colored as
"problems" if the reads are in the same contig).

(iv) Low quality contig bases (bases having phrap qualities that are
lower than a specified threshold), and high quality discrepancies
(positions where there is an aligned read discrepant with the contig
base and having a phrap quality that exceeds the threshold). A
horizontal green line above the contig indicates the quality
threshhold value ("qual cutoff"). Low quality contig bases are
indicated by a vertical blue line drawn from a height corresponding to
the contig quality value, up to the threshhold line; high quality
discrepancies are indicated by a vertical blue line drawn from a
height corresponding to the discrepant read's quality, down to the
threshhold line.  Thus in each case longer lines are more "serious"
(reflect a larger deviation from the threshhold, in the wrong
direction).

  Passing the mouse pointer (slowly!) over an item of types ii-iv
above results in its being highlighted (converted to yellow); it
remains highlighted until the pointer is moved over another item, or a
contig line.  Textual information about the highlighted item is shown
in a box in the top right region of the display.
 Several parameters can be changed by entering new values in
appropriate boxes at the bottom of the display. Four different
magnifications can be changed: horizontal magnification; the "spacing
magnification" which determines the vertical spacing between contigs;
and the depth and quality magnifications, which affect the scales used
for coverage depth and quality. Values of each of these are specified
as percents of the original (startup) magnification.  The role of LLR
cutoff, and qual cutoff are described above.  The unalign parameter
refers to the size of the largest segment involved in the pairwise
read match that is unaligned against the contig sequence. If large,
such segments may indicate a misplaced read or (occasionally) an
otherwise undetected chimera.  Min fwd-rev and max fwd-rev refer to
the minimum and maximum allowed separation (in bp) between (the starts
of) forward-reverse read pairs; i.e. these represent the minimum and
maximum allowed subclone insert size.  Separations outside this range
(or in the wrong orientation) are marked as problems. Min ss and max
ss are the minimum and maximum spacing between same-strand reads (i.e.
size of a walking step).
 The display is updated to reflect changes made in the above
parameters only after a relevant button (Show Contig Matches, etc.) is
pressed.

 Colors can be changed by altering the variable definitions that occur
near the beginning of the phrapview source file, and re-running the
program.