Phrap Documentation

/*****************************************************************************
# Copyright (C) 1994-1996 by Phil Green.
# All rights reserved.
#
# This software is part of a beta-test version of the swat/cross_match/phrap
# package. It should not be redistributed or
# used for any commercial purpose, including commercially funded
# sequencing, without written permission from the author and the
# University of Washington.
#
# This software is provided ``AS IS'' and any express or implied
# warranties, including, but not limited to, the implied warranties of
# merchantability and fitness for a particular purpose, are disclaimed.
# In no event shall the authors or the University of Washington be
# liable for any direct, indirect, incidental, special, exemplary, or
# consequential damages (including, but not limited to, procurement of
# substitute goods or services; loss of use, data, or profits; or
# business interruption) however caused and on any theory of liability,
# whether in contract, strict liability, or tort (including negligence
# or otherwise) arising in any way out of the use of this software, even
# if advised of the possibility of such damage.
#
#*****************************************************************************/

Documentation for:

phrap ("phragment assembly program", or "phil's revised assembly
program") -- a program for assembling shotgun DNA sequence data. Key
features: allows use of entire read (not just highest quality part);
uses a combination of user-supplied and internally computed data
quality information to improve accuracy of assembly in the presence of
repeats; constructs contig sequence as a mosaic of the highest quality
parts of reads (rather than a consensus).
N.B. phrap does not provide editing or viewing capabilities; these are
available with consed or phrapview.

cross_match -- a general-purpose utility (based on SWAT, an efficient
implementation of the Smith-Waterman algorithm) for comparing any two
sets of (long or short) DNA sequences. For example, it can be used to
compare a set of reads to a set of vector sequences and produce
vector-masked versions of the reads; a set of cDNA sequences to a set
of cosmids; contig sequences found by two alternative assembly
procedures (e.g. phrap and xbap) to each other; or phrap contigs to
the final edited cosmid sequence. It is slower but more sensitive
(because it allows gaps) than BLASTN.

I) INPUT FILES are of two types: sequence files (required), and
quality files (optional).

Ia) Sequence files. With both phrap and cross_match, either a single
sequence file or multiple sequence files can be specified. If a single
file is specified, all sequences in it are compared to each other; if
more than one file is specified, sequences in the first file are
compared to sequences in the second (and subsequent) files. Typically
phrap is used with a single input sequence file containing the reads,
unless one wants to assemble one set of reads "against" another
sequence or set of sequences (see below). Phrap is designed to be
able to use all of each read sequence in the assembly, not just the
trimmed (highest quality) part, so the full sequence of each read
should be provided. Cross_match is generally used with two input
sequence files, the first containing the "query" sequences and the
second containing the "subject" sequences.

Sequence files must be in FASTA format: each sequence should have a
single header line with leading character '>' followed immediately
(i.e. with no intervening spaces) by the read name, and then
(optionally), on the same line, a space or spaces followed by
descriptive information about the read; the header line is followed by
a separate line or lines containing the sequence. The read names,
descriptive information, and sequence can be of arbitrary length (up
to a combined total size of about 1 Gb), since space for them is
allocated dynamically.

Read names: It is assumed by phrap that the part of the read name up
to the first '.', if any, is the name of the subclone from which the
read is derived; this is used, for example, in checking for chimeric
reads (for which purpose it is important to know when two reads come
from the same subclone, since another read from the same subclone
should not be regarded as independent confirmation of a join). IT IS
IMPORTANT THAT DIFFERENT READS FROM THE SAME CLONE BE GIVEN NAMES THAT
ADHERE TO THIS CONVENTION -- OTHERWISE THEY WILL BE INTERPRETED
(INCORRECTLY) TO CONFIRM EACH OTHER, WHICH MAY ADVERSELY AFFECT
ASSEMBLY. Orientation of the read (in cases where data is acquired
from each end of a subclone insert) is assumed to be indicated by the
first letter following the '.', namely 's' or 'x' for forward reads,
and 'r' or 'y' for reverse reads. At present the read orientation is
used only for consistency checking following assembly, and not in the
assembly itself.
In addition, it is assumed that dye terminator reads are indicated by
an 'x' (for forward reads) or 'y' (for reverse reads) following the
'.', and that 's' and 'r' refer to dye primer reads. This is used in
the quality adjustment procedure (see below), where confirmation of a
non-dye-terminator read by a terminator read counts as significantly
as confirmation by an opposite-strand read.
If the read name ends in ".seq" the .seq is removed.

Descriptive information on the header line is not used by the program
at present, although it is read in, and is written back out to the .ace and
.singlets files.

Sequence: Non-alphabetic characters (including '*') in the sequence
lines are automatically stripped out when the file is read in, except
for '-' which is converted to 'N'; '>' must not appear anywhere within
the sequence since it is assumed to start the header of a new sequence
(even if it is not at the beginning of a line). Lower case letters are
converted to upper case on readin, so it is no longer possible (as in
previous phrap versions) to use case as a crude indication of quality.

Vector screening: Following creation of a FASTA file with the raw
read sequences, it is important to remove or screen out sequencing
(subclone) vector sequence. (It is unnecessary to remove the cloning
(e.g. cosmid) vector, unless the inserts from multiple overlapping
cosmids are being assembled simultaneously; however it is generally
useful to remove at least the central part of the cosmid vector, since
this then provides a natural point at which phrap can break the
circular cosmid sequence into a linear one). Unremoved sequencing
vector may cause reads to be identified as "possible chimeras", or
otherwise interfere with proper assembly. Some vector removal
programs may be unreliable at identifying vector sequences that are
rearranged or found in the lower quality part of the read, and I
recommend that you screen for sequencing vector using cross_match,
whether or not you have already screened them using another program.
First, create a FASTA file, say "vector", with all of the vector
sequences you want to screen for (I use one that contains all vector
sequences used in any of the ongoing sequencing projects, in case the
vector is misidentified in the current project). If your read file is
named "C05D11.reads" (for example), then run the command

cross_match C05D11.reads vector -minmatch 12 -minscore 20 -screen > screen.out

(This uses somewhat more sensitive parameter settings than the
cross_match defaults). The '-screen' option causes a file named
"C05D11.reads.screen" to be created, containing "vector masked"
versions of the original sequences: i.e. any region that matches any
part of a vector sequence is replaced by X's. This ".screen" file
should be used in all subsequent analyses. The output from
cross_match (which has been redirected to the file screen.out in the
above example) lists the matches that were found (see below).

N.B. If you have created a .qual file for your reads (say
C05D11.reads.qual), be sure to rename or copy it to
C05D11.reads.screen.qual so that it will be recognized by phrap when
C05D11.reads.screen is used as the input FASTA file.

Ib) Input of quality information. For each input sequence file in
FASTA format, you may optionally include a corresponding file of data
quality information. This is strongly recommended since it greatly
improves the accuracy of assembly and of the consensus sequence. The
name of this file should consist of the name of the FASTA file, with
".qual" appended.
The format of the .qual file is similar to that of the corresponding
FASTA file. For each read there should appear a header line identical
(except possibly for the "description" field) to that in the FASTA
file. This is followed by one or more lines giving the qualities for
each base. Quality values should be integers between 0 and 99
(inclusive), and should be separated by spaces. No other (non-digit,
non-white space) characters should appear. The total number of quality
values for the read must match the number of bases that appear in the
FASTA file. Also, the orders of reads in the FASTA and corresponding
.qual file must be identical. Bases 'N' or 'X' are automatically
assigned quality 0, whether or not a .qual file is provided.
If provided, quality information is used by phrap both in the
assembly itself (where discrepancies between high quality bases are
used to discriminate repeats from true overlaps) and in determining
the contig consensus sequence (which is formed as a mosaic of the
highest quality parts of the reads). Because phrap generates its own
"quality" measures (based on read-read confirmation) it performs
reasonably well even if no input quality is provided, but when
available the input data quality information can substantially
increase the accuracy of the consensus, particularly in regions where
there are strand-specific effects on quality (e.g. compressions) since
in such cases the input quality information may constitute the only
basis for choosing one strand over the other.
If your sequencing data is from ABI 373 or 377 automated sequencers
I strongly recommend that you generate the read sequences using the
basecalling program phred, which will produce an appropriate quality
file for input to phrap. Phred generally gives more accurate base
calls overall, for a longer distance, than the ABI software does. On
the basis of the trace characteristics, phred computes a probability p
of an error in the base call at each position, and converts this to a
quality value q using the transformation q = -10 log_10(p). Thus a
quality of 30 corresponds to an error probability of 1 / 1000, a
quality of of 40 to an error probability of 1 / 10000, etc. The
quality value of 99 should be reserved for base calls that have been
visually inspected and verified as "highly accurate" (during editing),
and 98 for bases that have been edited but are not highly accurate
(these are converted to quality 0 in phrap). If two reads appear to
match but have discrepancies involving bases that have quality 99 in
both reads, phrap does not allow them to be merged during assembly.
At present this is the only way to break false joins made by phrap.
If you wish to generate your own quality values you should be
prepared to experiment. It is particularly important that possible
insertion errors (which are fairly frequent late in the trace with ABI
basecalls) receive quality 0, because in choosing the "consensus"
phrap tends to favor the reads with the highest total quality in a
given region. Also, if at all possible you should use quality values
that are related to error probabilities in the manner described above,
since to some extent phrap is tuned to expect these (and its output
qualities will then have a similar interpretation).
If all input quality values are relatively small (less than 15),
phrap assumes that they do not correspond to error probabilities and
attempts to rescale them so that the largest quality value is
approximately 30. This allows different input scales to be used.

Phrap adjusted quality values/error probabilities: Phrap computes
adjusted quality values for each read on the basis of read-read
confirmation information, as follows. If a read is confirmed by an
opposite-strand or different-chemistry (dye terminator vs. dye primer)
read at a given position, that position is given a quality which is
the sum of the two input read qualities; when more than one
opposite-strand read confirms a given position, only the single
highest quality from all opposite-strand matching reads is used. If
qualities are related to error probabilities as described above, this
procedure can be interpreted as computing an error probability for
each base which is the product of the error probabilities for the two
reads; since error profiles for opposite-strand or different chemistry
reads are essentially independent, this is reasonable, although it is
somewhat conservative in that it does not take into account
same-strand matches (which clearly should count for something,
although they certainly are not independent).
Contig positions are assigned quality values equal to the highest
adjusted quality of any read at that position, and then adjusted
downward to take into account any discrepancies with other reads. As a
result, when phred input qualities are used, the output phrap
qualities associated to each contig position have a natural
interpretation as (conservative) error probabilities. Such error
probabilities provide an extremely useful guide to where editing or
additional data collection is needed.
The phrap adjusted qualities are used in computing "LLR scores" for
each pairwise match between two reads. These scores take into account
the qualities of the base calls in the reads, and are (approximate)
log-likelihood ratios for comparing the hypothesis that the reads
truly overlap to the hypothesis that they are from 95% similar
repeats. The point is that discrepancies between overlapping reads are
due to base-calling errors and thus tend to occur in low-quality
bases, whereas reads from different repeats can have high-quality
discrepancies that are due to sequence differences between the
repeats; and the probability of the observed data under each
hypothesis can be quantified using the interpretation of the phrap
qualities in terms of error probabilities. A pairwise match tends to
have a positive LLR score if the two reads overlap, whereas it tends
to have a negative LLR score if the two reads are from different
repeats (unless the repeats are nearly identical).

II) RUNNING phrap. The command line should be of the form

phrap seq_file_name1 [seq_file_name2 ...] [-option value] [-option value] ...

The output is extensive and should be redirected to a file.

If more than one seq_file_name is provided, then reads in the first
file are assembled only against sequences in the 2d (and later) files;
no sequences are compared to other sequences in the same file. In this
case, many features of phrap (e.g. checking for anomalies, vector,
etc.) which rely on comprehensive read-read comparisons are turned
off, SWAT scores are used in place of LLR_scores, and implied merges
are not performed -- i.e. each read in the first file is merged with at
most one sequence in the 2d file. This usage of phrap is convenient
for aligning a set of reads against a known "finished" sequence in
order to look for polymorphisms, for example.

Example:

phrap C05D11.reads.screen -minmatch 12 -ace > phrap.out

The following options are available for both cross_match and phrap:

option name value default value

-penalty mismatch (substitution) penalty for SWAT comparisons -2
-gap_init gap initiating penalty for SWAT comparisons penalty - 2
-gap_ext gap extension penalty for SWAT comparisons penalty - 1
-minmatch minimum length of matching word to nucleate
SWAT comparison; 14
if minmatch = 0, a full (non-banded) comparison is done
-minscore minimum SWAT score 30
-bandwidth half band width for banded SWAT searches 14
(full width is 2 n + 1)
-matrix score matrix for SWAT comparisons (supersedes -penalty) [default has +1 for
matches, -penalty for mismatches]
-raw use raw rather than complexity-adjusted Smith-Waterman scores.
(default is to adjust scores to reflect complexity of matching sequence).
(This is a flag; no value should be provided).
-tags tags selected output lines, to facilitate parsing
(This is a flag; no value should be provided).
-indexwordsize size of indexing (hashing) words in words.c. The value of 10
this parameter has a (small) effect on run time and memory
usage.

The following options apply only to phrap:

-maxgap maximum gap (unmatched region) allowed in merging contigs 30
-trim_start no. of bases to be removed at beginning of each read 0
-vector_bound no. of bases at beginning of each read, matches within which
are assumed to be vector 60
-trim_penalty penalty used for identifying degenerate sequence
at beginning/end of read -2
-trim_score minimum score for identifying degenerate sequence
at beginning/end of read 20
-confirm_length minimum size of confirming segment (starts at 3d distinct 8
nuc following discrepancy)
-confirm_trim amount by which confirming segments are trimmed at edges 1
-confirm_penalty penalty used in aligning against "confirming" reads -5
-confirm_score minimum alignment score for a read to
be allowed to "confirm" part of another read 30
-qual_show LLR cutoff for flagging "low_quality" regions in 20
contig sequence and "high quality" discrepancies
between read and contig. Bases in the .contigs,
and .ace file are lower case iff their LLR-converted quality is below this value.
-node_seg minimum segment size (for purposes of traversing weighted directed graph) 8
-node_space spacing between nodes (in weighted directed graph) 4
-revise_greedy splits initial greedy assembly into pieces at "weak joins", and then tries
to reattach them to give higher overall score.
(This is a flag; no value should be provided).
-ace create ".ace" file with the assembly information padded reads, etc.
(This is a flag; no value should be provided).
-view create ".view" file suitable for input to phrapview
(This is a flag; no value should be provided).

The following options apply only to cross_match:

-masklevel controls which matches (with scores at least minscore) are 80
displayed. A match is not displayed if the percentage of
aligned query bases in the match that are contained within
the aligned bases of higher-scoring matches is masklevel or higher.
For example:
if masklevel = 0, only the single highest scoring match for
each query is displayed
100 any match that is not completely contained
within a higher scoring match is displayed
101 all matches are displayed.
-alignments display the alignment for each match (Warning -- may produce huge
amount of output!)
(This is a flag; no value should be provided).
-discrep_lists give list of discrepancies for each match
(This is a flag; no value should be provided).
-discrep_tables give table of discrepancies (by quality) for each match
(default is to display a single table, that combines data for
all matches)
(This is a flag; no value should be provided).
-screen create a ".screen" file in which regions in the first input sequence file
that match some region in the second (or later) file are replaced by X's.
(This is a flag; no value should be provided).

Normally you should just use the default values of the parameters; I
haven't systematically explored optimal values for these, but the
current ones appear to work reasonably well. To increase (potential)
sensitivity of the pairwise comparisons, at the cost of a longer run
time, try decreasing minmatch, minscore, and/or penalty, and
increasing bandwidth.

Phrap output files (with the exception of the standard output, all
output files are given names consisting of the input read file, with
an appropriate suffix (given below) appended):

1) .singlets file (named C05D11.reads.screen.singlets in the above
example): A FASTA file containing the singlet sequences (i.e. the
singleton reads with no match to any other read).

2) .contigs file (named C05D11.reads.screen.contigs in the above
example): A FASTA file containing the sequences (without pads)
of all contigs (upper case <=> quality is >= qual_show). These
include singleton contigs consisting of single reads with a match to
some other contig, but that couldn't be merged consistently with it.
Thus reads that would go into xbap's "failed file" instead are
considered separate contigs by phrap and appear with the other
contigs.

3) .contigs.qual file: The corresponding quality file (see above for
description of how phrap computes qualities for contig positions).

4) .log file: A summary of how the assembly proceeded, and how the
contig sequences were constructed as mosaics of the individual reads.
This is probably only of interest to me, for trouble-shooting
purposes.

5) .ace file: (Produced only when the -ace option is used). A ".ace"
formatted file containing (i) the sequences of the padded contigs and
reads; (ii) the locations of the reads with respect to the contig
sequence; (iii) information about how the contig sequence is pieced
together as a mosaic of the high quality parts of the reads; (iv)
an indication of where the reads can be "clipped" to remove low
quality sequence at either end; (v) qualities of the bases in the contig sequence.
Additional information will be added in the future.
Sequence information is indicated by a header line consisting of the
word DNA followed by the name of the sequence, followed by lines
giving the sequence. Sequences are padded (with a '*' character) so
that reads align directly against contigs, i.e. the alignments can be
reconstructed exactly using only the information in this file. The end
of the sequence is followed by a blank line.
For reads in reverse orientation with respect to the contig, the
sequence of the complement is given instead of the original sequence,
and the suffix ".comp" is appended to the name of the read.
The case (upper or lower) of bases may differ from that in
C05D11.reads.screen, reflecting adjustment of quality values by phrap
(upper case <=> LLR_quality is at least qual_show).

Examples:

DNA x72g12.s1.comp
ccaatgtgacttggataatggcgtaaggcttgactttgacgtcccagtt*
aa*ccctg*ctt*aaact*accttc*c*tta*tcctaaac*attaac*tt
cttagctcaa*t*cccattcca*gcgaatcttca*aactgtcaaactana
*tcgt*gaaatc*tcctccTGCTAGATtc*tcaaATCTGTCAATTGAAGA
TCGTAGTTACTTGTTCCTCTACACAGATTTGCTTTTTGAATCTCCTGCCA
TGATCGATGGAGTCTTGACATCTGCTGATGATGTTGCCAAACATTTCAC*
GAAAGAT*CTCATTGATCACTCTATTCAAGT*GGGAGTC*TCCGGTCTTT
ATGATCGATTTGTGAATCTTCGTATCAAAGTTGGAGCTGACAAGTATCCA
TTGCTTGCGAAATGGGCTCAAATTTTCACTCAGGGAGTCGTCTTCGATCC
TTCAAGAATTCATCAATGTGCTCAAAAGTTGGCTGGAGAAGCTCGTGATC
GGAAGAGAGATGGATGTACTGTGGCATCAACTGCAGTAGCTTCAATGGTT
TATGGAAAGAGTATGTTATTt

DNA Contig35
GATAAGataATAAtGGAAAATaGAAaccGGAaAaATaATAAaATaaTTTc
aGATcGcTGaAGAaGAaGaGAAGaGAATAGcAGccCaATGTGAGAAGCTC
GGcAAAAAAGGACTCGAAGaaGcGGGAAAGAGTCTgGAAGcTGCCATTCT
TGAGAACACTGcCAATcATcCaaGTGCTGAACTTTTGGATCAATTGATCG
TCAAAGATcTTGAAGCTTTTGATCGTTTCCCAGTTCAATCCCTGACTTCA
AACTCAcCTTCTC*TTACTCCTCAACAATCAACTTTCTTAGCTCAATT*C
CCATTCCACGCGAATCTTCACAACTGTCCAACTAAATTCGtGGAAATCTT

Clipping information for a read is indicated by a line consisting of
the word "Sequence" followed by the name of the read, followed by two
lines that each indicate the beginning and ending bases that remain
after low quality (quality 0) sequence is removed from each end of the sequence.
The line starting with the word "Clipping" gives the base coordinates with
respect to the unpadded sequence, while the one starting with
"Clipping*" refers to the padded sequence. If (and only if) the input
fasta file included a "description" field (on the header line), that
appears as a separate "Description" line following the clipping
information.

Example:

Sequence x72g12.s1.comp
Clipping 153 548
Clipping* 170 570
Description ALF run #5

The assembly information is indicated by a line consisting of the word
"Sequence" followed by the name of the contig sequence, followed by
lines giving the positions of the reads in this contig. For each read
there are two lines: one starting with the word "Assembled_from" and
giving the name of the read together with its starting and ending
positions with respect to the unpadded contig sequence; and one
starting with the word "Assembled_from*" which instead has positions
with respect to the padded contig sequence. (In both cases the
starting and ending positions are those of the first and last base of
the read, not those of the "clipped" part). Note that it is possible
for a read to overhang the end of the contig sequence, so that start
positions may be negative and end positions may be greater than the
number of the last base in the contig. Information about how the
contig sequence is pieced together is given in the form of
"Base_segment" lines: each such line gives the starting and ending
positions of the segment within the contig sequence, the name of the
read from which the sequence is taken, and the starting and ending
positions of the corresponding segment within that read.

Example:

Sequence Contig35
Assembled_from z14f12.s1 1 596
Assembled_from* z14f12.s1 1 602
Assembled_from z13f5.s1 58 566
Assembled_from* z13f5.s1 58 572
Assembled_from z25a11.s1 115 738
Assembled_from* z25a11.s1 115 744
Assembled_from z14d4.s1.comp 145 741
Assembled_from* z14d4.s1.comp 145 747
Base_segment* 350 412 z26g6.s1 70 132
Base_segment 343 347 z13h11.s1 119 123
Base_segment* 345 349 z13h11.s1 121 125
Base_segment 339 342 z26g6.s1 60 63
Base_segment* 341 344 z26g6.s1 61 64
Base_segment 332 338 z13f5.s1 275 281
Base_segment* 334 340 z13f5.s1 277 283
Base_segment 331 331 z26g6.s1 52 52
Base_segment* 333 333 z26g6.s1 53 53
Base_segment 222 330 z25a11.s1 108 216
Base_segment* 222 332 z25a11.s1 108 218
Base_segment 166 221 z13f5.s1 109 164
Base_segment* 166 221 z13f5.s1 109 164
Base_segment 142 165 z25a11.s1 28 51
Base_segment* 142 165 z25a11.s1 28 51
Base_segment 77 141 z14f12.s1 77 141
Base_segment* 77 141 z14f12.s1 77 141
Base_segment 61 76 z13f5.s1 4 19
Base_segment* 61 76 z13f5.s1 4 19
Base_segment 1 60 z14f12.s1 1 60
Base_segment* 1 60 z14f12.s1 1 60

Quality information is indicated by a line with the word BaseQuality
followed by the name of the contig or read; this is followed by a line
or lines giving the numerical quality values. An arbitrary number of
values may appear on each line. As with other data, the end of the
BaseQuality information for a given sequence is indicated by a blank
line.

Example:

BaseQuality Contig3
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 31 21 24 23 23 22 22 22 22 30 31 20 20 20 20 20 20 30 30 31 30 33 33 34 34 34 34 31 33 31 34 34 30 30 25
25 25 26 30 30 33 37 30 33 36 35 23 23 24 24 24 24 30 30 30 30 30 30 30 30 30 33 33 33 33 30 30 33 33 33 30 33 33 29 29 29 29 33 34 33 33 21 21 21 21
23 23 23 23 23 23 24 24 26 26 26 26 26 26 26 40 33 30 30 30 30 33 33 33 33 33 30 33 33 33 33 33 33 33 31 31 31 31 34 37 40 40 45 37 52 52 52 52 52 52
55 55 59 64 65 68 58 57 49 49 49 49 49 49 49 56 60 53 60 53 49 49 49 49 49 34 45 45 45 45 30 25 25 25 28 30 45 45 49 49 49 49 49 70 60 60 59 59 53 56
53 53 53 55 53 53 60 45 49 49 42 42 42 45 45 36 33 34 49 46 46 46 54 54 59 53 49 49 42 41 43 40 44 40 49 49 49 54 58 53 52 57 51 51 41 39 15 15 35 35

[N.B. MUCH OF THE FOLLOWING HAS NOT BEEN RECENTLY UPDATED AND MAY BE OBSOLETE].

6) standard output (redirected to phrap.out in the above example):
contains a list of the contigs and the reads they contain, with
information about how good the alignment of the read against the
contig sequence is. Also identifies matching regions within and
between contigs, suspect reads (probable deletions or chimeras),
indicates possible problems with assembly or contig sequence (e.g.
discrepancies between reads and contig sequence), and gives results of
the forward/reverse read consistency checks, There is quite a lot of
information in this output. It has the ultimate goal of giving a
complete picture of the final assembly, including possible sites of
assembly error; some of this information would probably be better
represented in graphical form.

The output begins by giving summary information about the parameter
settings and input data file(s) (sequence and quality). It then lists

(i) regions at beginnings or ends of reads which consist largely
(>50 %) of a single nucleotide, usually due to degenerate signal.
These are internally converted to N's, to prevent them from causing
spurious matches which could interfere with assembly. (Note that the
.pads file output by phrap will contain the N's rather than the
original sequence).

(ii) Exact and near-exact duplicate reads. Exact duplicates, which
probably represent duplicate entry of the same data, are excluded from
assembly. The near-exact duplicates (which are only listed if they
appear to come from different clones, according to the nomenclature
convention assumed by phrap) should be examined to make sure they
really are from different clones. If not, you are not using the
correct read nomenclature.

(iii) probable unremoved sequencing vector, detected as regions at
beginnings of reads which match only the beginnings of other reads.
This method is not foolproof and should not be considered a substitute
for screening out sequencing vector using cross_match prior to running
phrap. The unremoved vector may simply be inaccurate vector sequence,
that did not match the correct sequence in the original cross_match
screen but does match other reads in which similar errors occur. In
any case, since the procedure in phrap is not foolproof, you should be
sure to add the (potentially erroneous) sequence that is displayed in
the phrap output to your vector file, and redo the cross_match vector
screen, to increase the likelihood that you catch all of the vector.
If you find examples where sequence that is not vector is erroneously
being labelled as such by phrap, please let me know.

(iv) reads which have off-diagonal matches to themselves (due e.g.
to short tandem repeats), in the same orientation. This information
is ignored at present, but ultimately will be used to improve assembly
of tandem repeat regions. Due to the use of complexity-corrected
Smith-Waterman scores some commonly occurring microsatellites may not
be detected.

(v) probable chimeric reads, identified as reads whose confirmed
parts break into two pieces, separated by at most 30 bp and each
having a prematurely terminating alignment with some other read.
Chimeric reads are excluded from the assembly and instead appear as
singlet contigs. The condition that prematurely terminating alignments
exist is necessary to avoid falsely identifying "single links" as
chimeras. Since repeats also may cause prematurely terminating
alignments, it is possible (but unlikely) for a single link
non-chimeric read containing two or more repeats to be falsely
identified as a chimera.
Chimeras in which the first segment consists of thirty or so
bases near the beginning of the read are probably due to unmasked
sequencing vector. In fact, it is useful to test for unremoved vector
in as sensitive a fashion as possible by running phrap with -minscore
and -confirm_score both set to 25, and then examining the chimera list
(once sequencing vector has been removed the assembly itself should be
done with the default parameter values).

(vi) probable deletion reads, identified as reads which match some
other read but have a gap with respect to it of >10 bp (more
specifically, read 1 is determined to have a deletion with respect to
read 2 if there are two pairs of matching segments between the reads
such that the segments in read 1 are nearly adjacent (within 20 bases)
and the separation between the segments in read 2 is at least 10bp
larger than the separation in read 1), and for which there is no other
clone that shows the same deletion (the latter condition is needed to
rule out the possibility of repeated sequences for which some copies
have deletions). For each deletion clone and each such matching read,
the apparent size of the deletion, location of the deletion and (in
parentheses) name of the matching (undeleted) read and extent of the
segment that was deleted from in it are given.
As with the probable chimeras, the probable deletions are not used
in assembly and instead appear as singlet contigs.

(vii) Summary of "rejected" pairwise alignments; i.e. matches which
either prematurely terminate (do not extend as far as they should,
given the apparent sequence quality), or include mismatches between
high quality bases. In such cases the reads are probably from
different repeats, and so the alignments are not used in the assembly
(N.B. at present incompletely extended matches are considered during
the assembly, since a more sensitive criterion used there will reject
them if warranted).

(viii) (N.B. This part of output needs revision).
Summary of information about accepted pairwise alignments,
including no. of confirmed reads (i.e. reads matching some other
read), their average lengths (total, confirmed, "strongly confirmed"
(i.e. matching a reverse sense read) and trimmed); a crude estimate of
the cosmid size (assuming all non-rejected matches are real -- if
there are near-perfect repeats, the size may be underestimated) and
depth of coverage; a histogram of depths (letting the depth of a read
be the maximum depth that occurs in it); a summary of the
discrepancies between matching reads (by strand sense, nucleotide and
quality); and a histogram of spacings between adjacent indel pairs.
By sense of the aligned pair (reads in same direction, reads in reverse directions).

(ix) isolated singletons (reads having no match with score at least
minscore to any other read).

(x) contigs (including singlet contigs for which the read did
have a match with some other read, but could not be consistently
assembled with it). The contigs are sorted by (increasing) number of
reads, so that the singlet contigs appear first. The following
information appears for each contig: the number of reads; the total
length of the contig sequence (not including pads); the length of the
trimmed sequence (i.e. after regions consisting entirely of lower case
letters, N, and X have been removed from either end of the contig);
and a list of the reads in the contig and information about their
alignments. For each read the following information is given: a 'C'
if the read is in reverse orientation; the starting and ending
positions for the read (i.e. its full length, not just the part that
aligns with a given penalty) with respect to the padded sequence; the
read name; the score of the alignment of the read against the contig
sequence; (in parentheses) the highest score of a match of the read
against some other read in a different contig or elsewhere in the same
contig (so reads for which this number is non-zero are those which
overlap a repeat, or an incorrectly or incompletely assembled region);
the per cent mismatch, insertion, and deletion rates for the alignment
of the read against the contig sequence; the number of bases at the
beginning of the read which were not included in the alignment; (in
parentheses) the number of bases at the beginning of the read which
did not align against any other read (i.e. were not confirmed); and
similarly for the bases at the end of the read. If the unparenthesized
number is substantially greater than the parenthesized number that
accompanies it, there could be a problem with the alignment. Note
however that different penalties are used for aligning a read against
another read as opposed to aligning it against the contig sequence.

Following the read list is information to be used in assessing the
quality of the assembly. This includes:

A description of regions in the contig sequence whose quality has
been adjusted on the basis of alignments of reads to the contig (e.g.
regions that had double-stranded confirmation in the initial pairwise
comparison may not have when the contig is assembled, due to
resolution of repeats; and regions that were not confirmed in the
initial pairwise comparisons may be confirmed in the assembled contig,
due to the use of a lower gap penalty which gives more complete
alignments).

Histogram of the qualities of the contig bases; the extents of the
leading and trailing quality 0 regions of the contig (such regions are
likely to be highly inaccurate); and a list of the bases that have
quality < qual_show.

Slack and HQ Mismatch histograms (not yet documented here).

A table showing "gaps" on each strand (i.e. regions for which there
is no aligned part of any read, so that additional data, or possibly
editing to permit alignment of existing data, is required); the
closest read that could potentially be extended or edited to cover the
gap; whether or not the inaccurate, unaligned part of that read
already covers the gap (in which case editing might be sufficient to
close it); and the length of an accurate extended read (from the same
start in the same clone) that would be required to cover the gap. The
table appears following the list of reads in the contig and their
positions. An example (from a contig in C05D11):

Gap Size Closest read (Start) Covers Read length required
now? to cover
Top strand:
left - 244 244+
1809 - 1880 72 x72c5.s1 (1382) No 499
2967 - 2993 27 ae82b07.s1 (2562) Yes 432
5392 - 5536 145 z17c5.s1 (4965) No 572
6036 - 6320 285 z13f6.s1 (5537) No 784
8137 - right 0+ z26c9.s1 (7667) No 469+

Bottom strand:
left - 0 0+ z17b4.s1 ( 563) No 563+
1138 - 1139 2 z14f3.s1 (1143) Yes 6
6940 - 7230 291 z17c10.s1 (7666) No 727
7717 - right 420+

The gaps that say "left" or "right" are the ones at the left and
right ends of the contig (the gap size in this case is the part of the
existing contig sequence at that end that is not double stranded, with
the plus indicating that there is an additional gap to the left or the
right of the contig; these gaps are always designated not covered
("No")), and the relevant read in this case is the one that should be
extended to close gaps between contigs, while the other gaps are
internal gaps and the relevant reads are the ones required to get
double-stranding. Note one problem in this example: the bottom strand
gap of size 2 lies in the inaccurate 5' part of the z14f3.s1 read, so
that would probably not be an appropriate read to edit or get more
data from. Need appropriate cutoff values (i.e. size of 5' part of
read to ignore; this obviously depends on whether the 5' end has
already been trimmed prior to inputting it to phrap). Also, note that
since the full read lengths are used by phrap, the ends of the contigs
may include some very inaccurate sequence; so the size of the end gaps
(i.e. the total amount of inaccurate sequence there) could be larger
than indicated; this isn't particularly important though since in any
case one will always want to extend those reads to try to close the
gaps between contigs.

Phrap uses a lower gap penalty (-2) in aligning reads to the contig
sequence (to allow more complete double-stranding than the -9 default
used for aligning reads to each other).

Gaps with "No" in the Covers? column will always require more data (an
extended read or walk step, depending on the required length). Those
with "Yes" may be coverable by editing of the appropriate read; but it
may be simpler to automatically get longer reads for them also,
particularly if the gap size is large, since substantial editing would
probably be required in that case to use the existing data. Walking
primers could be output directly from OSP analysis of the contig
sequence (high quality part, i.e. all upper case letters) in cases
where the gap looks too large to close with an extended read
(presumably "finish" already does this?)

Following the gap table, there is a table giving, for each quality
value, the total number of bases of that quality in the reads for the
contig; the number which are aligned (i.e. included in the SWAT
alignments of the reads against the contig); the cumulative no. of
aligned bases (for this and higher qualities); the number of bases not
included in the alignment (but potentially alignable -- i.e. not
lying before the beginning or after the end of the second sequence);
the number of discrepancies of each type (substitution, deletion,
insertion); the total # of discrepancies (for this quality level) and
their percentage (of the aligned bases of the given quality), and the
cumulative # of discrepancies and their percentage (of the cumulative
no. of aligned bases). The regions at the ends of the sequence that
consist entirely of quality 0 are reassigned quality values of -1,
since these usually correspond to the lowest quality data at the
extreme ends of reads. (Not done in the contig quality histogram.)
An 'N' in either a read or the contig sequence is always counted as a
substitution error.

Depth 0 regions.

Following this there is more specific information about possible
problems with the assembly:

Unaligned segments: parts of reads of length at least 10 that did not
align to the contig sequence (the part of the read having quality -1
is not considered).

High quality discrepancies: sites in the contig sequence for
which at least one read with a quality at least qual_show disagrees
with the contig sequence. Most errors in the contig sequence will be
found either among these sites or in the regions where the contig
quality is < qual_show.

Low quality suspects: sites in the contig sequence of quality <
qual_show (but greater than 0) for which there is a discrepant read of
equal or greater quality. Such sites are particularly error-prone.
The list gives the site position, base in the contig sequence and its
quality, and quality of the discrepant read.

[Formerly also printed out: for each of the two strands, the # of
reads on that strand having a substitution, insertion or deletion
(with respect to the contig sequence) at that position, and depth (#
of reads whose alignments extend across that position). The site
position is preceded by a '<' (resp. '=') if there is a read with a
discrepant higher (resp. equivalent) quality base at that position.]

There then is given a list of the regions that are not covered by
"unique-reads" (i.e. reads with no non-rejected matches); and of
regions that match (as detected by one or more non-rejected pairwise
alignment between reads) other contigs or other regions in the same
contig. These are likely to be either true near-perfect repeats or
reflections of an error (of omission) in the assembly. To help
distinguish these possibilities, such regions are designated spanned
(if some read in the current contig extends all the way across it and
includes sequence to either side of it; a list of all such reads is
given) or "UNSPANNED" (if no such read exists). Misassembly errors of
commission generally arise from true near-perfect unspanned repeats,
and should result in an "UNSPANNED" flag for at least one of the two
matching regions. However not all of them do -- while the fact that
repeats exist should generally be detected by phrap, the full extent
of the repeat may in some cases not be detected, and the apparent
repeat may be "spanned" even though the full (true) repeat is not.
Moreover in some cases of misassembly only one of the matching contig
regions may be flagged as unspanned. Misassembly errors of omission
should always result in "UNSPANNED" matching regions at the end of one
or both contigs.
Regions of a non-singlet contig matching an anomalous (chimeric or
deleted) read are indicated only in the output for the singlet contig
that contains the anomalous read.

(ix) Following the contig information, there is a summary of the
forward/reverse read consistency checks. For each pair of reads with
the same clone name are given the contig no. and orientation within the
contig (0 = top strand, 1 = bottom strand) for each read; an "L" (for
"link") if the reads occur in different contigs; a '*' if there is an
inconsistency (e.g. forward and reverse reads which are not pointing
towards each other, or two forward reads which are not on the same
strand); distance between read starts for forward/reverse pairs
(should correspond to clone insert size, and be positive); and
distance between starts for forward/forward or reverse/reverse pairs.

N.B. IN ALL PLACES WHERE A POSITION IS GIVEN, THE POSITION IS WITH
RESPECT TO THE UNPADDED SEQUENCE, NOT THE PADDED SEQUENCE.

III) RUNNING cross_match.

The command line should be of the form

cross_match file_1 [file_2 ...] [-option value] [-option value] ...

The sequence files should be in FASTA format. If there are at least
two files, all sequences in the first file are compared to those in
the second (and any subsequent) files; if there is only one file, all
sequences in this file are compared to each other (N.B. at present
when there is a single input file, entries are not compared to their
own complements, but are compared to themselves in the same
orientation and to both orientations of all other reasons). Any
matches are printed out. Options are as described above.

Output:

(N.B. The following description is incomplete -- does not reflect
all of the available options).
The cross_match output lists matches (whose scores exceeding
minscore) between any sequence in the first input file (the "query"
sequences) and a sequence in the second (or later) input file (the
"subject" sequences). By default (i.e. if allmatches is not set) a
match is not shown if its domain is completely contained within the
domains of higher scoring matches, where the "domain" of a match is
the region in the query sequence that is defined by the alignment
start and stop. In the output, matches are ordered by query, and for
each query by position of the start of the alignment.
For each such match, an initial output line gives summary information:

Example:

440 2.38 1.39 0.79 hh44a1.s1 33 536 ( 0) C 00311 ( 3084) 8277 7771 *

Here
440 = smith-waterman score of the match (complexity-adjusted, by default).
2.38 = %substitutions in matching region
1.39 = %deletions (in 1st seq rel to 2d) in matching region
0.79 = %insertions (in 1st seq rel to 2d) in matching region
hh44a1.s1 = id of 1st sequence
33 = starting position of match in 1st sequence
536 = ending position of match in 1st sequence
(0) = no. of bases in 1st sequence past the ending position of match
(so 0 means that the match extended all the way to the end of
the 1st sequence)
C 00311 : match is with the Complement of sequence 00311
( 3084) : there are 3084 bases in (complement of) 2d sequence prior to
beginning of the match
8277 = starting position of match in 2d sequence (using top-strand
numbering)
7771 = ending position of match in 2d sequence
* indicates that there is a higher-scoring match whose domain partly includes the domain
of this match.

N.B. The following only appear if the appropriate option is selected.

Following each match there appears a listing of the discrepancies
between the aligned portions of the two sequences, giving for each
discrepancy its type, position, nucleotide, and quality (in first
sequence), and position in second sequence. This list is followed by a
table giving, for each quality value, the total number of bases of that
quality in the first sequence; the number which are included in the
SWAT alignment; the cumulative no. of aligned bases (for
this and higher qualities); the number of bases not included in the
alignment (but potentially alignable -- i.e. not lying before the
beginning or after the end of the second sequence); the number of
discrepancies of each type (substitution, deletion, insertion); the
total # of discrepancies (for this quality level) and their percentage
(of the aligned bases of the given quality), and the cumulative # of
discrepancies and their percentage (of the cumulative no. of aligned
bases). This information is useful for computing accuracy as a
function of quality, for automatically generated contig sequences.
In the discrepancy listing and table, the regions at the ends of the
sequence that consist entirely of quality 0 are reassigned quality
values of -1, since these usually correspond to the lowest quality
data at the extreme ends of reads. (Not done in the phrap output...)
An 'N' in either sequence is always counted as a substitution error.

[N.B. MUCH OF THE FOLLOWING IS NOW OBSOLETE].

Outline of phrap assembly:

0) Read in sequence & quality data, trim off any near-homopolymer runs
at ends of reads (since this is likely to be garbage), construct read
complements.

1) Find pairs of reads with matching words. Eliminate exact duplicate
reads. Do swat comparisons of pairs of reads which have matching
words, compute (complexity-adjusted) swat score.

2) Find probable vector matches (mark so they aren't used in assembly).

3) Find near duplicate reads.

4) Find reads with self-matches (probable tandem repeats).

5) Find matching read pairs that are "node-rejected" i.e. do not have
"solid" segments.

6) Use pairwise matches to identify confirmed parts of reads; use these to
compute revised quality values.

7) Compute LLR scores for each match (based on qualities of discrepant and
matching bases).

(Iterate above two steps).

8) Find best reads (highest LLR-scores among several overlapping).

9) Identify probable chimeric and deletion reads (withheld from assembly).

10) Construct contig layouts, using (non-rejected) pairwise matches in
decreasing score order (greedy algorithm). Consistency of layout is
checked at pairwise comparison level.

11) Construct contig sequence as mosaic of highest quality parts of
the reads.

12) Align reads to contig; tabulate inconsistencies (read / contig
discrepancies) & possible sites of misassembly. Adjust LLR-scores of
contig sequence.

Description of algorithms:

0) The procedure used in both phrap and cross_match to find
sequence matches is the following. First, (in phrap only) any region
at the beginning or end of a read that consists almost entirely of a
single letter is converted to 'N's; such regions are highly likely to
be of poor data quality which if not masked can lead to spurious
matches. Reads are then converted to uppercase (in order to allow
case-insensitive word matches). All matching words of length at least
minmatch between any pair of sequences are then found, by (i)
constructing a list of pointers to each position (in each sequence)
that begins a word of at least minmatch letters not containing 'N' or
'X'; (ii) sorting the list (using a modified version of quicksort,
with string comparison as the comparison function + a few tricks to
improve speed by keeping track of the parts of the words that are
known already to match); (iii) scanning the sorted list to find pairs
of matching words. For each such pair, a band of a specified width
(in the imaginary dot matrix for the two sequences), centered on the
diagonal defined by the matching words, is defined. Overlapping bands
(for the same pair of sequences) are merged. Following construction
and merging of bands, a "recursive" SWAT search of each band is used
to find matching segments with score greater than or equal to
minscore: "recursive" here means that if such a match is found, the
corresponding aligned segments in each sequence are (conceptually) X'd
out and the process repeated on the remaining portions (if any) of
each sequence. This procedure allows (at the cost of some redundant
calculation) detection of multiple matching pieces in different
locations, and will usually find most copies of repeats (since they
generally occur in separate bands).
By default, SWAT scores are complexity-adjusted (so that matches for
which the collection of matching nucleotides has biassed composition
have their scores significantly penalized.)

1) In phrap, sequence pairs with score >= minscore are considered for
possible merging. A critical issue here is the appropriate score
matrix for SWAT. [Given the generally high accuracy of reads and the
desirability of minimizing false matches due to imperfect repeats, a
relatively high penalty seems desirable. Currently I use +1 for a
match, -9 for a mismatch involving A,C,G, or T, 0 for a match or
mismatch involving N, -1 for a match or mismatch involving X, -11 for
the gap initiating penalty (the first residue in a gap), and -10 for
the gap extension penalty (each subsequent residue).] Setting the
indel penalties slightly higher than the mismatch penalties gives
better alignments by favoring mismatches in compression regions. Since
SWAT uses profiles, one has the option of distinguishing different
quality levels by use of different symbols (e.g. upper case for more
accurate calls, lower case for less accurate calls) and setting the
penalties appropriately; this would involve using the quality levels
to adjust the symbols used in the reads, which should be done AFTER
the word matching routine. (At present it does not seem particularly
useful to use differing positive scores for different nucleotides to
reflect their different frequencies).

1a) Determine "confirmed" part of each read (i.e. the part which
appears in a SWAT alignment against some other read; a read with the
same name -- up to an internal '.' if any -- is not considered
confirming). Probable chimeras are detected as reads for which the
confirmed part can be separated into two non-overlapping pieces,
separated by at most MAX_CHIMERA_GAP bp (currently 30) such that the
part confirmed by a given read lies in one or the other piece but not
both; AND such that each piece has a prematurely terminating alignment
with some other read. Chimeric reads may arise from i) chimeric
clones; ii) gel mistracking across lanes; iii) unremoved sequencing
vector; (iv) deletion clones. Reads having two non-overlapping
confirmed pieces (but failing the "premature termination" condition)
are often non-chimeras that are the only link between two
non-overlapping contigs, and thus should be permitted in the assembly.
Identify "strongly confirmed" regions in each read: having matches
to a reverse sense read.
Identify deletions.
Identify "rejected" alignments -- those which don't extend as far as
they should, or that have mismatches involving high quality bases.

2) Sort all matching pairs by decreasing LLR score, and assemble
layout by progressively merging pairs with high score. Consistency of
merge is required: the SWAT alignment implies relative offset of one
contig with respect to another, and hence a potential implied overlap.
The SWAT alignment(s) should extend over essentially the entire region
of implied overlap, apart from an allowable gap (necessary to allow
for the fact that the ends of the alignments are somewhat uncertain).
Probable deletion clones (identified as reads which have two adjacent
pieces matching two separated pieces of another read, and having no
confirming read across the breakpoint), and chimeras are not used in
any merges. [N.B. Following is obsolete: Potential chimeras are
deferred to a second pass (as well as being flagged in output), so
that true reads will assemble first.]

3) Construct contig "consensus" as a mosaic of individual reads.
Strategy: ends of alignments, and midpoints of perfectly matching
segments of sufficient length, define crosslinks - pursue crosslinks
which increase accuracy and extend read. Formally this is done by
constructing a weighted directed graph whose nodes consist of
(selected) positions in reads; there are bidirectional edges with
weight 0 between aligned bases in overlapping reads, and
unidirectional edges from 5' to 3' positions within a single read,
with weight equal to the total quality of the sequence between the two
nodes. Standard C.S. algorithms (dating back to Tarjan) permit
identification of a path with maximal weight, in time linear w.r.t.
number of nodes.
The quality values for the resulting sequence are inherited from the
read segments of which it is composed.