/***************************************************************************** # Copyright (C) 1994-1996 by Phil Green. # All rights reserved. # # This software is part of a beta-test version of the swat/cross_match/phrap # package. It should not be redistributed or # used for any commercial purpose, including commercially funded # sequencing, without written permission from the author and the # University of Washington. # # This software is provided ``AS IS'' and any express or implied # warranties, including, but not limited to, the implied warranties of # merchantability and fitness for a particular purpose, are disclaimed. # In no event shall the authors or the University of Washington be # liable for any direct, indirect, incidental, special, exemplary, or # consequential damages (including, but not limited to, procurement of # substitute goods or services; loss of use, data, or profits; or # business interruption) however caused and on any theory of liability, # whether in contract, strict liability, or tort (including negligence # or otherwise) arising in any way out of the use of this software, even # if advised of the possibility of such damage. # #*****************************************************************************/ Documentation for: phrap ("phragment assembly program", or "phil's revised assembly program") -- a program for assembling shotgun DNA sequence data. Key features: allows use of entire read (not just highest quality part); uses a combination of user-supplied and internally computed data quality information to improve accuracy of assembly in the presence of repeats; constructs contig sequence as a mosaic of the highest quality parts of reads (rather than a consensus). N.B. phrap does not provide editing or viewing capabilities; these are available with consed or phrapview. cross_match -- a general-purpose utility (based on SWAT, an efficient implementation of the Smith-Waterman algorithm) for comparing any two sets of (long or short) DNA sequences. For example, it can be used to compare a set of reads to a set of vector sequences and produce vector-masked versions of the reads; a set of cDNA sequences to a set of cosmids; contig sequences found by two alternative assembly procedures (e.g. phrap and xbap) to each other; or phrap contigs to the final edited cosmid sequence. It is slower but more sensitive (because it allows gaps) than BLASTN. I) INPUT FILES are of two types: sequence files (required), and quality files (optional). Ia) Sequence files. With both phrap and cross_match, either a single sequence file or multiple sequence files can be specified. If a single file is specified, all sequences in it are compared to each other; if more than one file is specified, sequences in the first file are compared to sequences in the second (and subsequent) files. Typically phrap is used with a single input sequence file containing the reads, unless one wants to assemble one set of reads "against" another sequence or set of sequences (see below). Phrap is designed to be able to use all of each read sequence in the assembly, not just the trimmed (highest quality) part, so the full sequence of each read should be provided. Cross_match is generally used with two input sequence files, the first containing the "query" sequences and the second containing the "subject" sequences. Sequence files must be in FASTA format: each sequence should have a single header line with leading character '>' followed immediately (i.e. with no intervening spaces) by the read name, and then (optionally), on the same line, a space or spaces followed by descriptive information about the read; the header line is followed by a separate line or lines containing the sequence. The read names, descriptive information, and sequence can be of arbitrary length (up to a combined total size of about 1 Gb), since space for them is allocated dynamically. Read names: It is assumed by phrap that the part of the read name up to the first '.', if any, is the name of the subclone from which the read is derived; this is used, for example, in checking for chimeric reads (for which purpose it is important to know when two reads come from the same subclone, since another read from the same subclone should not be regarded as independent confirmation of a join). IT IS IMPORTANT THAT DIFFERENT READS FROM THE SAME CLONE BE GIVEN NAMES THAT ADHERE TO THIS CONVENTION -- OTHERWISE THEY WILL BE INTERPRETED (INCORRECTLY) TO CONFIRM EACH OTHER, WHICH MAY ADVERSELY AFFECT ASSEMBLY. Orientation of the read (in cases where data is acquired from each end of a subclone insert) is assumed to be indicated by the first letter following the '.', namely 's' or 'x' for forward reads, and 'r' or 'y' for reverse reads. At present the read orientation is used only for consistency checking following assembly, and not in the assembly itself. In addition, it is assumed that dye terminator reads are indicated by an 'x' (for forward reads) or 'y' (for reverse reads) following the '.', and that 's' and 'r' refer to dye primer reads. This is used in the quality adjustment procedure (see below), where confirmation of a non-dye-terminator read by a terminator read counts as significantly as confirmation by an opposite-strand read. If the read name ends in ".seq" the .seq is removed. Descriptive information on the header line is not used by the program at present, although it is read in, and is written back out to the .ace and .singlets files. Sequence: Non-alphabetic characters (including '*') in the sequence lines are automatically stripped out when the file is read in, except for '-' which is converted to 'N'; '>' must not appear anywhere within the sequence since it is assumed to start the header of a new sequence (even if it is not at the beginning of a line). Lower case letters are converted to upper case on readin, so it is no longer possible (as in previous phrap versions) to use case as a crude indication of quality. Vector screening: Following creation of a FASTA file with the raw read sequences, it is important to remove or screen out sequencing (subclone) vector sequence. (It is unnecessary to remove the cloning (e.g. cosmid) vector, unless the inserts from multiple overlapping cosmids are being assembled simultaneously; however it is generally useful to remove at least the central part of the cosmid vector, since this then provides a natural point at which phrap can break the circular cosmid sequence into a linear one). Unremoved sequencing vector may cause reads to be identified as "possible chimeras", or otherwise interfere with proper assembly. Some vector removal programs may be unreliable at identifying vector sequences that are rearranged or found in the lower quality part of the read, and I recommend that you screen for sequencing vector using cross_match, whether or not you have already screened them using another program. First, create a FASTA file, say "vector", with all of the vector sequences you want to screen for (I use one that contains all vector sequences used in any of the ongoing sequencing projects, in case the vector is misidentified in the current project). If your read file is named "C05D11.reads" (for example), then run the command cross_match C05D11.reads vector -minmatch 12 -minscore 20 -screen > screen.out (This uses somewhat more sensitive parameter settings than the cross_match defaults). The '-screen' option causes a file named "C05D11.reads.screen" to be created, containing "vector masked" versions of the original sequences: i.e. any region that matches any part of a vector sequence is replaced by X's. This ".screen" file should be used in all subsequent analyses. The output from cross_match (which has been redirected to the file screen.out in the above example) lists the matches that were found (see below). N.B. If you have created a .qual file for your reads (say C05D11.reads.qual), be sure to rename or copy it to C05D11.reads.screen.qual so that it will be recognized by phrap when C05D11.reads.screen is used as the input FASTA file. Ib) Input of quality information. For each input sequence file in FASTA format, you may optionally include a corresponding file of data quality information. This is strongly recommended since it greatly improves the accuracy of assembly and of the consensus sequence. The name of this file should consist of the name of the FASTA file, with ".qual" appended. The format of the .qual file is similar to that of the corresponding FASTA file. For each read there should appear a header line identical (except possibly for the "description" field) to that in the FASTA file. This is followed by one or more lines giving the qualities for each base. Quality values should be integers between 0 and 99 (inclusive), and should be separated by spaces. No other (non-digit, non-white space) characters should appear. The total number of quality values for the read must match the number of bases that appear in the FASTA file. Also, the orders of reads in the FASTA and corresponding .qual file must be identical. Bases 'N' or 'X' are automatically assigned quality 0, whether or not a .qual file is provided. If provided, quality information is used by phrap both in the assembly itself (where discrepancies between high quality bases are used to discriminate repeats from true overlaps) and in determining the contig consensus sequence (which is formed as a mosaic of the highest quality parts of the reads). Because phrap generates its own "quality" measures (based on read-read confirmation) it performs reasonably well even if no input quality is provided, but when available the input data quality information can substantially increase the accuracy of the consensus, particularly in regions where there are strand-specific effects on quality (e.g. compressions) since in such cases the input quality information may constitute the only basis for choosing one strand over the other. If your sequencing data is from ABI 373 or 377 automated sequencers I strongly recommend that you generate the read sequences using the basecalling program phred, which will produce an appropriate quality file for input to phrap. Phred generally gives more accurate base calls overall, for a longer distance, than the ABI software does. On the basis of the trace characteristics, phred computes a probability p of an error in the base call at each position, and converts this to a quality value q using the transformation q = -10 log_10(p). Thus a quality of 30 corresponds to an error probability of 1 / 1000, a quality of of 40 to an error probability of 1 / 10000, etc. The quality value of 99 should be reserved for base calls that have been visually inspected and verified as "highly accurate" (during editing), and 98 for bases that have been edited but are not highly accurate (these are converted to quality 0 in phrap). If two reads appear to match but have discrepancies involving bases that have quality 99 in both reads, phrap does not allow them to be merged during assembly. At present this is the only way to break false joins made by phrap. If you wish to generate your own quality values you should be prepared to experiment. It is particularly important that possible insertion errors (which are fairly frequent late in the trace with ABI basecalls) receive quality 0, because in choosing the "consensus" phrap tends to favor the reads with the highest total quality in a given region. Also, if at all possible you should use quality values that are related to error probabilities in the manner described above, since to some extent phrap is tuned to expect these (and its output qualities will then have a similar interpretation). If all input quality values are relatively small (less than 15), phrap assumes that they do not correspond to error probabilities and attempts to rescale them so that the largest quality value is approximately 30. This allows different input scales to be used. Phrap adjusted quality values/error probabilities: Phrap computes adjusted quality values for each read on the basis of read-read confirmation information, as follows. If a read is confirmed by an opposite-strand or different-chemistry (dye terminator vs. dye primer) read at a given position, that position is given a quality which is the sum of the two input read qualities; when more than one opposite-strand read confirms a given position, only the single highest quality from all opposite-strand matching reads is used. If qualities are related to error probabilities as described above, this procedure can be interpreted as computing an error probability for each base which is the product of the error probabilities for the two reads; since error profiles for opposite-strand or different chemistry reads are essentially independent, this is reasonable, although it is somewhat conservative in that it does not take into account same-strand matches (which clearly should count for something, although they certainly are not independent). Contig positions are assigned quality values equal to the highest adjusted quality of any read at that position, and then adjusted downward to take into account any discrepancies with other reads. As a result, when phred input qualities are used, the output phrap qualities associated to each contig position have a natural interpretation as (conservative) error probabilities. Such error probabilities provide an extremely useful guide to where editing or additional data collection is needed. The phrap adjusted qualities are used in computing "LLR scores" for each pairwise match between two reads. These scores take into account the qualities of the base calls in the reads, and are (approximate) log-likelihood ratios for comparing the hypothesis that the reads truly overlap to the hypothesis that they are from 95% similar repeats. The point is that discrepancies between overlapping reads are due to base-calling errors and thus tend to occur in low-quality bases, whereas reads from different repeats can have high-quality discrepancies that are due to sequence differences between the repeats; and the probability of the observed data under each hypothesis can be quantified using the interpretation of the phrap qualities in terms of error probabilities. A pairwise match tends to have a positive LLR score if the two reads overlap, whereas it tends to have a negative LLR score if the two reads are from different repeats (unless the repeats are nearly identical). II) RUNNING phrap. The command line should be of the form phrap seq_file_name1 [seq_file_name2 ...] [-option value] [-option value] ... The output is extensive and should be redirected to a file. If more than one seq_file_name is provided, then reads in the first file are assembled only against sequences in the 2d (and later) files; no sequences are compared to other sequences in the same file. In this case, many features of phrap (e.g. checking for anomalies, vector, etc.) which rely on comprehensive read-read comparisons are turned off, SWAT scores are used in place of LLR_scores, and implied merges are not performed -- i.e. each read in the first file is merged with at most one sequence in the 2d file. This usage of phrap is convenient for aligning a set of reads against a known "finished" sequence in order to look for polymorphisms, for example. Example: phrap C05D11.reads.screen -minmatch 12 -ace > phrap.out The following options are available for both cross_match and phrap: option name value default value -penalty mismatch (substitution) penalty for SWAT comparisons -2 -gap_init gap initiating penalty for SWAT comparisons penalty - 2 -gap_ext gap extension penalty for SWAT comparisons penalty - 1 -minmatch minimum length of matching word to nucleate SWAT comparison; 14 if minmatch = 0, a full (non-banded) comparison is done -minscore minimum SWAT score 30 -bandwidth half band width for banded SWAT searches 14 (full width is 2 n + 1) -matrix score matrix for SWAT comparisons (supersedes -penalty) [default has +1 for matches, -penalty for mismatches] -raw use raw rather than complexity-adjusted Smith-Waterman scores. (default is to adjust scores to reflect complexity of matching sequence). (This is a flag; no value should be provided). -tags tags selected output lines, to facilitate parsing (This is a flag; no value should be provided). -indexwordsize size of indexing (hashing) words in words.c. The value of 10 this parameter has a (small) effect on run time and memory usage. The following options apply only to phrap: -maxgap maximum gap (unmatched region) allowed in merging contigs 30 -trim_start no. of bases to be removed at beginning of each read 0 -vector_bound no. of bases at beginning of each read, matches within which are assumed to be vector 60 -trim_penalty penalty used for identifying degenerate sequence at beginning/end of read -2 -trim_score minimum score for identifying degenerate sequence at beginning/end of read 20 -confirm_length minimum size of confirming segment (starts at 3d distinct 8 nuc following discrepancy) -confirm_trim amount by which confirming segments are trimmed at edges 1 -confirm_penalty penalty used in aligning against "confirming" reads -5 -confirm_score minimum alignment score for a read to be allowed to "confirm" part of another read 30 -qual_show LLR cutoff for flagging "low_quality" regions in 20 contig sequence and "high quality" discrepancies between read and contig. Bases in the .contigs, and .ace file are lower case iff their LLR-converted quality is below this value. -node_seg minimum segment size (for purposes of traversing weighted directed graph) 8 -node_space spacing between nodes (in weighted directed graph) 4 -revise_greedy splits initial greedy assembly into pieces at "weak joins", and then tries to reattach them to give higher overall score. (This is a flag; no value should be provided). -ace create ".ace" file with the assembly information padded reads, etc. (This is a flag; no value should be provided). -view create ".view" file suitable for input to phrapview (This is a flag; no value should be provided). The following options apply only to cross_match: -masklevel controls which matches (with scores at least minscore) are 80 displayed. A match is not displayed if the percentage of aligned query bases in the match that are contained within the aligned bases of higher-scoring matches is masklevel or higher. For example: if masklevel = 0, only the single highest scoring match for each query is displayed 100 any match that is not completely contained within a higher scoring match is displayed 101 all matches are displayed. -alignments display the alignment for each match (Warning -- may produce huge amount of output!) (This is a flag; no value should be provided). -discrep_lists give list of discrepancies for each match (This is a flag; no value should be provided). -discrep_tables give table of discrepancies (by quality) for each match (default is to display a single table, that combines data for all matches) (This is a flag; no value should be provided). -screen create a ".screen" file in which regions in the first input sequence file that match some region in the second (or later) file are replaced by X's. (This is a flag; no value should be provided). Normally you should just use the default values of the parameters; I haven't systematically explored optimal values for these, but the current ones appear to work reasonably well. To increase (potential) sensitivity of the pairwise comparisons, at the cost of a longer run time, try decreasing minmatch, minscore, and/or penalty, and increasing bandwidth. Phrap output files (with the exception of the standard output, all output files are given names consisting of the input read file, with an appropriate suffix (given below) appended): 1) .singlets file (named C05D11.reads.screen.singlets in the above example): A FASTA file containing the singlet sequences (i.e. the singleton reads with no match to any other read). 2) .contigs file (named C05D11.reads.screen.contigs in the above example): A FASTA file containing the sequences (without pads) of all contigs (upper case <=> quality is >= qual_show). These include singleton contigs consisting of single reads with a match to some other contig, but that couldn't be merged consistently with it. Thus reads that would go into xbap's "failed file" instead are considered separate contigs by phrap and appear with the other contigs. 3) .contigs.qual file: The corresponding quality file (see above for description of how phrap computes qualities for contig positions). 4) .log file: A summary of how the assembly proceeded, and how the contig sequences were constructed as mosaics of the individual reads. This is probably only of interest to me, for trouble-shooting purposes. 5) .ace file: (Produced only when the -ace option is used). A ".ace" formatted file containing (i) the sequences of the padded contigs and reads; (ii) the locations of the reads with respect to the contig sequence; (iii) information about how the contig sequence is pieced together as a mosaic of the high quality parts of the reads; (iv) an indication of where the reads can be "clipped" to remove low quality sequence at either end; (v) qualities of the bases in the contig sequence. Additional information will be added in the future. Sequence information is indicated by a header line consisting of the word DNA followed by the name of the sequence, followed by lines giving the sequence. Sequences are padded (with a '*' character) so that reads align directly against contigs, i.e. the alignments can be reconstructed exactly using only the information in this file. The end of the sequence is followed by a blank line. For reads in reverse orientation with respect to the contig, the sequence of the complement is given instead of the original sequence, and the suffix ".comp" is appended to the name of the read. The case (upper or lower) of bases may differ from that in C05D11.reads.screen, reflecting adjustment of quality values by phrap (upper case <=> LLR_quality is at least qual_show). Examples: DNA x72g12.s1.comp ccaatgtgacttggataatggcgtaaggcttgactttgacgtcccagtt* aa*ccctg*ctt*aaact*accttc*c*tta*tcctaaac*attaac*tt cttagctcaa*t*cccattcca*gcgaatcttca*aactgtcaaactana *tcgt*gaaatc*tcctccTGCTAGATtc*tcaaATCTGTCAATTGAAGA TCGTAGTTACTTGTTCCTCTACACAGATTTGCTTTTTGAATCTCCTGCCA TGATCGATGGAGTCTTGACATCTGCTGATGATGTTGCCAAACATTTCAC* GAAAGAT*CTCATTGATCACTCTATTCAAGT*GGGAGTC*TCCGGTCTTT ATGATCGATTTGTGAATCTTCGTATCAAAGTTGGAGCTGACAAGTATCCA TTGCTTGCGAAATGGGCTCAAATTTTCACTCAGGGAGTCGTCTTCGATCC TTCAAGAATTCATCAATGTGCTCAAAAGTTGGCTGGAGAAGCTCGTGATC GGAAGAGAGATGGATGTACTGTGGCATCAACTGCAGTAGCTTCAATGGTT TATGGAAAGAGTATGTTATTt DNA Contig35 GATAAGataATAAtGGAAAATaGAAaccGGAaAaATaATAAaATaaTTTc aGATcGcTGaAGAaGAaGaGAAGaGAATAGcAGccCaATGTGAGAAGCTC GGcAAAAAAGGACTCGAAGaaGcGGGAAAGAGTCTgGAAGcTGCCATTCT TGAGAACACTGcCAATcATcCaaGTGCTGAACTTTTGGATCAATTGATCG TCAAAGATcTTGAAGCTTTTGATCGTTTCCCAGTTCAATCCCTGACTTCA AACTCAcCTTCTC*TTACTCCTCAACAATCAACTTTCTTAGCTCAATT*C CCATTCCACGCGAATCTTCACAACTGTCCAACTAAATTCGtGGAAATCTT Clipping information for a read is indicated by a line consisting of the word "Sequence" followed by the name of the read, followed by two lines that each indicate the beginning and ending bases that remain after low quality (quality 0) sequence is removed from each end of the sequence. The line starting with the word "Clipping" gives the base coordinates with respect to the unpadded sequence, while the one starting with "Clipping*" refers to the padded sequence. If (and only if) the input fasta file included a "description" field (on the header line), that appears as a separate "Description" line following the clipping information. Example: Sequence x72g12.s1.comp Clipping 153 548 Clipping* 170 570 Description ALF run #5 The assembly information is indicated by a line consisting of the word "Sequence" followed by the name of the contig sequence, followed by lines giving the positions of the reads in this contig. For each read there are two lines: one starting with the word "Assembled_from" and giving the name of the read together with its starting and ending positions with respect to the unpadded contig sequence; and one starting with the word "Assembled_from*" which instead has positions with respect to the padded contig sequence. (In both cases the starting and ending positions are those of the first and last base of the read, not those of the "clipped" part). Note that it is possible for a read to overhang the end of the contig sequence, so that start positions may be negative and end positions may be greater than the number of the last base in the contig. Information about how the contig sequence is pieced together is given in the form of "Base_segment" lines: each such line gives the starting and ending positions of the segment within the contig sequence, the name of the read from which the sequence is taken, and the starting and ending positions of the corresponding segment within that read. Example: Sequence Contig35 Assembled_from z14f12.s1 1 596 Assembled_from* z14f12.s1 1 602 Assembled_from z13f5.s1 58 566 Assembled_from* z13f5.s1 58 572 Assembled_from z25a11.s1 115 738 Assembled_from* z25a11.s1 115 744 Assembled_from z14d4.s1.comp 145 741 Assembled_from* z14d4.s1.comp 145 747 Base_segment* 350 412 z26g6.s1 70 132 Base_segment 343 347 z13h11.s1 119 123 Base_segment* 345 349 z13h11.s1 121 125 Base_segment 339 342 z26g6.s1 60 63 Base_segment* 341 344 z26g6.s1 61 64 Base_segment 332 338 z13f5.s1 275 281 Base_segment* 334 340 z13f5.s1 277 283 Base_segment 331 331 z26g6.s1 52 52 Base_segment* 333 333 z26g6.s1 53 53 Base_segment 222 330 z25a11.s1 108 216 Base_segment* 222 332 z25a11.s1 108 218 Base_segment 166 221 z13f5.s1 109 164 Base_segment* 166 221 z13f5.s1 109 164 Base_segment 142 165 z25a11.s1 28 51 Base_segment* 142 165 z25a11.s1 28 51 Base_segment 77 141 z14f12.s1 77 141 Base_segment* 77 141 z14f12.s1 77 141 Base_segment 61 76 z13f5.s1 4 19 Base_segment* 61 76 z13f5.s1 4 19 Base_segment 1 60 z14f12.s1 1 60 Base_segment* 1 60 z14f12.s1 1 60 Quality information is indicated by a line with the word BaseQuality followed by the name of the contig or read; this is followed by a line or lines giving the numerical quality values. An arbitrary number of values may appear on each line. As with other data, the end of the BaseQuality information for a given sequence is indicated by a blank line. Example: BaseQuality Contig3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 31 21 24 23 23 22 22 22 22 30 31 20 20 20 20 20 20 30 30 31 30 33 33 34 34 34 34 31 33 31 34 34 30 30 25 25 25 26 30 30 33 37 30 33 36 35 23 23 24 24 24 24 30 30 30 30 30 30 30 30 30 33 33 33 33 30 30 33 33 33 30 33 33 29 29 29 29 33 34 33 33 21 21 21 21 23 23 23 23 23 23 24 24 26 26 26 26 26 26 26 40 33 30 30 30 30 33 33 33 33 33 30 33 33 33 33 33 33 33 31 31 31 31 34 37 40 40 45 37 52 52 52 52 52 52 55 55 59 64 65 68 58 57 49 49 49 49 49 49 49 56 60 53 60 53 49 49 49 49 49 34 45 45 45 45 30 25 25 25 28 30 45 45 49 49 49 49 49 70 60 60 59 59 53 56 53 53 53 55 53 53 60 45 49 49 42 42 42 45 45 36 33 34 49 46 46 46 54 54 59 53 49 49 42 41 43 40 44 40 49 49 49 54 58 53 52 57 51 51 41 39 15 15 35 35 [N.B. MUCH OF THE FOLLOWING HAS NOT BEEN RECENTLY UPDATED AND MAY BE OBSOLETE]. 6) standard output (redirected to phrap.out in the above example): contains a list of the contigs and the reads they contain, with information about how good the alignment of the read against the contig sequence is. Also identifies matching regions within and between contigs, suspect reads (probable deletions or chimeras), indicates possible problems with assembly or contig sequence (e.g. discrepancies between reads and contig sequence), and gives results of the forward/reverse read consistency checks, There is quite a lot of information in this output. It has the ultimate goal of giving a complete picture of the final assembly, including possible sites of assembly error; some of this information would probably be better represented in graphical form. The output begins by giving summary information about the parameter settings and input data file(s) (sequence and quality). It then lists (i) regions at beginnings or ends of reads which consist largely (>50 %) of a single nucleotide, usually due to degenerate signal. These are internally converted to N's, to prevent them from causing spurious matches which could interfere with assembly. (Note that the .pads file output by phrap will contain the N's rather than the original sequence). (ii) Exact and near-exact duplicate reads. Exact duplicates, which probably represent duplicate entry of the same data, are excluded from assembly. The near-exact duplicates (which are only listed if they appear to come from different clones, according to the nomenclature convention assumed by phrap) should be examined to make sure they really are from different clones. If not, you are not using the correct read nomenclature. (iii) probable unremoved sequencing vector, detected as regions at beginnings of reads which match only the beginnings of other reads. This method is not foolproof and should not be considered a substitute for screening out sequencing vector using cross_match prior to running phrap. The unremoved vector may simply be inaccurate vector sequence, that did not match the correct sequence in the original cross_match screen but does match other reads in which similar errors occur. In any case, since the procedure in phrap is not foolproof, you should be sure to add the (potentially erroneous) sequence that is displayed in the phrap output to your vector file, and redo the cross_match vector screen, to increase the likelihood that you catch all of the vector. If you find examples where sequence that is not vector is erroneously being labelled as such by phrap, please let me know. (iv) reads which have off-diagonal matches to themselves (due e.g. to short tandem repeats), in the same orientation. This information is ignored at present, but ultimately will be used to improve assembly of tandem repeat regions. Due to the use of complexity-corrected Smith-Waterman scores some commonly occurring microsatellites may not be detected. (v) probable chimeric reads, identified as reads whose confirmed parts break into two pieces, separated by at most 30 bp and each having a prematurely terminating alignment with some other read. Chimeric reads are excluded from the assembly and instead appear as singlet contigs. The condition that prematurely terminating alignments exist is necessary to avoid falsely identifying "single links" as chimeras. Since repeats also may cause prematurely terminating alignments, it is possible (but unlikely) for a single link non-chimeric read containing two or more repeats to be falsely identified as a chimera. Chimeras in which the first segment consists of thirty or so bases near the beginning of the read are probably due to unmasked sequencing vector. In fact, it is useful to test for unremoved vector in as sensitive a fashion as possible by running phrap with -minscore and -confirm_score both set to 25, and then examining the chimera list (once sequencing vector has been removed the assembly itself should be done with the default parameter values). (vi) probable deletion reads, identified as reads which match some other read but have a gap with respect to it of >10 bp (more specifically, read 1 is determined to have a deletion with respect to read 2 if there are two pairs of matching segments between the reads such that the segments in read 1 are nearly adjacent (within 20 bases) and the separation between the segments in read 2 is at least 10bp larger than the separation in read 1), and for which there is no other clone that shows the same deletion (the latter condition is needed to rule out the possibility of repeated sequences for which some copies have deletions). For each deletion clone and each such matching read, the apparent size of the deletion, location of the deletion and (in parentheses) name of the matching (undeleted) read and extent of the segment that was deleted from in it are given. As with the probable chimeras, the probable deletions are not used in assembly and instead appear as singlet contigs. (vii) Summary of "rejected" pairwise alignments; i.e. matches which either prematurely terminate (do not extend as far as they should, given the apparent sequence quality), or include mismatches between high quality bases. In such cases the reads are probably from different repeats, and so the alignments are not used in the assembly (N.B. at present incompletely extended matches are considered during the assembly, since a more sensitive criterion used there will reject them if warranted). (viii) (N.B. This part of output needs revision). Summary of information about accepted pairwise alignments, including no. of confirmed reads (i.e. reads matching some other read), their average lengths (total, confirmed, "strongly confirmed" (i.e. matching a reverse sense read) and trimmed); a crude estimate of the cosmid size (assuming all non-rejected matches are real -- if there are near-perfect repeats, the size may be underestimated) and depth of coverage; a histogram of depths (letting the depth of a read be the maximum depth that occurs in it); a summary of the discrepancies between matching reads (by strand sense, nucleotide and quality); and a histogram of spacings between adjacent indel pairs. By sense of the aligned pair (reads in same direction, reads in reverse directions). (ix) isolated singletons (reads having no match with score at least minscore to any other read). (x) contigs (including singlet contigs for which the read did have a match with some other read, but could not be consistently assembled with it). The contigs are sorted by (increasing) number of reads, so that the singlet contigs appear first. The following information appears for each contig: the number of reads; the total length of the contig sequence (not including pads); the length of the trimmed sequence (i.e. after regions consisting entirely of lower case letters, N, and X have been removed from either end of the contig); and a list of the reads in the contig and information about their alignments. For each read the following information is given: a 'C' if the read is in reverse orientation; the starting and ending positions for the read (i.e. its full length, not just the part that aligns with a given penalty) with respect to the padded sequence; the read name; the score of the alignment of the read against the contig sequence; (in parentheses) the highest score of a match of the read against some other read in a different contig or elsewhere in the same contig (so reads for which this number is non-zero are those which overlap a repeat, or an incorrectly or incompletely assembled region); the per cent mismatch, insertion, and deletion rates for the alignment of the read against the contig sequence; the number of bases at the beginning of the read which were not included in the alignment; (in parentheses) the number of bases at the beginning of the read which did not align against any other read (i.e. were not confirmed); and similarly for the bases at the end of the read. If the unparenthesized number is substantially greater than the parenthesized number that accompanies it, there could be a problem with the alignment. Note however that different penalties are used for aligning a read against another read as opposed to aligning it against the contig sequence. Following the read list is information to be used in assessing the quality of the assembly. This includes: A description of regions in the contig sequence whose quality has been adjusted on the basis of alignments of reads to the contig (e.g. regions that had double-stranded confirmation in the initial pairwise comparison may not have when the contig is assembled, due to resolution of repeats; and regions that were not confirmed in the initial pairwise comparisons may be confirmed in the assembled contig, due to the use of a lower gap penalty which gives more complete alignments). Histogram of the qualities of the contig bases; the extents of the leading and trailing quality 0 regions of the contig (such regions are likely to be highly inaccurate); and a list of the bases that have quality < qual_show. Slack and HQ Mismatch histograms (not yet documented here). A table showing "gaps" on each strand (i.e. regions for which there is no aligned part of any read, so that additional data, or possibly editing to permit alignment of existing data, is required); the closest read that could potentially be extended or edited to cover the gap; whether or not the inaccurate, unaligned part of that read already covers the gap (in which case editing might be sufficient to close it); and the length of an accurate extended read (from the same start in the same clone) that would be required to cover the gap. The table appears following the list of reads in the contig and their positions. An example (from a contig in C05D11): Gap Size Closest read (Start) Covers Read length required now? to cover Top strand: left - 244 244+ 1809 - 1880 72 x72c5.s1 (1382) No 499 2967 - 2993 27 ae82b07.s1 (2562) Yes 432 5392 - 5536 145 z17c5.s1 (4965) No 572 6036 - 6320 285 z13f6.s1 (5537) No 784 8137 - right 0+ z26c9.s1 (7667) No 469+ Bottom strand: left - 0 0+ z17b4.s1 ( 563) No 563+ 1138 - 1139 2 z14f3.s1 (1143) Yes 6 6940 - 7230 291 z17c10.s1 (7666) No 727 7717 - right 420+ The gaps that say "left" or "right" are the ones at the left and right ends of the contig (the gap size in this case is the part of the existing contig sequence at that end that is not double stranded, with the plus indicating that there is an additional gap to the left or the right of the contig; these gaps are always designated not covered ("No")), and the relevant read in this case is the one that should be extended to close gaps between contigs, while the other gaps are internal gaps and the relevant reads are the ones required to get double-stranding. Note one problem in this example: the bottom strand gap of size 2 lies in the inaccurate 5' part of the z14f3.s1 read, so that would probably not be an appropriate read to edit or get more data from. Need appropriate cutoff values (i.e. size of 5' part of read to ignore; this obviously depends on whether the 5' end has already been trimmed prior to inputting it to phrap). Also, note that since the full read lengths are used by phrap, the ends of the contigs may include some very inaccurate sequence; so the size of the end gaps (i.e. the total amount of inaccurate sequence there) could be larger than indicated; this isn't particularly important though since in any case one will always want to extend those reads to try to close the gaps between contigs. Phrap uses a lower gap penalty (-2) in aligning reads to the contig sequence (to allow more complete double-stranding than the -9 default used for aligning reads to each other). Gaps with "No" in the Covers? column will always require more data (an extended read or walk step, depending on the required length). Those with "Yes" may be coverable by editing of the appropriate read; but it may be simpler to automatically get longer reads for them also, particularly if the gap size is large, since substantial editing would probably be required in that case to use the existing data. Walking primers could be output directly from OSP analysis of the contig sequence (high quality part, i.e. all upper case letters) in cases where the gap looks too large to close with an extended read (presumably "finish" already does this?) Following the gap table, there is a table giving, for each quality value, the total number of bases of that quality in the reads for the contig; the number which are aligned (i.e. included in the SWAT alignments of the reads against the contig); the cumulative no. of aligned bases (for this and higher qualities); the number of bases not included in the alignment (but potentially alignable -- i.e. not lying before the beginning or after the end of the second sequence); the number of discrepancies of each type (substitution, deletion, insertion); the total # of discrepancies (for this quality level) and their percentage (of the aligned bases of the given quality), and the cumulative # of discrepancies and their percentage (of the cumulative no. of aligned bases). The regions at the ends of the sequence that consist entirely of quality 0 are reassigned quality values of -1, since these usually correspond to the lowest quality data at the extreme ends of reads. (Not done in the contig quality histogram.) An 'N' in either a read or the contig sequence is always counted as a substitution error. Depth 0 regions. Following this there is more specific information about possible problems with the assembly: Unaligned segments: parts of reads of length at least 10 that did not align to the contig sequence (the part of the read having quality -1 is not considered). High quality discrepancies: sites in the contig sequence for which at least one read with a quality at least qual_show disagrees with the contig sequence. Most errors in the contig sequence will be found either among these sites or in the regions where the contig quality is < qual_show. Low quality suspects: sites in the contig sequence of quality < qual_show (but greater than 0) for which there is a discrepant read of equal or greater quality. Such sites are particularly error-prone. The list gives the site position, base in the contig sequence and its quality, and quality of the discrepant read. [Formerly also printed out: for each of the two strands, the # of reads on that strand having a substitution, insertion or deletion (with respect to the contig sequence) at that position, and depth (# of reads whose alignments extend across that position). The site position is preceded by a '<' (resp. '=') if there is a read with a discrepant higher (resp. equivalent) quality base at that position.] There then is given a list of the regions that are not covered by "unique-reads" (i.e. reads with no non-rejected matches); and of regions that match (as detected by one or more non-rejected pairwise alignment between reads) other contigs or other regions in the same contig. These are likely to be either true near-perfect repeats or reflections of an error (of omission) in the assembly. To help distinguish these possibilities, such regions are designated spanned (if some read in the current contig extends all the way across it and includes sequence to either side of it; a list of all such reads is given) or "UNSPANNED" (if no such read exists). Misassembly errors of commission generally arise from true near-perfect unspanned repeats, and should result in an "UNSPANNED" flag for at least one of the two matching regions. However not all of them do -- while the fact that repeats exist should generally be detected by phrap, the full extent of the repeat may in some cases not be detected, and the apparent repeat may be "spanned" even though the full (true) repeat is not. Moreover in some cases of misassembly only one of the matching contig regions may be flagged as unspanned. Misassembly errors of omission should always result in "UNSPANNED" matching regions at the end of one or both contigs. Regions of a non-singlet contig matching an anomalous (chimeric or deleted) read are indicated only in the output for the singlet contig that contains the anomalous read. (ix) Following the contig information, there is a summary of the forward/reverse read consistency checks. For each pair of reads with the same clone name are given the contig no. and orientation within the contig (0 = top strand, 1 = bottom strand) for each read; an "L" (for "link") if the reads occur in different contigs; a '*' if there is an inconsistency (e.g. forward and reverse reads which are not pointing towards each other, or two forward reads which are not on the same strand); distance between read starts for forward/reverse pairs (should correspond to clone insert size, and be positive); and distance between starts for forward/forward or reverse/reverse pairs. N.B. IN ALL PLACES WHERE A POSITION IS GIVEN, THE POSITION IS WITH RESPECT TO THE UNPADDED SEQUENCE, NOT THE PADDED SEQUENCE. III) RUNNING cross_match. The command line should be of the form cross_match file_1 [file_2 ...] [-option value] [-option value] ... The sequence files should be in FASTA format. If there are at least two files, all sequences in the first file are compared to those in the second (and any subsequent) files; if there is only one file, all sequences in this file are compared to each other (N.B. at present when there is a single input file, entries are not compared to their own complements, but are compared to themselves in the same orientation and to both orientations of all other reasons). Any matches are printed out. Options are as described above. Output: (N.B. The following description is incomplete -- does not reflect all of the available options). The cross_match output lists matches (whose scores exceeding minscore) between any sequence in the first input file (the "query" sequences) and a sequence in the second (or later) input file (the "subject" sequences). By default (i.e. if allmatches is not set) a match is not shown if its domain is completely contained within the domains of higher scoring matches, where the "domain" of a match is the region in the query sequence that is defined by the alignment start and stop. In the output, matches are ordered by query, and for each query by position of the start of the alignment. For each such match, an initial output line gives summary information: Example: 440 2.38 1.39 0.79 hh44a1.s1 33 536 ( 0) C 00311 ( 3084) 8277 7771 * Here 440 = smith-waterman score of the match (complexity-adjusted, by default). 2.38 = %substitutions in matching region 1.39 = %deletions (in 1st seq rel to 2d) in matching region 0.79 = %insertions (in 1st seq rel to 2d) in matching region hh44a1.s1 = id of 1st sequence 33 = starting position of match in 1st sequence 536 = ending position of match in 1st sequence (0) = no. of bases in 1st sequence past the ending position of match (so 0 means that the match extended all the way to the end of the 1st sequence) C 00311 : match is with the Complement of sequence 00311 ( 3084) : there are 3084 bases in (complement of) 2d sequence prior to beginning of the match 8277 = starting position of match in 2d sequence (using top-strand numbering) 7771 = ending position of match in 2d sequence * indicates that there is a higher-scoring match whose domain partly includes the domain of this match. N.B. The following only appear if the appropriate option is selected. Following each match there appears a listing of the discrepancies between the aligned portions of the two sequences, giving for each discrepancy its type, position, nucleotide, and quality (in first sequence), and position in second sequence. This list is followed by a table giving, for each quality value, the total number of bases of that quality in the first sequence; the number which are included in the SWAT alignment; the cumulative no. of aligned bases (for this and higher qualities); the number of bases not included in the alignment (but potentially alignable -- i.e. not lying before the beginning or after the end of the second sequence); the number of discrepancies of each type (substitution, deletion, insertion); the total # of discrepancies (for this quality level) and their percentage (of the aligned bases of the given quality), and the cumulative # of discrepancies and their percentage (of the cumulative no. of aligned bases). This information is useful for computing accuracy as a function of quality, for automatically generated contig sequences. In the discrepancy listing and table, the regions at the ends of the sequence that consist entirely of quality 0 are reassigned quality values of -1, since these usually correspond to the lowest quality data at the extreme ends of reads. (Not done in the phrap output...) An 'N' in either sequence is always counted as a substitution error. [N.B. MUCH OF THE FOLLOWING IS NOW OBSOLETE]. Outline of phrap assembly: 0) Read in sequence & quality data, trim off any near-homopolymer runs at ends of reads (since this is likely to be garbage), construct read complements. 1) Find pairs of reads with matching words. Eliminate exact duplicate reads. Do swat comparisons of pairs of reads which have matching words, compute (complexity-adjusted) swat score. 2) Find probable vector matches (mark so they aren't used in assembly). 3) Find near duplicate reads. 4) Find reads with self-matches (probable tandem repeats). 5) Find matching read pairs that are "node-rejected" i.e. do not have "solid" segments. 6) Use pairwise matches to identify confirmed parts of reads; use these to compute revised quality values. 7) Compute LLR scores for each match (based on qualities of discrepant and matching bases). (Iterate above two steps). 8) Find best reads (highest LLR-scores among several overlapping). 9) Identify probable chimeric and deletion reads (withheld from assembly). 10) Construct contig layouts, using (non-rejected) pairwise matches in decreasing score order (greedy algorithm). Consistency of layout is checked at pairwise comparison level. 11) Construct contig sequence as mosaic of highest quality parts of the reads. 12) Align reads to contig; tabulate inconsistencies (read / contig discrepancies) & possible sites of misassembly. Adjust LLR-scores of contig sequence. Description of algorithms: 0) The procedure used in both phrap and cross_match to find sequence matches is the following. First, (in phrap only) any region at the beginning or end of a read that consists almost entirely of a single letter is converted to 'N's; such regions are highly likely to be of poor data quality which if not masked can lead to spurious matches. Reads are then converted to uppercase (in order to allow case-insensitive word matches). All matching words of length at least minmatch between any pair of sequences are then found, by (i) constructing a list of pointers to each position (in each sequence) that begins a word of at least minmatch letters not containing 'N' or 'X'; (ii) sorting the list (using a modified version of quicksort, with string comparison as the comparison function + a few tricks to improve speed by keeping track of the parts of the words that are known already to match); (iii) scanning the sorted list to find pairs of matching words. For each such pair, a band of a specified width (in the imaginary dot matrix for the two sequences), centered on the diagonal defined by the matching words, is defined. Overlapping bands (for the same pair of sequences) are merged. Following construction and merging of bands, a "recursive" SWAT search of each band is used to find matching segments with score greater than or equal to minscore: "recursive" here means that if such a match is found, the corresponding aligned segments in each sequence are (conceptually) X'd out and the process repeated on the remaining portions (if any) of each sequence. This procedure allows (at the cost of some redundant calculation) detection of multiple matching pieces in different locations, and will usually find most copies of repeats (since they generally occur in separate bands). By default, SWAT scores are complexity-adjusted (so that matches for which the collection of matching nucleotides has biassed composition have their scores significantly penalized.) 1) In phrap, sequence pairs with score >= minscore are considered for possible merging. A critical issue here is the appropriate score matrix for SWAT. [Given the generally high accuracy of reads and the desirability of minimizing false matches due to imperfect repeats, a relatively high penalty seems desirable. Currently I use +1 for a match, -9 for a mismatch involving A,C,G, or T, 0 for a match or mismatch involving N, -1 for a match or mismatch involving X, -11 for the gap initiating penalty (the first residue in a gap), and -10 for the gap extension penalty (each subsequent residue).] Setting the indel penalties slightly higher than the mismatch penalties gives better alignments by favoring mismatches in compression regions. Since SWAT uses profiles, one has the option of distinguishing different quality levels by use of different symbols (e.g. upper case for more accurate calls, lower case for less accurate calls) and setting the penalties appropriately; this would involve using the quality levels to adjust the symbols used in the reads, which should be done AFTER the word matching routine. (At present it does not seem particularly useful to use differing positive scores for different nucleotides to reflect their different frequencies). 1a) Determine "confirmed" part of each read (i.e. the part which appears in a SWAT alignment against some other read; a read with the same name -- up to an internal '.' if any -- is not considered confirming). Probable chimeras are detected as reads for which the confirmed part can be separated into two non-overlapping pieces, separated by at most MAX_CHIMERA_GAP bp (currently 30) such that the part confirmed by a given read lies in one or the other piece but not both; AND such that each piece has a prematurely terminating alignment with some other read. Chimeric reads may arise from i) chimeric clones; ii) gel mistracking across lanes; iii) unremoved sequencing vector; (iv) deletion clones. Reads having two non-overlapping confirmed pieces (but failing the "premature termination" condition) are often non-chimeras that are the only link between two non-overlapping contigs, and thus should be permitted in the assembly. Identify "strongly confirmed" regions in each read: having matches to a reverse sense read. Identify deletions. Identify "rejected" alignments -- those which don't extend as far as they should, or that have mismatches involving high quality bases. 2) Sort all matching pairs by decreasing LLR score, and assemble layout by progressively merging pairs with high score. Consistency of merge is required: the SWAT alignment implies relative offset of one contig with respect to another, and hence a potential implied overlap. The SWAT alignment(s) should extend over essentially the entire region of implied overlap, apart from an allowable gap (necessary to allow for the fact that the ends of the alignments are somewhat uncertain). Probable deletion clones (identified as reads which have two adjacent pieces matching two separated pieces of another read, and having no confirming read across the breakpoint), and chimeras are not used in any merges. [N.B. Following is obsolete: Potential chimeras are deferred to a second pass (as well as being flagged in output), so that true reads will assemble first.] 3) Construct contig "consensus" as a mosaic of individual reads. Strategy: ends of alignments, and midpoints of perfectly matching segments of sufficient length, define crosslinks - pursue crosslinks which increase accuracy and extend read. Formally this is done by constructing a weighted directed graph whose nodes consist of (selected) positions in reads; there are bidirectional edges with weight 0 between aligned bases in overlapping reads, and unidirectional edges from 5' to 3' positions within a single read, with weight equal to the total quality of the sequence between the two nodes. Standard C.S. algorithms (dating back to Tarjan) permit identification of a path with maximal weight, in time linear w.r.t. number of nodes. The quality values for the resulting sequence are inherited from the read segments of which it is composed.