Slider is an application for the Illumina Sequence Analyzer output that uses the probability files instead of the sequence files as an input for alignment to a reference sequence or a set of reference sequences.
Motivation: A plethora of alignment tools have been created that are designed to best fit different types of alignment conditions. While some of these are made for aligning Illumina Sequence Analyzer reads, none of these are fully utilizing its probability (prb) output. In this paper, we will introduce a new alignment approach (Slider) that reduces the alignment problem space by utilizing each read base’s probabilities given in the prb files.
Results: Compared with other aligners, Slider has higher alignment accuracy and efficiency. In addition, given that Slider matches bases with probabilities other than the most probable, it significantly reduces the percentage of base mismatches. The result is that its SNPs predictions are more accurate than other SNPs prediction approaches used today that start from the most probable sequence, including those using base quality.
Slider consists of five main commands and some utilities which are all java classes. Each of these commands/utilities requires some parameters that are listed in a configuration file. This is explained in further detail under "Running Slider".
The five main commands are:
- CreateDB.java for generating a reference database table: This table is generated using a sliding window of size SZd, where every subsequence of size SZd from the reference and its reverse complement are included. Up to two undefined bases, noted as Ns in each subsequence, are allowed by substituting all four bases for every N and generating all possible sub-sequences. Finally, this database table is lexicographically sorted. Generating a reference database table needs to be done once for each reference sequence. For example, the human genome reference database table has approximately 6 billion records (counting both forward and reverse complement) that are sorted in lexicographical order. For the human genome, this table was generated in less than 24 hours on a single CPU and used about 160 GB of disk space. Once a reference database table is generated, it can be used for every alignment to that reference, and there is no need for it to be regenerated with each alignment.
- CreateSequences.java for generating the P0_Reads Table: Using prb values, we first generate read sequences Rps for each prb line as explained earlier. One unique id is given for all reads that are generated from the same prb line source. When a prb line has a number of non-crisp bases larger than some threshold value UDmax (a value of 11 is used), only the MPS (most probable sequence) is generated and is marked as a low quality, LQ. LQ sequences will not be used for SNP prediction. The table of all reads generated from all prb lines is then lexicographically sorted to form the P0_Reads table.
- Alignment.Java: Find read locations on the reference sequence with an exact match and one-off match (one base mismatch) to prb derived sequences.
- Separation.java: for each reference sequence, aligned reads are then separated into two file: a unique match file and a multiple match file.
- SNPsPrediction.java: For each reference sequence, a SNP prediction table, a coverage information file, and a set of coverage histogram files are created from the unique matches file of a reference sequence.
Slider must first be compiled for your operating environment.
The package includes a set of sample data and a configuration file. Slider requires a configuration file to tell it where to look for the prb file(s) and the reference file and the location for storing results. In this example, the configuration file is found in:
It's contents include a parameter, a space, and the value. The location of this file is arbitrary and is specified on the command line as shown later when running Slider commands. In this example set, this file looks like:
oMxSize 27 skip 0 ETag SET DATASET T02/ DATABASE T02/ genome_dir data/seqDB/T02/ pathDB data/Database/ pathPRB data/PRBs/ pathData data/OUT/ ch_name1 SEQ_VAR T02_ref ch_name2 .fa pathMulti pathParallel
This can be used as a template for future assemblies. A suggested format is included here. These parameters describe the read length (oMxSize), dataset and database names, the folders containing the reference sequence (genome_dir), database, input prb files, and output folders. We also provide information on the reference sequence files. For this sample dataset:
- Each read is 27 bases, if the prb files have lines for more than 27 bases, only the first 27 bases will be used. No starting bases will be skipped ('skip' is set to 0).
- Its output will be placed in: /data/OUT/T02
- Its prb files are in: /data/PRBs/T02/*/*prb*. Note that Slider requires a subdirectory (of any name) inside the PRBs/dataset directory (T02 in this case) that will hold the prb files. This allows flexibility of using prb files with the same name.
- Its reference genome is in: /data/seqDB/T02/
- Its reference genome file is: T02_ref.fa. This is a concatenation of SEQ_VAR and ch_name2. The parameter, ch_name1 would be place before SEQ_VAR if this is required. This allows for flexibility in defining a list of reference sequences to align to. For example, to align to two reference sequences these lines in the configuration file could look like:
ch_name1 chr SEQ_VAR 1 2 ch_name2 .fa
In this case, the alignment would occur on the reference files chr1.fa and chr2.fa which are located in the the folder that 'genome_dir' is set to.
- The reference genome database will be placed in: /data/Database/T02/
- pathMulti and pathParallel are used in properties that are under development.
Once the parameters file for a dataset is created, each command/utility can be used as:
java –cp . command parametersFile
For example, using the dataset provided, if we are using the reference sequence for the first time, we need first to generate its database table:
java -cp . -Xmx1G JavaCode/CreateDB input/T02
Then, to generate the P0_Reads table using prb values for a dataset T02, then we use the command CreateSequences as:
java -cp . -Xmx1G JavaCode/CreateSequences input/T02
To align these sequences to the reference sequence, we use Alignment as:
java -cp . -Xmx1G JavaCode/Alignment input/T02
And then we need to separate the alignment result into different files associated with each reference sequence using the Separate command:
java -cp . -Xmx1G JavaCode/Separate input/T02
And finally SNPs prediction can be done using:
java -cp . -Xmx1G JavaCode/SNPsPrediction input/T02
This will show homozygous SNPs identified with a minimum coverage of 2 (nmCount).
SNP calling parameters can be adjusted as follows:
java –cp . -Xmx1G JavaCode/SNPsPrediction input/T02 [[c] t]
where c is the minimum nmCount coverage for SNPs <default is 2>
and t can be either 1 or 2 <default is 1>:
If t=1: Slider will search for homozygous SNPs only.
If t=2: Slider will search for homozygous and heterozygous SNPs.
For viewing alignment information, this command will send the alignment results to output.txt:
java -cp . -Xmx1G JavaCode/display input/T02 data/OUT/T02/Results/T02_ref.fa.um.s.id/ n
This will display the first n reads of the alignment. These include unique matches or U0 or U1. The default is the first 10 lines and to display all, set n to -1.
0: 7 3013629 F 1.0 36 986311278 Crisp MPS S0 UM U28 u0 SET CAAACTAACC AGCTTCACTA GGCTCCTGAC TACCTG ZZ[[[\\]]] ^^^^^^^^^^ \\\\\ZZZZZ XTTTXW 36 1: 7 3013629 F 1.0 36 978646192 Crisp MPS S0 UM U28 u0 SET CAAACTAACC AGCTTCACTA GGCTCCTGAC TACCTG ZZ[[[\\]]] ^^^^^^^^^^ \\\\\ZZZZZ XTTTXW 36 2: 7 3014948 R 1.0 36 989971955 Crisp MPS S0 UM U25 u0 SET GCCAAGGGAA ATTTTGAAGA TGTGTTGAAA AATAAT WXTTTXZZZZ Z\\\\\^^^^ ^^^^^^]]]\ \[[[ZZ 36 3: 7 3014948 R 1.0 36 982306869 Crisp MPS S0 UM U25 u0 SET GCCAAGGGAA ATTTTGAAGA TGTGTTGAAA AATAAT WXTTTXZZZZ Z\\\\\^^^^ ^^^^^^]]]\ \[[[ZZ 36 4: 7 3017750 R 1.0 36 997039988 GQ PS S0 UM U21 u0 SET AGAAAGAAaa ATATAAAAAT GCTAATATTG TGTATC DXTTTEZZAD ZJ\\\\^^F] ^^B^^M]]]\ \[[[ZZ 36 G9 T10 5: 7 3017856 F 1.0 36 993872057 Crisp MPS S0 UM U14 u0 SET AACGGCCCAT TTCTATCATC CAACCAAATG ATCTTT ZZ[[[\\]]] ^^^^^^^^^^ \\\\\ZZZZZ XTTTXL 36 6: 7 3020029 R 1.0 36 988863573 Crisp MPS S0 UM U16 u1 SET ATTTTAGGAg TGTGTTTATT TCTGAGACGT CTCCTT WXDTTXZZZZ Z\\\\\^^^^ ^^^^^^]]]\ \[[[ZZ 36 <A10 (Q-26)> 7: 7 3020029 R 1.0 36 981198487 Crisp MPS S0 UM U16 u1 SET ATTTTAGGAg TGTGTTTATT TCTGAGACGT CTCCTT WXDTTXZZZZ Z\\\\\^^^^ ^^^^^^]]]\ \[[[ZZ 36 <A10 (Q-26)> 8: 7 3021071 R 1.0 36 992052084 GQ MPS S0 UM U17 u0 SET TAGCCTCACA AGAGACAGTC ATATCTGGGT CCCTTC WXTTTXZTZZ Z\VJF\^^^^ ^^J^^O]]]\ \[[[ZZ 36 9: 7 3021085 R 1.0 36 984649984 Crisp MPS S0 UM U31 u0 SET cCAGcCATAT CTGGGTCCCT TCAGCAAAAT CTTGCT IXTTTXZZZZ Z\\\\\^^^^ ^^^^^^]]]\ \[[[ZZ 36 A1 <T5 (Q-24)>
Above is the information for the first 6 matches in this alignment. The columns are as follows: read id, reference aligned to, reference coordinate of alignment start, forward or reverse strand, weight of read (see paper for further details), read length, alignment id, quality of read ("Crisp", "LQ" for low quality, or "GQ" - see paper for further details), MPS (most probable sequence) or RS, read end (if a PET run, can be either S0 or S1), SET (single end tag) or PET (paired end tag), UM = unique match or MM=multiple match, U0 (no mismatch bases) or U1 (1 base mismatch; a number follows with the location of the mismatch), sequence of aligned read.
To display the first n reads from the oligo input table (before alignment) s.ol0:
java -cp . -Xmx1G JavaCode/display input/T02 data/OUT/T02/s.ol0/ n
If n is larger than the number of reads in the oligo table, the full oligo table is displayed. This can be used to convert an oligo table to a text format. The contents of this file is sorted lexographically.
994: 0.23 27 999998827 GQ RS S0 AATCGCTTGA ACCTGAGAGG TGGAAGA 995: 0.74 27 999998827 GQ MPS S0 AATCGCTTGA ACCTGAGAGG TGGAAGT 996: 0.97 27 999999582 Crisp MPS S0 AATCTCCCAG CGTGGTTTAA TCAGACG 997: 0.0 27 999996886 GQ RS S0 AATCTCCCCA GTGACACTTC TCCAAGG 998: 0.05 27 999996886 GQ RS S0 AATCTCCCCA GTGACACTTC TCCCAGG 999: 0.04 27 999996886 GQ RS S0 AATCTCCCCA GTGTCACTTC TCCAAGG
Above is extracted from the sample oligo table.
Final list of SNPs produced by SNPsPrediction is located at: data/OUT/T02/Results/ScoreResults/T02_ref.fa.snp. For this small test dataset, eight SNPs were found.
84 15698 197088613 G A 8 8 4 44 30 A 100 0 0 0 Novel H1 3 15699 197091993 G T 0 0 2 8 0 - 0 0 0 100 Novel H1 50 15700 197104764 A C 2 2 2 0 26 C 0 100 0 0 Novel H1 13 15701 197104822 C T 1 1 2 92 0 - 0 0 0 100 Novel H1 77 15702 197112635 C A 2 2 2 81 26 A 100 0 0 0 Novel H1 77 15703 197140911 T G 2 2 2 64 28 G 0 0 100 0 Novel H1 29 15704 197141740 G C 1 1 2 89 0 - 0 100 0 0 Novel H1 13 15705 197170288 A G 1 1 2 92 0 - 0 0 100 0 Novel H1 13 15706 197190959 C T 1 1 2 92 0 - 0 0 0 100 Novel H1
Please see SNPs Table for a description of this file.
Expanded Table 3 (from the Slider paper) is available which includes extended information on the alignment accuracy of Slider compared to Eland and RMAP.
Two larger datasets are included here for Slider testing:
Released Nov 04, 2008
This is the first release of Slider. A seed-based release of Slider will be available at the end of November.
|0.6||Nov 04, 2008||This is the first release of Slider. A seed-based release of Slider will be available at the end of November||AFL||final|