High quality SNP calling using Illumina data at minimal coverage
Released Jun 09, 2009
More about this release…
- Get SliderII for all platforms
- Slider II 1.1
SliderII paperUnder review.
Original Slider paperMalhis et. al. 2008 - High Quality Alignment and SNP Calling for Illumina High Throughput Sequence Data
SliderII Results of SNPs concordance comparison to Maq
We used 68 lanes human whole genome shotgun sequencing, WGSS, PET data (real data), a total of 906 million reads sizes from 36 to 42 bases, using each of SliderII and MAQ, we aligned these reads to the human genome hg18 resulting a coverage of about 7.5 X. For each aligner, SNPs are sorted in descending order, using SNPs score for SliderII, and the Phred-like consensus quality score for MAQ. Concordance with Ensembl Variation 50 SNPs is used to compare SNPs calling accuracy.
Figure 1 show this concordance of top scored SNPs for SliderII without using known SNPs as priors (SliderII A line), and with known SNPs priors to be used in the alignment and SNP calling processes (SliderII B) and for MAQ at minimum coverage depths of 1, 2, and 3. In figure 1, while SliderII SNPs concordance is correlated with their scores, MAQ concordance dropped significantly for those SNPs with high quality score. We contributed that to SNPs generated as a result of paralogous alignments since MAQ dose not have the capability of filtering out such SNPs, so, in order to reduce those inaccurately called SNPs in MAQ output, we filtered out SNPs with coverage higher than 15, figure 2.
when we have a concordance of x%, this means that:
1- Our SNPs calling accuracy is at least x%. It is exactly x% when the sample data contain only known SNPs.
2- The SNPs in our sample data has at most (1-x)% novel SNPs. It is (1-x)% when we have no calling errors.
So, for example, if the sample data contain y% novel SNPs, (and y <= 1-x), novel SNPs accuracy is z%, and z = y/(1-x).
Slider is an application for the Illumina Sequence Analyzer output that uses the probability files instead of the sequence files as an input for alignment to a reference sequence or a set of reference sequences.
Motivation: A plethora of alignment tools have been created that are designed to best fit different types of alignment conditions. While some of these are made for aligning Illumina Sequence Analyzer reads, none of these are fully utilizing its probability (prb) output. In this paper, we will introduce a new alignment approach (Slider) that reduces the alignment problem space by utilizing each read base’s probabilities given in the prb files.
Results: Compared with other aligners, Slider has higher alignment accuracy and efficiency. In addition, given that Slider matches bases with probabilities other than the most probable, it significantly reduces the percentage of base mismatches. The result is that its SNPs predictions are more accurate than other SNPs prediction approaches used today that start from the most probable sequence, including those using base quality.
Slider consists of seven main commands and some utilities which are all java classes. Each of these commands/utilities requires some parameters that are listed in a configuration file. This is explained in further detail under "Running Slider".
These commands are described below:
- CreateDB2.java for generating a reference database table: This table is generated using a sliding window of size SZd, where every subsequence of size SZd from the reference and its reverse complement are included. Up to two undefined bases, noted as Ns in each subsequence, are allowed by substituting all four bases for every N and generating all possible sub-sequences. Finally, this database table is lexicographically sorted. Generating a reference database table needs to be done once for each reference sequence. For example, the human genome reference database table has approximately 6 billion records (counting both forward and reverse complement) that are sorted in lexicographical order. For the human genome, this table was generated in less than 24 hours on a single CPU and used about 300 GB of disk space. Once a reference database table is generated, it can be used for every alignment to that reference, and there is no need for it to be regenerated with each alignment.
- CreateSequences.java for generating the P0_Reads Table: Using prb values, we first generate read sequences Rps for each prb line as explained earlier. One unique id is given for all reads that are generated from the same prb line source. When a prb line has a number of non-crisp bases larger than some threshold value UDmax (a value of 11 is used), only the MPS (most probable sequence) is generated and is marked as a low quality, LQ. LQ sequences will not be used for SNP prediction. The table of all reads generated from all prb lines is then lexicographically sorted to form the P0_Reads table.
- Alignment.Java: Find read locations on the reference sequence with an exact match and one-off match (one base mismatch) to prb derived sequences.
- Extend.java: Expand reads to include up to 3 mismatches
- Separate.java: for each reference sequence, aligned reads are then separated into two file: a unique match file and a multiple match file.
- CallSNPs.java: Generates SNP and alignment information files.
- ScoreSNPs: Adds a single score in the range 0-100 (where 100 is highest score) that coorelates with SNPs accuracy by combining other scores .
Slider must first be compiled for your operating environment.
The package includes a set of sample data and a configuration file. Slider requires a configuration file to tell it where to look for the prb file(s) and the reference file and the location for storing results. In this example, there are two configuration files one for a single end tag alignment, T02, and one for a paired end tag alignment which will be used in this example:
It's contents include a parameter, a space, and the value. The location of this file is arbitrary and is specified on the command line as shown later when running Slider commands. In this example set, this file looks like:
seedSize 31 skip -1 ETag PET SNP_in 100000 DATASET BACPET/ DATABASE BACPET/ genome_dir data/seqDB/BACPET/ KnownSNPs data/seqDB/BACPET/ pathDB data/Database/ pathPRB data/PRBs/BACPET/ pathData data/OUT/ ch_name1 SEQ_VAR h03 ch_name2 .ref pathMulti pathParallelThis can be used as a template for future assemblies. A suggested format is included here. These parameters describe the SeedSize, dataset and database names, the folders containing the reference sequence in fasta format (genome_dir), database, input prb files, and output folders. We also provide information on the reference sequence files. For this sample dataset:
- The seed size is 31. The smaller the seed size is, the faster the alignment will by.
- skip sets the number of bases will be skipped from the start of each sequence. In this case, it is set to -1; if Slider believes that the first base is adapter sequence (based on it repeating), this base will be eliminated. Set to 0 tells Slider not to skip the first base and 1 or more tells it to skip the first number of bases specified.
- Its output will be placed in: data/OUT/ (this path needs to be created before running Slider)
- Its prb files are in: data/PRBs/BACPET/*/*prb*. Note that Slider requires a subdirectory (of any name) inside the PRBs/DATASET directory that will hold the prb files. This allows flexibility of using prb files with the same name by placing them in different subdirectories. Note that Slider can read compressed .gz format files.
- Its reference genome is in: data/seqDB/BACPET
- Its reference genome file is: h03.ref. This is a concatenation of SEQ_VAR and ch_name2. The parameter, ch_name1 would be place before SEQ_VAR if this is required. This allows for flexibility in defining a list of reference sequences to align to. For example, to align to two reference sequences these lines in the configuration file could look like:
ch_name1 chr SEQ_VAR 1 2 ch_name2 .fa
In this case, the alignment would occur on the reference files chr1.fa and chr2.fa which are located in the the folder that 'genome_dir' is set to.
- The reference genome database will be placed in: data/Database/ (this path needs to be created before running Slider)
- pathMulti and pathParallel are used in properties that are under development.
Once the parameters file for a dataset is
created, each command/utility can be used as:
java –cp . command parametersFile
For example, using the dataset provided, if we are using the reference
sequence for the first time, we need first to generate its database table. The full path name to the configuration file is required:
java -cp . -Xmx1G JavaCode/CreateDB2 ./input/BACPET
If we want to incorporate known SNPs in our database, we place a file of known SNPs called SNPs in the location pointed to by KnownSNPs in the configuration file as the above example shows.
The format of this file is:
"Reference ID" "Chromosome name" "Position on Chromosome (bp)" "Strand" "Allele" "Mapweight" "Ancestral allele" rs3 13 31344842 1 C/T 1 C rs4 13 31345222 1 A/G 1 A rs5 7 91677046 1 C/T 1 rs6 7 91585067 1 A/G 1 G
Then, to generate the P0_Reads table
using prb values for the dataset, we use the command CreateSequences as:
java -cp . -Xmx1G JavaCode/CreateSequences ./input/BACPET
To align these sequences to the reference sequence, we use Alignment and Extend:
java -cp . -Xmx1G JavaCode/Alignment ./input/BACPET java -cp . -Xmx1G JavaCode/Extend ./input/BACPET
And then we need to separate the alignment result into different
files associated with each reference sequence using the Separate command followed by Extend:
java -cp . -Xmx1G JavaCode/Separate ./input/BACPET
The last step provides alignment information and SNP calls:
java -cp . -Xmx1G JavaCode/CallSNPs ./input/BACPETTo score the SNP calls, use ScoreSNPs. By combining other scores, it reflects the prediction accuracy (in the range 0-100 where 100 is highest score):
java -cp . -Xmx1G JavaCode/ScoreSNPs ./input/BACPET
For viewing alignment information, this command will send the alignment results to output.txt:
java -cp . -Xmx1G JavaCode/display ./input/BACPET data/OUT/BACPET/Results/s.um.s.id/ n
This will display the first n reads of the alignment. These include unique matches or U0 or U1. The default is the first 10 lines and to display all, set n to -1.
0: h03 73027 F 1.0 37 1000000000 GQ MPS S0 UM U12 u0 217 GGCTGGGGCC ACGGGGAGCC GGAGGGAGGC GGAGGGG ZZ[[[\\]]] ^^^^^^^^^^ \\Q\\ZJZZU XXJXWXW 37 1: h03 73207 R 1.0 37 1000000000 GQ MPS S1 UM U11 u0 217 CCCGGTGCCC GCCGCGGAGC CCGCGCCCCC GGCCCCC WNWSIAUZZZ ZZ\V\T\^^^ ^^^^^^^]]] \\[[[ZZ 37 2: h03 66801 R 1.0 37 999999999 GQ MPS S0 UM U9 u1 226 GGGCTGCGAG CCCCTCcTCT CTGAGCACGG ACTCCTG NXIXOSXZZO ZZ\\\\\^^^ ^^^^^^^]]] \\[[[ZZ 37 <T17 (Q-28)> 3: h03 66612 F 1.0 37 999999999 Crisp MPS S1 UM U10 u0 226 CAGGCCTCTC GCCTTCTGTG GGGCAGGGCA GCCCCCA ZZ[[[\\]]] ^^^^^^^^^^ \\\\VZZZZT XPRXSTO 37 4: h03 66918 R 1.0 37 999999998 GQ MPS S0 UM U9 u0 SET CTCTGGCCAG CTGGACAGCT CTGCCCAAAT CTCAAAT WLWQNSXZZP ZZ\\\\\^^^ ^^^^^^^]]] \\[[[ZZ 37 5: h03 119779 F 1.0 37 999999998 LQ MPS S1 UM U9 u0 SET GCACAGGATT TCCCAGCTTC CCGCCCTTTT CCTCACa ZZ[G[\\C]] ^?^]WG^]OQ OVABGRTGTD RBQL;K< 37 C37 6: h03 61315 R 1.0 37 999999997 Crisp MPS S0 UM U15 u0 217 CCCTTGTTAC TGTCCCCGCC CCATGCCTGG GTGGTGG WXWXXXXZZZ ZZ\\\\\^^^ ^^^^^^^]]] \\[[[ZZ 37 7: h03 61135 F 1.0 37 999999997 Crisp MPS S1 UM U11 u0 217 CCAGAGTCCT GGAGAGCTTC CAACTCTGGC AGGGCCT ZZ[[[\\]]] ^^^^^^^^^^ \\\\\ZZZZZ XXXXWXW 37 8: h03 53404 F 1.0 37 999999995 Crisp MPS S0 UM U10 u0 SET TGAGGAGAGG TCAGAGAGCC AGCCCACCTC GTCGTGA ZZ[[[\\]]] ^^^^^^^^^^ \\\\\ZZZZZ XPOXRXN 37 9: h03 101674 F 1.0 37 999999994 Crisp MPS S0 UM U12 u0 217 CTGTGGGGAG GGCTCGAGGC ACCGACTCAC ACTCCTG ZZ[[[\\]]] ^^^^^^^^^^ \\\\\ZZZZZ XXXXWXW 37
Above is the information for the first 10 matches in this alignment. The columns are as follows:
1. Read id
2. Reference name that read aligned to
3. Reference coordinate/position of alignment start
4. Forward or reverse strand
5. Weight of read (see paper for further details)
6. Read length
7. Alignment id
8. Quality of read ("Crisp", "LQ" for low quality, or "GQ" - see paper for further details)
9. MPS (most probable sequence) or RS
10. Read end (if a PET run, can be either S0 or S1)
11. UM = unique match or MM=multiple match
12. How long read must be to be unique (short will be more accurate)
13. Base mismatch - u0 (no mismatch bases) or u1 (1 base mismatch)
14. SET (single end tag) or a number if PET (fragment size = gap + length of both reads)
15. Sequence of aligned read
16. Quality string - quality of read (qcal)
16. Length of read
17. Information on mismatch bases, blank if there are none
To display the first n reads from the oligo input table (before alignment)
java -cp . -Xmx1G JavaCode/display ./input/BACPET data/OUT/BACPET/s.ol0/ n
If n is larger than the number of reads in the oligo table, the full
oligo table is displayed. This can be used to convert an oligo table to a text
format. The contents of this file is sorted lexographically.
988: 1.0 31 999985733 Crisp MPS S1 u1 AAAAGTCAGA GGCCACCAAA GTCCGTTTCC C 989: 1.0 31 999983044 Crisp MPS S1 u1 AAAAGTCAGA GGCCACCAAA GTCCGTTTCC C 990: 0.01 31 999997594 GQ PS S0 u1 AAAAGTCGAG GAGGTGCTGC ATCCACCACA G 991: 0.0 31 999997594 GQ PS S0 u1 AAAAGTCGAG GAGGTGCTGC GTCCACCACA G 992: 1.0 31 999982737 Crisp MPS S0 u1 AAAAGTGACT CTGATGATGA GGAGTACAAG G 993: 0.0 31 999996814 GQ PS S0 u1 AAAAGTGCCC TTGCCTTCTC TATCCTCTTC A 994: 0.01 31 999996814 GQ PS S0 u1 AAAAGTGCCC TTGCCTTCTC TCTCCTCCTC A 995: 0.12 31 999996814 GQ PS S0 u1 AAAAGTGCCC TTGCCTTCTC TCTCCTCTTC A 996: 0.01 31 999996814 GQ PS S0 u1 AAAAGTGCCC TTGTCTTCTC TATCCTCTTC A 997: 0.03 31 999996814 GQ PS S0 u1 AAAAGTGCCC TTGTCTTCTC TCTCCTCCTC A 998: 0.38 31 999996814 GQ MPS S0 u1 AAAAGTGCCC TTGTCTTCTC TCTCCTCTTC A 999: 0.15 31 999987179 LQ PS S1 u1 AAAAGTGTAC ACATGTGGGC AAAAATACATAbove is extracted from the sample oligo table.
An unfiltered list of SNPs produced by SNPs is located at: data/OUT/BACPET/Results/h03.ref.snps. For this dataset, 290 SNPs were found (first ten displayed below).
1 11652 A G 5 5 4 65 28 G 29 G 0 0 100 0 Novel H1 2 12431 T C 10 10 5 29 30 C 30 C 0 100 0 0 Novel H1 3 12575 G T 13 13 8 57 30 T 30 T 0 0 0 100 Novel H1 4 13268 A G 4 4 3 28 27 G 27 G 0 0 100 0 Novel H1 5 13375 C G 15 15 7 49 30 G 30 G 0 0 100 0 Novel H1 6 13792 A G 27 27 11 53 30 G 30 G 0 0 100 0 Novel H1 7 13829 A G 10 10 5 66 30 G 30 G 0 0 100 0 Novel H1 8 14427 A G 11 11 5 36 30 G 30 G 0 0 100 0 Novel H1 9 14917 A M 3 3 5 51 29 A 29 A 10 90 0 0 Novel H2 10 15181 A C 17 17 7 46 30 C 30 C 0 100 0 0 Novel H1
The filtered/scored list of SNPs is located at: data/OUT/BACPET/Results/filter/h03.ref.snps which is the same formatting as above but with a SNP Score value at the beginning of each line. The previous list has been filtered down to 270 SNPs (first ten displayed below):
19 1 11652 A G 5 5 4 65 28 G 29 G 0 0 100 0 Novel H1 39 2 12431 T C 10 10 5 29 30 C 30 C 0 100 0 0 Novel H1 43 3 12575 G T 13 13 8 57 30 T 30 T 0 0 0 100 Novel H1 19 4 13268 A G 4 4 3 28 27 G 27 G 0 0 100 0 Novel H1 74 5 13375 C G 15 15 7 49 30 G 30 G 0 0 100 0 Novel H1 79 6 13792 A G 27 27 11 53 30 G 30 G 0 0 100 0 Novel H1 34 7 13829 A G 10 10 5 66 30 G 30 G 0 0 100 0 Novel H1 58 8 14427 A G 11 11 5 36 30 G 30 G 0 0 100 0 Novel H1 6 9 14917 A M 3 3 5 51 29 A 29 A 10 90 0 0 Novel H2 86 10 15181 A C 17 17 7 46 30 C 30 C 0 100 0 0 Novel H1Above is the information for the first 10 matches in this alignment. The columns are as follows:
1. Score - generated by ScoreSNPs by combining other scores; it reflects the prediction accuracy (in the range 0-100 where 100 is highest score)
6. Score1 - this score is based on the accumulated probability of the coverage and the expected percentage of SNPs in the reference genome.
7. Score2 - this is the same as score1 but takes into account a priori knowledge of SNPs already identified; if the SNP database was not built with a list of known SNPs, this value will be the same as Score1
9. Average location of the SNP in the reads
10. This is used by ScoreSNPs
11. This is used by ScoreSNPs
12. This is used by ScoreSNPs
13. This is used by ScoreSNPs
14. Accumulated probability of base A
15. Accumulated probability of base C
16. Accumulated probability of base G
17. Accumulated probability of base T
18. If using known SNPs as prior knowledge, this value is Known if SNP is in the database and otherwise if it is not in the database or not using a SNP database this is listed as Novel.
19. H1 - homozygous, H2 - heterozygous