
SSAKE 2.0 (Beta release) (Oct 23, 2007)

This is not a final release. Experimental releases should only be used for testing and development. Do not use these on production sites, and make sure you have proper backups before installing.

SSAKE can now handle error-rich [short sequence] data sets. For each seed sequence or contig being extended, SSAKE looks through the entire overlapping k-mer space and generates a consensus sequence from the overhanging bases. It then extends the contig using that consensus, provided the bases it comprises pass user-defined thresholds.

For additional information about this project, please visit the overview page.

Available downloads

ssake_v2-0.tar.gz

For Linux (2.8 MB)

Release Notes

State: Beta release
License: GPL
Release Manager: Rene Warren

SSAKE is implemented in Perl & Python (v 2.0) and runs on Linux.
Error handling by deriving an overhang consensus from overlapping reads is an original idea by William Jeck & Josie Reinhardt (VCAKE v1.0, William Jeck, May 2007); it was implemented in SSAKE from scratch.

A side-by-side comparison of SSAKE 2.0 and VCAKE 1.0 indicates that SSAKE's assembly algorithm is nearly 3-fold faster and yields contigs that are as contiguous and accurate.

Due to SSAKE's memory requirements, you need a version of the perl interpreter compiled for 64-bit computers if you intend to cluster millions of short sequences.
SSAKE was developed using perl v5.8.5 built for x86_64-linux-thread-multi.

You can cluster ~5 million 25-mers with SSAKE on a computer with 4GB RAM
You can cluster 60-80 million 25-mers with SSAKE on a computer with 32GB RAM


How SSAKE works
-----------------------

Short DNA sequences of length l in a single multi-fasta file (-f) are read into memory, populating a hash table keyed by unique sequence reads, with values representing the number of occurrences of each sequence in the input read set.  The normalized sequence reads are sorted by decreasing abundance (number of times the sequence is repeated) to reflect coverage and minimize extension of reads containing sequencing errors.  Reads with sequencing errors are more likely to be unique in the entire read set than their error-free counterparts.

Sequence assembly is initiated by generating the longest 3'-most word (k-mer) from the unassembled read u that is shorter than the sequence read length l.  Every possible 3'-most k-mer is generated from u and used in turn for the search, until the word length is smaller than a user-defined minimum, m.  Meanwhile, all perfectly overlapping reads are collected in an array and further considered for 3' extension once the k-mer search is done.  At the same time, a hash table c stores every base, along with a coverage count, for every position of the overhang (the stretch of bases hanging off the seed sequence u).
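The read-counting, sorting and k-mer generation steps described above can be sketched in a few lines of Python. This is only an illustration, not SSAKE's actual code: the function names are hypothetical, the input is assumed to hold one sequence line per FASTA record, and ties in abundance are broken alphabetically purely to make the output deterministic.

```python
from collections import Counter

def load_read_counts(fasta_lines):
    """Count how often each unique read occurs (one sequence line per record assumed)."""
    counts = Counter()
    for line in fasta_lines:
        line = line.strip()
        if line and not line.startswith(">"):  # skip blank lines and FASTA headers
            counts[line.upper()] += 1
    return counts

def seeds_by_abundance(counts):
    """Unique reads, most abundant first; alphabetical tie-break is an assumption."""
    return sorted(counts, key=lambda read: (-counts[read], read))

def three_prime_kmers(seed, read_length, min_overlap):
    """Yield the 3'-most words of the seed, from length l-1 down to m."""
    longest = min(read_length - 1, len(seed))
    for k in range(longest, min_overlap - 1, -1):
        yield seed[-k:]

fasta = [">r1", "ACGTACGT", ">r2", "CGTACGTT", ">r3", "ACGTACGT"]
counts = load_read_counts(fasta)
seeds = seeds_by_abundance(counts)
kmers = list(three_prime_kmers(seeds[0], read_length=8, min_overlap=5))
print(seeds)  # ['ACGTACGT', 'CGTACGTT']
print(kmers)  # ['CGTACGT', 'GTACGT', 'TACGT']
```

The duplicated read becomes the first seed because it is the most abundant, and its 3' words are tried from longest to shortest, stopping at the minimum overlap m.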

Once the search is complete, a consensus sequence is derived from the hash table c, taking the most represented base at each position of the overhang.  To be considered for the consensus, each base has to be covered by at least a user-defined number of reads, -o (set to 2 by default).  If there is a tie (two bases at a specific position have the same coverage count), the prominent base is below a user-defined ratio r, the coverage -o is too low, or the end of the overhang is reached, the consensus extension terminates and the consensus overhang is joined to the seed sequence/contig.  All overlapping reads are searched against the newly formed sequence and, if found, are removed from the hash table and prefix tree.  If they are not part of the consensus, they will be used to seed/extend other contigs, if applicable.  When overlapping reads do match the newly formed contig, the extension resumes using the new l-m space to search for joining k-mers.  If no overlapping reads match it, the extension is terminated from that end and SSAKE resumes with a new seed, which prevents infinite looping through low-complexity DNA sequences.
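As a rough illustration of the consensus step, the sketch below tallies bases per overhang position (playing the role of hash table c) and stops at the first position that fails one of the tests. It is a simplified reading of the description, not SSAKE's implementation: the function name is hypothetical, coverage is assumed to mean the total number of reads covering a position, and the ratio value used in the example is arbitrary.

```python
from collections import Counter

def overhang_consensus(overhangs, min_ratio, min_coverage=2):
    """Build a consensus from overhang strings; stop at the first failing position."""
    columns = []  # stand-in for table c: one base -> count tally per position
    for overhang in overhangs:
        for pos, base in enumerate(overhang):
            if pos == len(columns):
                columns.append(Counter())
            columns[pos][base] += 1

    consensus = []
    for tally in columns:
        ranked = tally.most_common(2)
        top_base, top_count = ranked[0]
        total = sum(tally.values())
        if total < min_coverage:                         # coverage too low (-o)
            break
        if len(ranked) > 1 and ranked[1][1] == top_count:  # tie between two bases
            break
        if top_count / total < min_ratio:                # prominent base below ratio r
            break
        consensus.append(top_base)
    return "".join(consensus)

example_overhangs = ["ACG", "ACT", "AC", "AG"]
print(overhang_consensus(example_overhangs, min_ratio=0.6))  # AC
print(overhang_consensus(example_overhangs, min_ratio=0.8))  # A
```

With a ratio of 0.6 the second position passes (C is seen in 3 of 4 reads) and the third position ends the extension because G and T are tied. Raising the ratio to 0.8 stops the consensus one base earlier.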

The process of progressively cycling from longer to shorter 3'-most k-mers is repeated after every sequence extension until nothing else can be done on that side.  Since only left-most searches are possible with a prefix tree, when all possibilities have been exhausted for the 3' extension, the complementary strand of the contiguous sequence generated is used to extend the contig on the 5' end.  The DNA prefix tree is used to limit the search space by segregating sequence reads and their reverse-complemented counterparts by their first eleven 5' end bases.
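The idea of bucketing reads and their reverse complements by their first eleven bases can be sketched as follows. A flat dictionary keyed by the 11-base prefix is used here in place of a real prefix tree, which is a simplification; the function names are hypothetical, and the lookup assumes the query k-mer is at least eleven bases long.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
PREFIX_LENGTH = 11  # first eleven 5' bases, as described in the text

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def build_prefix_index(reads):
    """Bucket each read and its reverse complement by its 5' prefix."""
    index = defaultdict(set)
    for read in reads:
        for oriented in (read, reverse_complement(read)):
            index[oriented[:PREFIX_LENGTH]].add(oriented)
    return index

def reads_starting_with(index, kmer):
    """Reads whose 5' end matches kmer exactly (kmer of 11 or more bases assumed)."""
    bucket = index.get(kmer[:PREFIX_LENGTH], ())
    return sorted(read for read in bucket if read.startswith(kmer))

reads = ["ACGTACGTACGTTTGA", "ACGTACGTACGAAAAC"]
index = build_prefix_index(reads)
hits = reads_starting_with(index, "ACGTACGTACGT")
print(hits)                        # ['ACGTACGTACGTTTGA']
print(reverse_complement("AACG"))  # CGTT
```

Because only the 5' end of a read can be looked up this way, extending a contig at its 5' end amounts to reverse-complementing the contig and repeating the same 3' search on the result.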

There are three ways to control the stringency in SSAKE:
1) Disallow read/contig extension if the coverage is too low (-o).  Higher -o values lead to shorter contigs but minimize sequence misassemblies.
2) Adjust the minimum overlap -m allowed between any two sequence k-mers.  Higher -m values lead to more accurate contigs at the cost of decreased contiguity.
3) Set the minimum base ratio -r to higher values.

Change log