SSAKE
SSAKE is a genomics application for assembling millions of very short DNA sequences.
Current release
SSAKE 3.5
Released May 28, 2010
v3.5+ uses mate pairs at run time to help resolve repeats (preventing contig misassemblies) and attempts to force-fill gaps with redundant sequences (improving contiguity and repeat resolution).
Get SSAKE for all platforms (2.2 MB): ssake_v3-5-tar.gz
Project Description
The Short Sequence Assembly by K-mer search and 3' read Extension (SSAKE) is a genomics application for aggressively assembling millions of short nucleotide sequences by progressively searching for perfect 3'-most k-mers using a DNA prefix tree. SSAKE is designed to help leverage the information from short sequence reads by stringently clustering them into contigs that can be used to characterize novel sequencing targets.
*Best performance is achieved by quality-trimming your reads before assembly using TQS.py (part of the standard SSAKE release).
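To illustrate what quality trimming means in general, here is a minimal Python sketch. It is not the TQS.py implementation: the function name `trim_read` and its parameters `min_quality` and `min_length` are hypothetical, and the rule used (keep the longest run of consecutive bases at or above a quality threshold) is an assumption chosen for simplicity.

```python
def trim_read(seq, quals, min_quality=10, min_length=25):
    """Return the longest stretch of consecutive bases whose quality is
    >= min_quality, or None if that stretch is shorter than min_length.

    Hypothetical helper for illustration only; not part of SSAKE or TQS.py.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have the same length")

    best_start, best_len = 0, 0   # longest passing run seen so far
    run_start, run_len = 0, 0     # passing run currently being extended
    for i, q in enumerate(quals):
        if q >= min_quality:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0           # a low-quality base ends the current run

    if best_len < min_length:
        return None               # read is discarded
    return seq[best_start:best_start + best_len]
```

Reads for which the function returns `None` would simply be left out of the assembly input.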
Enjoy SSAKE responsibly!
Summary
SSAKE is written in Perl and runs on Linux. SSAKE cycles through short sequence reads stored in a hash table and progressively searches through a prefix tree for the longest possible identical overlap between any two sequences. The algorithm was used to assemble 25-36 bp sequence reads from viral, bacterial and fungal genomes, as well as 40 million 25-mers simulated from the whole-genome shotgun (WGS) sequence data of the Sargasso Sea metagenomics project. Considering the number of sequences to assemble, SSAKE is robust and tractable.
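The idea of extending a sequence in the 3' direction by looking up its 3'-most k-mers among read prefixes can be sketched in Python as follows. This is a simplified illustration, not SSAKE's Perl code: a plain dictionary stands in for the prefix tree, reverse complements and read multiplicity are ignored, and extension simply stops when more than one read matches. The function names and the `min_overlap` parameter are assumptions made for the example.

```python
from collections import defaultdict


def build_prefix_index(reads, min_overlap):
    """Map each proper prefix of length >= min_overlap to the reads starting with it."""
    index = defaultdict(set)
    for read in reads:
        for k in range(min_overlap, len(read)):
            index[read[:k]].add(read)
    return index


def extend_3prime(seed, reads, min_overlap=16):
    """Greedily extend `seed` at its 3' end using exact suffix/prefix overlaps."""
    index = build_prefix_index(reads, min_overlap)
    unused = set(reads) - {seed}
    longest = max((len(r) for r in reads), default=0)
    contig = seed
    extended = True
    while extended and unused:
        extended = False
        # Try the longest possible 3'-most k-mer first, then shorter ones.
        for k in range(min(len(contig), longest - 1), min_overlap - 1, -1):
            candidates = index.get(contig[-k:], set()) & unused
            if len(candidates) > 1:
                return contig          # ambiguous overlap: stop extending
            if candidates:
                read = candidates.pop()
                contig += read[k:]     # append the non-overlapping 3' part
                unused.remove(read)
                extended = True
                break
    return contig
```

For example, with the toy reads `["ACGTAC", "GTACGG", "ACGGTT"]`, the seed `"ACGTAC"` and `min_overlap=3`, the sketch chains the three reads into a single sequence.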
LEFT: 25-mers generated at random from Sargasso Sea genome shotgun Sanger reads were assembled by SSAKE v1.0 using -m 16. Benchmarking was done on a 2.2GHz AMD Opteron computer with 4GB RAM and a 1.4GHz AMD Opteron computer with 32GB RAM.
RIGHT: A bacterial genome in 299 contigs, assembled with SSAKE in ~2 h on a Mac mini, 2.0GHz Intel Core Duo with 2GB RAM (30 min on a 2.4GHz Opteron with 32GB RAM). Showing SSAKE v3.2 -t 2 and -t 0 contigs (blue and red rectangles, respectively) aligning to the Helicobacter acinonychis strain Sheeba genome. Illumina sequence data obtained from Dohm et al. (2007) were quality-trimmed with TQS.py (-t 10 -d 10 -c 25 -l 36) and the 5.7M remaining unpaired reads were assembled with SSAKE (v3.2 -m 16 -o 3 -r 0.7 -t 0 or 2). Repeats and low-complexity regions were identified with cross_match (-minmatch 30 -minscore 60) and RepeatMasker and are shown in green and orange on the circos plot, respectively. SSAKE v3.2 -t 2 yielded 299 contigs >= 100 nt, contig N50 length = 9,961 bp, Max contig = 49.5 kbp, Mean contig = 5,122 bp, Accuracy = 99.95%.
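The contig statistics quoted above (count, N50, maximum and mean length for contigs of at least 100 nt) follow standard definitions and can be computed as in the Python sketch below. The function names and the `min_length` parameter are assumptions for illustration; this is not a script shipped with SSAKE.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of the total bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


def contig_stats(lengths, min_length=100):
    """Summarize contigs at least min_length long (hypothetical helper)."""
    kept = [n for n in lengths if n >= min_length]
    if not kept:
        raise ValueError("no contigs pass the length cutoff")
    return {
        "count": len(kept),
        "n50": n50(kept),
        "max": max(kept),
        "mean": sum(kept) / len(kept),
    }
```

The input is just a list of contig lengths, for example collected while reading the assembled contig file.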
Documentation
René L Warren, Granger G Sutton, Steven JM Jones, Robert A Holt. 2007 (epub 2006 Dec 8). Assembling millions of short DNA sequences using SSAKE. Bioinformatics. 23:500-501.
- The PowerPoint poster shown above was presented at the Pacific Symposium on Biocomputing (January 2008) and serves as an introduction to the algorithm and key concepts behind SSAKE.
- Still don't know if SSAKE is right for you? These PDF slides should give you some idea of potential applications and of how to run SSAKE to obtain optimal results and make the most of your data.
License
Copyright (c) 2006-2009 Canada's Michael Smith Genome Science Centre. All rights reserved.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
Credits
René Warren, Granger Sutton, Steven Jones and Robert Holt

