SSAKE

De novo genome assembly with short DNA sequence reads

Project Description

SSAKE

SSAKE is a de novo assembler for short DNA sequence reads. It is designed to help leverage the information from short sequence reads by assembling them into contigs and scaffolds that can be used to characterize novel sequencing targets. SSAKE is the first published algorithm for genome assembly with short DNA sequences. It assembles whole reads (not k-mers) and, as such, is well suited for structural variant assembly and detection. In 2016, SSAKE celebrates its 10th anniversary!

Genomic tools that use SSAKE algorithms

Algorithms of SSAKE are the core of many genomics applications (e.g. VCAKE, QSRA, SHARCGS, SSPACE, JR-Assembler) and their design continues to inspire new-generation assemblers (e.g. JR-Assembler, PNAS 2013) (Figure 1). Applications of SSAKE extend beyond genome assembly: the technology has been applied to profiling T-cell metagenomes, targeted de novo assembly and HLA typing, and was key to the discovery of Fusobacterium in colon cancer.

US National Marrow Donor Program / Be The Match

The US National Marrow Donor Program (NMDP)®/Be The Match® relies on SSAKE for HLA consensus assembly, which is a cornerstone of their allele interpretation pipeline designed to match donors and recipients. The NMDP®/Be The Match® is a global leader in facilitating bone marrow and umbilical cord blood transplants to save the lives of patients with leukemia, lymphoma, genetic disorders, and other diseases.

ICGC-TCGA DREAM Challenge

SSAKE is the assembly engine in novoBreak, the top-performing cancer genomic structural variant prediction pipeline in the ICGC-TCGA DREAM Genomic Mutation Calling Challenge.

 

Summary

SSAKE is written in Perl and runs on Linux. SSAKE cycles through short sequence reads stored in a hash table and progressively searches through a prefix tree for extension candidates. The algorithm assembled 25 to 300 bp [genome, transcriptome, amplicon] reads from viral, bacterial and fungal genomes. SSAKE is lightweight, robust, and simple to set up and run.
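
The read-extension idea described above can be illustrated with a minimal Python sketch. It is not SSAKE's actual Perl implementation: it ignores reverse complements, read multiplicity, consensus calling and paired-end information, and it stands in for the prefix tree with a plain dictionary keyed on read prefixes. The function names and the min_overlap parameter are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def build_prefix_index(reads, min_overlap):
    """Map each proper prefix of length >= min_overlap to the reads starting with it."""
    index = defaultdict(set)
    for read in reads:
        for k in range(min_overlap, len(read)):
            index[read[:k]].add(read)
    return index


def extend_seed(seed, reads, min_overlap=4):
    """Greedily extend `seed` in the 3' direction with reads that overlap its end."""
    unused = set(reads) - {seed}
    index = build_prefix_index(unused, min_overlap)
    max_len = max((len(r) for r in unused), default=0)
    contig = seed
    extended = True
    while extended:
        extended = False
        # Try the longest possible overlap first, then progressively shorter ones.
        max_k = min(len(contig), max_len - 1)
        for k in range(max_k, min_overlap - 1, -1):
            candidates = index.get(contig[-k:], set()) & unused
            if candidates:
                # Assumption: ties are broken alphabetically to keep the sketch
                # deterministic; a real assembler would weigh the candidates.
                read = min(candidates)
                contig += read[k:]
                unused.discard(read)
                extended = True
                break
    return contig


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "GTACGG", "CGGTTA"]  # toy data, not from any real dataset
    print(extend_seed("ACGTAC", toy_reads, min_overlap=3))
```

Each extension consumes one read, so the loop always terminates. Trying the longest overlap first is a design choice of this sketch that favours the most specific match.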

*Best performance is achieved by quality-trimming your reads before assembly (refer to the tools folder, SSAKE.readme/SSAKE.pdf, and the example assembly shell scripts in the test directory).
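
As an illustration of what quality trimming means, here is a minimal Python sketch that removes low-quality bases from the 3' end of a single read. It is not the trimming script shipped in the tools folder. The function name, the Phred+33 encoding and the default threshold of 20 are assumptions made for this example.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Trim low-quality bases from the 3' end of a read.

    seq  : base string
    qual : FASTQ quality string (Phred scores encoded with `offset`)
    Returns the trimmed (seq, qual) pair.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    # Walk back from the 3' end until a base meets the quality threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


if __name__ == "__main__":
    # 'I' encodes Phred 40 and '#' encodes Phred 2 under the assumed Phred+33 scheme.
    print(trim_3prime("ACGTACGT", "IIIII###"))
```

Reads that end up empty or very short after trimming would typically be discarded before assembly.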

Enjoy SSAKE responsibly!

About the author: www.renewarren.ca

 

Experimental, NGS test data


1) Zaire ebolavirus isolate Ebola 

A genome in a single contig (18.9 kbp) in less than 2 minutes*

Experimental Illumina MiSeq sequence dataset (PE254, [SRR2019530]) available for testing SSAKE:  ftp://ftp.bcgsc.ca/supplementary/SSAKE

To download and assemble, run: (cd test; ./MiSeqEbolaAssemblyPIPELINE.sh)

* Benchmarked on CentOS 7.1 64 dual core Intel(R) Xeon(R) CPU E7-8867 v3 @ 2.50GHz using 350MB RAM


2) Escherichia coli K12

A contiguous E. coli K12 assembly of quality-trimmed Illumina MiSeq PE300 reads in 20 minutes*

To download and assemble, run: (cd test; ./MiSeqEcoliAssembly250XPE300.sh)

E. coli K12 SSAKE assembly scaffold statistics with Illumina MiSeq
n (500)  Largest (bp)  L50  N50 length (bp)  Reconstruction (bp)
82       1,193,524     5    204,889          4,606,049

*Benchmarked on a CentOS 7 server with 128 CPUs (Intel(R) Xeon(R) CPU E7-8867 v3 @ 2.50GHz) and 2 TB RAM, using 1 thread per assembly. Wall clock time: 21m50s. Memory used: 9 GB RAM.
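
For readers unfamiliar with the L50 and N50 columns above, the following Python sketch computes both from a list of scaffold lengths. It uses the common definition: N50 is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembly length, and L50 is the number of scaffolds in that set. The function name is hypothetical, and SSAKE's own reporting may apply additional filters (such as a minimum scaffold length), so treat this as an illustration rather than the exact calculation behind the table.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    return 0, 0  # empty input


if __name__ == "__main__":
    # Toy lengths, not taken from the E. coli assembly.
    print(n50_l50([10, 8, 6, 4, 2]))
```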

 

3) Campylobacter showae CC57C

Experimental, quality-trimmed, Illumina MiSeq sequence dataset (PE150, Colorectal cancer tumor isolate bacteria C. showae CC57C [PRJNA189774]) available for testing SSAKE:  ftp://ftp.bcgsc.ca/supplementary/SSAKE

To download and assemble, run: (cd test; ./MiSeqCampylobacterAssembly.sh)


If you use the data in your research, please cite:

Warren RL, Freeman DJ, Pleasance S, Watson P, Moore RA, Cochrane K, Allen-Vercoe E, Holt RA. 2013. Co-occurrence of anaerobic bacteria in colorectal carcinomas. Microbiome 1:16

 

4) Fusobacterium nucleatum CC53

We provide quality-trimmed HiSeq reads for an F. nucleatum CRC tumor isolate, CC53.

To download and assemble, run: (cd test; ./HiSeqFusobacteriumAssembly.sh)

If you use the data in your research, please cite:

Castellarin M*, Warren RL*, Freeman JD, Dreolini L, Krzywinski M, Strauss J, Barnes R, Watson P, Allen-Vercoe E, Moore RA, Holt RA.  2012. Fusobacterium nucleatum infection is prevalent in human colorectal carcinoma. Genome Res. 22:299


5) De Novo Targeted Assembly of a TMPRSS2:ERG fusion in a prostate cancer RNA-Seq dataset

To download and assemble, run: (cd test; ./runSSAKEtargeted.sh) 

 

6) C. elegans linked-read de novo assembly

To download and assemble, run: (cd test; ./CelegansLinkedReadsAssembly.sh)

 

Citing

If you use SSAKE in your research, please cite:

Warren RL, Sutton GG, Jones SJM, Holt RA. 2007 (epub 2006 Dec 8). Assembling millions of short DNA sequences using SSAKE. Bioinformatics 23:500


License

Copyright (c) 2006-2017 Canada's Michael Smith Genome Science Centre. All rights reserved.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

Current Release
SSAKE 4.0

Released Jan 20, 2018

Initial support for linked reads

Download SSAKE for all platforms:
ssake_v4-0.tar.gz

All Releases

Version  Released      License  Status  Description
4.0      Jan 20, 2018  GPLv3    final   Initial support for linked reads
3.8.5    Apr 18, 2017  GPL      final   Implements targeted de novo assembly. Bug fix (in targeted assembly mode)
3.8.4    Jan 01, 2016  GPL      final   Targeted de novo assembly functionality improvements
3.8.3    Jun 09, 2015  GPL      final   Included tie-breaker option (-q) when determining consensus from equal-coverage bases, useful when the read coverage is very low. The Zaire ebola virus read data (SRR2019530) is now provided for testing SSAKE.
3.8.2    Apr 25, 2014  GPL      final   This release includes an option (-j) for adjusting the k-mer length when running SSAKE in TASR mode (-s). A recent Illumina MiSeq dataset is available for testing SSAKE's performance: ftp://ftp.bcgsc.ca/supplementary/SSAKE

Project Resources

Project owner: Rene Warren