SSAKE 3.2 (Dec 07, 2007)

SSAKE 3.2 adjusts contig ends to find new extension possibilities. A bug that prevented SSAKE from exploring the entire read space for contig extensions seeded by shorter reads has been fixed.

For additional information about this project, please visit the overview page.

Available downloads

ssake_v3-2.tar.gz

For all platforms (1.7 MB)

Release Notes

State: Final release
License: GPL
Release Manager: Rene Warren
  • SSAKE 3.2 adjusts contig ends to explore new extension possibilities.

The new option (-t) allows contig trimming at the 3' end, one base at a time, until the maximum number of trimmed bases (-t) is reached.  This option yields longer contigs but increases assembly run time and, at high -t values, might introduce contig misassemblies if your other run parameters (i.e. -m, -o and -r) are not stringent enough.  At -m 16 -o 3 -r 0.7, the best results were obtained at -t 1.  Trimming helps because it removes bases that cause premature breaks during fragment assembly.  If set, end-trimming kicks in only when all possibilities for extending a contig have been exhausted.
This release also fixes a major bug that prevented SSAKE from exploring the entire read space for contig extensions seeded by shorter reads.
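
The trimming behaviour described above can be sketched as a small loop. This is an illustrative sketch only, not SSAKE's actual code: `extend_with_trimming` and the `try_extend` callback are hypothetical names, and the sketch assumes that the trim budget (-t) applies once per contig and that a trimmed contig is kept as trimmed when no extension follows.

```python
from typing import Callable, Optional


def extend_with_trimming(contig: str,
                         try_extend: Callable[[str], Optional[str]],
                         max_trim: int) -> str:
    """Extend `contig` at its 3' end, trimming only when extension stalls.

    `try_extend` is a hypothetical callback that returns the extended
    contig, or None when no read can extend it.
    """
    trimmed = 0  # assumption: one trim budget for the whole contig
    while True:
        extended = try_extend(contig)
        if extended is not None:
            contig = extended
            continue
        # Every extension possibility is exhausted: trim one 3' base, if allowed.
        if trimmed >= max_trim or len(contig) <= 1:
            return contig
        contig = contig[:-1]
        trimmed += 1


# Toy extension rule: only a contig ending in "CGT" can be extended.
def toy_extend(seq: str) -> Optional[str]:
    return seq + "AAC" if seq.endswith("CGT") else None


print(extend_with_trimming("ACGTT", toy_extend, max_trim=0))
print(extend_with_trimming("ACGTT", toy_extend, max_trim=1))
```

In the toy run, the trailing T blocks extension; with a trim budget of 1 it is removed and the extension proceeds, which mirrors the "premature break" case the option is meant to address.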


  • SSAKE 3.1+ allows users to input a fasta file with DNA sequences for use as seeds to nucleate contig extension.

This feature can be used to extend existing/known DNA sequences using millions of short reads.  Paired-end reads now require a new input format, which allows reads of variable length to be considered.


  • SSAKE 3.0+ was developed using the previous point release, v2.0

New functions that use paired-end short read data to build scaffolds were implemented.  SSAKE v3.0+ tracks all paired reads in the assembly and outputs stats on pairing validity that can be used to identify putative misassemblies.


---------------- 

Building Scaffolds with SSAKE 3.0+

If the -p option is set to 1, it is assumed that the data supplied in the fasta file (-f) consists of paired-end reads, concatenated together, each starting with a focus base (not used for the assembly) -- see "Input sequences" in the SSAKE.readme file included with this release.

During data input, pairs are split and both used to fill the prefix tree and hash table, as described in Warren et al. 2007.
With the -p option set, the positions of all sequence reads in contigs of size -z and larger are tracked.
If a file is specified with -g, its unpaired sequences will be co-assembled along with paired reads during the SSAKE 3' extension but the former reads will NOT be tracked.

At the end of the overlap phase (aka contig extension), the -f fasta file is read again, associating reads with their pairs.
For each read pair, putative contig pairs (pre-scaffolding stage) are tallied based on the positions of the paired-end reads on different contigs.  Contig pairs are only considered if the calculated distance between them satisfies the mean distance specified (-d), within a deviation (-e) that is also defined by the user. Only contig pairs having a valid gap or overlap are allowed to proceed to the scaffolding stage.
Please note that this stage accepts redundancy of contig pairs: a given contig may link to multiple contigs, and the number of links (spanning pairs) between any given contig pair is recorded, along with a mean putative gap or overlap (-).
Once pairing between contigs is complete, the scaffolds are built using contigs (-z or larger) as seeds.  Every contig is used in turn until all have been incorporated into a scaffold.
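
The tallying step can be illustrated with a short sketch. It is not SSAKE's implementation: `tally_contig_pairs` and the observation tuples are hypothetical, the calculated distance and putative gap are taken as precomputed inputs, and -e is assumed to be a fraction of the mean distance (e.g. 0.25 for +/- 25%).

```python
from collections import defaultdict


def within_tolerance(distance: float, mean_d: float, e: float) -> bool:
    """True when `distance` lies within mean_d +/- (e * mean_d)."""
    return mean_d * (1 - e) <= distance <= mean_d * (1 + e)


def tally_contig_pairs(observations, mean_d: float, e: float):
    """Count links and average the putative gap for each contig pair.

    `observations` holds hypothetical tuples:
    (contig_a, contig_b, calculated_distance, putative_gap),
    where a negative gap stands for an overlap.
    """
    gaps = defaultdict(list)
    for contig_a, contig_b, distance, gap in observations:
        if within_tolerance(distance, mean_d, e):
            gaps[(contig_a, contig_b)].append(gap)
    # For each pair: (number of links, mean putative gap or overlap)
    return {pair: (len(g), sum(g) / len(g)) for pair, g in gaps.items()}


observations = [
    ("A", "B", 200, 10),
    ("A", "B", 210, 20),
    ("A", "B", 500, 300),  # outside the tolerance, ignored
    ("A", "C", 190, -5),   # negative gap = overlap
]
pairs = tally_contig_pairs(observations, mean_d=200, e=0.25)
print(pairs)
```

A contig may appear in several pairs (here A links to both B and C), which is the redundancy the scaffolding stage later resolves.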

Two parameters control scaffolding (-k and -a).  The -k option specifies the minimum number of links (read pairs) a valid contig pair MUST have to be considered.  The -a option specifies the maximum ratio between the link counts of the best two contig pairs for a given seed/contig being extended.

For example, contig A shares 4 links with B and 2 links with C, in this orientation.  Contig rA (reverse) also shares 3 links with D.  When it's time to extend contig A (with the options -k and -a set to 2 and 0.7, respectively), both contig pairs AB and AC are considered.  Since C (second-best) has 2 links and B (best) has 4, the ratio (2/4 = 0.5) is below the maximum ratio of 0.7, so A will be linked with B in the scaffold and C will be kept for another extension.  If AC had 3 links, the resulting ratio (3/4 = 0.75), above the user-defined maximum of 0.7, would have caused the extension to terminate at A, with both B and C considered for a different scaffold.  A maximum links ratio of 1 (not recommended) allows the best two candidate contig pairs to have the same number of links; in that case SSAKE will accept the first one, since both have a valid gap/overlap.

When a scaffold extension is terminated on one side, the scaffold is extended on the "left" by looking for contig pairs that involve the reverse of the seed (in this example, rAD).  With AB and AC having 4 and 2 links, respectively, and rAD being the only pair on the left, the final scaffolds output by SSAKE would be:
1) rD-A-B
2) C
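
The -k / -a decision in the example above can be reproduced with a minimal sketch. The function name `choose_extension` is hypothetical and this is not SSAKE's code; the sketch assumes that pairs with fewer than -k links are ignored entirely and that a ratio equal to -a is still accepted.

```python
from typing import Dict, Optional


def choose_extension(links: Dict[str, int], k: int, a: float) -> Optional[str]:
    """Pick the contig to link to the seed, or None to stop extending.

    `links` maps a candidate contig to its number of links with the seed.
    """
    # Keep only candidates with at least k links (assumption: others are ignored).
    candidates = [(name, n) for name, n in links.items() if n >= k]
    if not candidates:
        return None
    # Stable sort: candidates with equal link counts keep their input order.
    candidates.sort(key=lambda item: -item[1])
    best_name, best_links = candidates[0]
    if len(candidates) == 1:
        return best_name
    second_links = candidates[1][1]
    # Extend only when the second-best/best ratio does not exceed -a.
    return best_name if second_links / best_links <= a else None


print(choose_extension({"B": 4, "C": 2}, k=2, a=0.7))  # ratio 2/4 = 0.5
print(choose_extension({"B": 4, "C": 3}, k=2, a=0.7))  # ratio 3/4 = 0.75
print(choose_extension({"D": 3}, k=2, a=0.7))          # only pair on the "left"
```

With -k 2 and -a 0.7, the first call selects B, the second stops the extension, and the third selects D, matching the rD-A-B scaffold in the example.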

SSAKE outputs a .scaffolds file with linkage information between contigs (see the "Understanding the .scaffolds csv file" section in the SSAKE.readme file included with this release).

Accurate scaffolding depends on many factors.  The number and nature of repeats in your target sequence, the settings chosen for -d, -e, -k and -a, and the quality and size of the sequence set (more doesn't mean better) will all affect SSAKE's ability to build scaffolds.


*refer to the SSAKE.readme file for additional information


Change log