[Ssrformat] Major illumina2srf bug

James Bonfield jkb at sanger.ac.uk
Wed Jul 9 06:44:26 PDT 2008


Hello all,

Io_lib 1.11.2 contained a buggy version of illumina2srf (formerly
known as solexa2srf) which produced incorrect SRF files when storing
raw (.int/.nse) data. This document and the attached source explain
the problem, how to spot it and how to fix it.

This bug does not occur if only one of each ZTR chunk type is present,
which is the default behaviour for illumina2srf.

Please accept my apologies for not noticing this problem earlier.

    James Bonfield (jkb at sanger.ac.uk)


The problem
-----------

SRF files produced using illumina2srf with the "-r" option (or more
specifically with any two ZTR SMP4 chunks, but that inherently requires
raw data to be output) may sometimes attach the wrong meta-data to a
ZTR trace chunk. The consequence is that the ZTR traces inside the SRF
files appear to contain, for example, two processed SMP4 chunks and
one SLXN chunk, but no SLXI chunk.

Note that the SRF file is correctly formatted, as are the ZTR files
within it, so the problem may not be immediately obvious.

Fastq format data is not affected, so use of the corrupted SRF files
for sequence alignment is fine.


Versions affected
-----------------

Bugged:   io_lib-1.11.2   (illumina2srf v1.7)
	  io_lib-1.11.0b8 (illumina2srf v1.3)
	  
Working:  io_lib-1.11.3   (illumina2srf v1.8, to be imminently released)
	  io_lib-1.11.0   (illumina2srf v1.4)
	  io_lib-1.11.0b7 and earlier

(Note that io_lib 1.11.1 was an internal version only and not
officially released as a tar-ball.)

Anyone using CVS checkouts may have versions between official
releases. The bug was present between these two revisions:

causes bug: http://staden.cvs.sourceforge.net/staden/staden/src/io_lib/progs/solexa2srf.c?r1=1.9&r2=1.10
avoids bug: http://staden.cvs.sourceforge.net/staden/staden/src/io_lib/progs/solexa2srf.c?r1=1.15&r2=1.16

and also between these two points:

causes bug: http://staden.cvs.sourceforge.net/staden/staden/src/io_lib/progs/solexa2srf.c?r1=1.28&r2=1.29
fixes bug: http://staden.cvs.sourceforge.net/staden/staden/src/io_lib/progs/solexa2srf.c?r1=1.31&r2=1.32


Detecting and fixing the files
------------------------------

The nature of the problem, mislabelling of some ZTR meta-data, means
that the problem can be reversed without full reprocessing. So we
provide a srf_check tool to identify problematic SRF files and a
srf_fix tool to correct them.

srf_check [-f] filename.srf ...

    Specify one or more SRF files on the command line. The -f option
    is a full check, but it's not recommended as it takes a long time
    (similar to srf2fastq) compared to ~40ms without it. The quicker
    mode simply checks the first few traces, since the problem, if it
    exists, is consistent throughout the entire data set.


srf_fix input.srf output.srf

    This reorders the ZTR chunks to reattach the correct meta-data
    with the correct trace. It will take of the order of 5 minutes per
    lane (greatly depending on cluster density and I/O speed of
    course).


The srf2illumina program also reveals the problem if you study the
output. Specifically you will see that there are very few entries in
one of the .int.txt, .nse.txt or .sig2.txt files and more entries in
another of the three.


Building the programs
---------------------

You will need to build and install io_lib-1.11.3 first.
At the time of writing it has not yet been released, but it will be
imminently. If you urgently need to compile these tools, see the
io_lib CVS tree:

    http://staden.cvs.sourceforge.net/staden/staden/src/io_lib/

[I am releasing this document in advance simply to stem the flow of
interruptions which ironically are preventing me from releasing io_lib.
:-)]

Then simply type "make" within the fix_srf directory and hopefully it
should build cleanly.


Detailed post-mortem
--------------------

These gory details are written here purely for those who want to know
the full cause of the bug and the solution.

Cause of bug:

Recall that ZTR files are split in two parts within SRF: the start is
in the SRF data block header and the remainder is in the SRF data
block. The two are concatenated together to form a single valid ZTR
file.

In order to facilitate this we use a function named reorder_ztr() to
sort the ZTR chunks into a specific order, with the common components
suitable for the SRF header block at the start. The bug is caused by
this re-sort not being a "stable" sort. (See section "Stability" in
http://en.wikipedia.org/wiki/Sorting_algorithm).

The impact is that given multiple chunks of type SMP4, let us call
them chunk A, B and C, the reorder_ztr function may one time put them
in order A, B, C and a second time put them in order B, A, C. This in
itself would not be a problem except for the fact that the meta-data
from the first SMP4 chunk is placed in the SRF data block header,
meaning we could accidentally get the meta-data for chunk A being
attached to chunk B in the above example. The behaviour is, however,
entirely deterministic: the variability comes from the number of
chunks being sorted.
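
As a rough illustration of the effect described above (this is not the
actual reorder_ztr code), the Python sketch below uses a simple
selection sort as a stand-in for an unstable sort. The chunk tuples,
the RANK table and the function names are all hypothetical and exist
only for this example. The presence of one extra HUFF chunk is enough
to change the relative order of the equally ranked SMP4 chunks, while
a stable sort keeps them in their input order.

```python
# Hypothetical chunk ordering: HUFF chunks must come before SMP4 chunks.
RANK = {"HUFF": 0, "SMP4": 1}

def unstable_sort(chunks):
    """Selection sort with swaps: correct by rank, but not stable."""
    out = list(chunks)
    for i in range(len(out)):
        # Index of the first lowest-ranked chunk in the unsorted tail.
        j = min(range(i, len(out)), key=lambda k: RANK[out[k][0]])
        # The swap can move out[i] past chunks that have the same rank.
        out[i], out[j] = out[j], out[i]
    return out

def smp4_order(chunks):
    """Labels of the SMP4 chunks, in the order they appear."""
    return [label for ctype, label in chunks if ctype == "SMP4"]

# First trace of a tile still carries a HUFF chunk; later traces do not.
first = [("SMP4", "A"), ("SMP4", "B"), ("HUFF", "h"), ("SMP4", "C")]
later = [("SMP4", "A"), ("SMP4", "B"), ("SMP4", "C")]

print(smp4_order(unstable_sort(first)))   # ['B', 'A', 'C']
print(smp4_order(unstable_sort(later)))   # ['A', 'B', 'C']

# Python's built-in sorted() is stable, so equal ranks keep input order.
print(smp4_order(sorted(first, key=lambda c: RANK[c[0]])))  # ['A', 'B', 'C']
```

Both runs of the unstable sort are deterministic; only the number of
chunks passed in differs between them.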

The first ZTR file output per tile contains multiple HUFF chunks. In
theory all subsequent ones do too, but given that we know these are
written to the SRF data block header they are omitted in subsequent
traces as a speed optimisation. This leads to the reorder_ztr function
being called with a differing number of chunks and hence is the
trigger for the variable sort order.


Detecting it:

To detect the problem the srf_check program looks for multiple SMP4
ZTR chunks with identical meta-data. This should never happen and when
it does it indicates a chunk reordering.
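
The idea behind this check can be sketched in a few lines of Python.
This is only an illustration and not the srf_check source: the
representation of a trace as a list of (chunk type, meta-data) pairs,
the function name and the meta-data labels are all assumptions made
for the example.

```python
from collections import Counter

def has_duplicate_smp4_metadata(chunks):
    """Return True if two SMP4 chunks in one trace share the same meta-data."""
    counts = Counter(meta for ctype, meta in chunks if ctype == "SMP4")
    return any(n > 1 for n in counts.values())

# Illustrative traces; "PROC" stands for the processed data here.
good_trace = [("BASE", ""), ("SMP4", "PROC"), ("SMP4", "SLXI"), ("SMP4", "SLXN")]
bad_trace = [("BASE", ""), ("SMP4", "PROC"), ("SMP4", "PROC"), ("SMP4", "SLXN")]

print(has_duplicate_smp4_metadata(good_trace))  # False
print(has_duplicate_smp4_metadata(bad_trace))   # True
```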


Fixing it:

To achieve maximum compression the three SMP4 chunks are compressed
using their own data-type specific Huffman trees. We utilise this to
identify the real type of the data rather than relying on the type
claimed by the meta-data. This then allows us to reorder the data for
all traces to match the order output for the first trace per tile.
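
A minimal Python sketch of that reordering step follows. It is not the
srf_fix implementation: it assumes each SMP4 chunk can be reduced to a
(Huffman code-set ID, payload) pair, and the code-set numbers, payload
strings and function name are hypothetical.

```python
def reorder_like_first(first_trace, trace):
    """Reorder `trace` so its Huffman code-set IDs follow `first_trace`."""
    # Position of each code-set ID in the first trace of the tile.
    position = {codeset: i for i, (codeset, _) in enumerate(first_trace)}
    return sorted(trace, key=lambda chunk: position[chunk[0]])

# Chunks as (hypothetical code-set ID, payload) pairs.
first_trace = [(1, "processed data"), (2, "intensity data"), (3, "noise data")]
later_trace = [(2, "intensity data"), (1, "processed data"), (3, "noise data")]

fixed = reorder_like_first(first_trace, later_trace)
print([codeset for codeset, _ in fixed])  # [1, 2, 3]
```

Keying the reorder on the code set rather than on the meta-data is the
point of the exercise, since the meta-data is the part that cannot be
trusted in the affected files.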

-- 
James Bonfield (jkb at sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
                                  | Plurima gyrabant gymbolitare vabo;
  A Staden Package developer:     | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi. 



-------------- next part --------------
A non-text attachment was scrubbed...
Name: srf_fix-1.0.tar.gz
Type: application/octet-stream
Size: 7969 bytes
Desc: not available
Url : http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080709/5296dfb4/srf_fix-1.0.tar.obj

