From asims at bcgsc.ca Wed Jul 2 20:01:44 2008
From: asims at bcgsc.ca (Asim Siddiqui)
Date: Wed Jul 2 20:03:02 2008
Subject: [Ssrformat] no telecon + comments
Message-ID: <86C6E520C12E52429ACBCB01546DF4D30167980B@xchange1.phage.bcgsc.ca>

There is no telecon tomorrow, but I will plan one for the next few weeks if there are issues to review.

If you have comments on the SRF or ZTR spec revisions, please make them known by the end of next week.

Best Regards,
Asim

From jkb at sanger.ac.uk Wed Jul 9 06:44:26 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Wed Jul 9 06:45:35 2008
Subject: [Ssrformat] Major illumina2srf bug
Message-ID: <20080709134426.GH27181@sanger.ac.uk>

Hello all,

Io_lib 1.11.2 contained a bugged version of illumina2srf (formerly known as solexa2srf) which produced incorrect SRF files when storing raw (.int/.nse) data. This document and the attached source explain the problem, how to spot it and how to fix it.

The bug does not occur if only one ZTR chunk of each type is present, which is the default behaviour for illumina2srf.

Please accept my apologies for not noticing this problem earlier.

James Bonfield (jkb@sanger.ac.uk)


The problem
-----------

SRF files produced using illumina2srf with the "-r" option (or, more specifically, with any two ZTR SMP4 chunks, which inherently requires raw data to be output) may sometimes label a ZTR trace chunk with the wrong meta-data.

The consequence is that the ZTR traces inside the SRF files appear to contain, for example, two processed SMP4 chunks and one SLXN chunk, but no SLXI chunk.

Note that the SRF file is correctly formatted, as are the ZTR files within it, so the problem may not be immediately obvious. Data extracted in fastq format is not affected, so the corrupted SRF files can still be used for sequence alignment.


Versions affected
-----------------

Bugged:
    io_lib-1.11.2   (illumina2srf v1.7)
    io_lib-1.11.0b8 (illumina2srf v1.3)

Working:
    io_lib-1.11.3   (illumina2srf v1.8, to be imminently released)
    io_lib-1.11.0   (illumina2srf v1.4)
    io_lib-1.11.0b7 and earlier

(Note that io_lib 1.11.1 was an internal version only and not officially released as a tar-ball.)

Anyone using CVS checkouts may have versions between official releases, so the specific range in which the bug existed is between these two points:

causes bug:
http://staden.cvs.sourceforge.net/staden/staden/src/io_lib/progs/solexa2srf.c?r1=1.9&r2=1.10

avoids bug:
http://staden.cvs.sourceforge.net/staden/staden/src/io_lib/progs/solexa2srf.c?r1=1.15&r2=1.16

and also between these two points:

causes bug:
http://staden.cvs.sourceforge.net/staden/staden/src/io_lib/progs/solexa2srf.c?r1=1.28&r2=1.29

fixes bug:
http://staden.cvs.sourceforge.net/staden/staden/src/io_lib/progs/solexa2srf.c?r1=1.31&r2=1.32


Detecting and fixing the files
------------------------------

The nature of the problem, mislabelling of some ZTR meta-data, means that it can be reversed without full reprocessing. So we provide a srf_check tool to identify problematic SRF files and a srf_fix tool to correct them.

srf_check [-f] filename.srf ...

    Specify one or more SRF files on the command line. The -f option performs a full check, but it is not recommended as it takes a long time (similar to srf2fastq) compared to ~40ms without it. The quicker mode simply checks the first few traces, because the problem, if it exists, will be consistent throughout the entire data set.

srf_fix input.srf output.srf

    This reorders the ZTR chunks to reattach the correct meta-data to the correct trace. It takes on the order of 5 minutes per lane (depending greatly on cluster density and I/O speed, of course).

The srf2illumina program also reveals the problem if you study the output.
Specifically you will see that there are very few entries in one of the .int.txt, .nse.txt or .sig2.txt files and more entries in another of the three.


Building the programs
---------------------

You will need to build and install io_lib-1.11.3 too.
At the time of writing this document it has not yet been released, but it will be imminently. If you urgently need to compile these tools see the io_lib CVS tree:

http://staden.cvs.sourceforge.net/staden/staden/src/io_lib/

[I am releasing this document in advance simply to stem the flow of interruptions which ironically are preventing me from releasing io_lib. :-)]

Then simply type "make" within the fix_srf directory and hopefully it should build cleanly.


Detailed post-mortem
--------------------

These gory details are written here purely for those who want to know the full cause of the bug and the solution.

Cause of bug:

Recall that ZTR files are split in two parts within SRF: the start is in the SRF data block header and the remainder is in the SRF data block. The two are concatenated together to form a single valid ZTR file.

In order to facilitate this we use a function named reorder_ztr() to sort the ZTR chunks into a specific order, with the common components suitable for the SRF data block header at the start.

The bug is caused by this re-sort not being a "stable" sort. (See section "Stability" in http://en.wikipedia.org/wiki/Sorting_algorithm). The impact is that, given multiple chunks of type SMP4 (let us call them chunks A, B and C), the reorder_ztr function may one time put them in order A, B, C and a second time put them in order B, A, C.

This in itself would not be a problem except for the fact that the meta-data from the first SMP4 chunk is placed in the SRF data block header, meaning we could accidentally get the meta-data for chunk A attached to chunk B in the above example.

The behaviour is, however, entirely deterministic: the variability comes from the number of chunks being sorted.
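
As a rough illustration of the stability issue described above, here is a minimal Python sketch. It is not the reorder_ztr() code from io_lib: the swap-based selection sort merely stands in for "some unstable sort", and the chunk ranking, chunk labels and input order are assumptions invented for the example.

```python
# Illustrative only: this is NOT the io_lib reorder_ztr() implementation.
# RANK is a hypothetical ordering of chunk types chosen for the example.
RANK = {"HUFF": 0, "SMP4": 1, "BASE": 2}

def unstable_reorder(chunks):
    """Selection sort with swaps, a simple example of an unstable sort."""
    out = list(chunks)
    for i in range(len(out)):
        # Index of the first lowest-ranked chunk in out[i:].
        j = min(range(i, len(out)), key=lambda k: RANK[out[k][0]])
        # The swap can move an equal-ranked chunk past its siblings.
        out[i], out[j] = out[j], out[i]
    return out

def stable_reorder(chunks):
    """Python's sorted() is guaranteed stable: equal ranks keep input order."""
    return sorted(chunks, key=lambda c: RANK[c[0]])

def smp4_order(chunks):
    """Labels of the SMP4 chunks, in the order they appear."""
    return [label for ctype, label in chunks if ctype == "SMP4"]

# Same three SMP4 chunks; only the first trace carries an extra HUFF chunk.
first_trace = [("SMP4", "A"), ("SMP4", "B"), ("HUFF", "h"),
               ("SMP4", "C"), ("BASE", "b")]
later_trace = [("SMP4", "A"), ("SMP4", "B"), ("SMP4", "C"), ("BASE", "b")]

print(smp4_order(unstable_reorder(first_trace)))  # ['B', 'A', 'C']
print(smp4_order(unstable_reorder(later_trace)))  # ['A', 'B', 'C']
print(smp4_order(stable_reorder(first_trace)))    # ['A', 'B', 'C']
print(smp4_order(stable_reorder(later_trace)))    # ['A', 'B', 'C']
```

With the unstable version the relative order of the SMP4 chunks depends on how many other chunks are present, while the stable version always preserves A, B, C.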

The first ZTR file output per tile contains multiple HUFF chunks. In theory all subsequent ones do too, but given that we know these are written to the SRF data block header they are omitted in subsequent traces as a speed optimisation. This leads to the reorder_ztr function being called with a differing number of chunks and hence is the trigger for the variable sort order.

Detecting it:

To detect the problem the srf_check program looks for multiple SMP4 ZTR chunks with identical meta-data. This should never happen, and when it does it indicates a chunk reordering.

Fixing it:

To achieve maximum compression the three SMP4 chunks are compressed using their own data-type specific Huffman trees. We use this to identify the real type of the data, rather than the type claimed by the meta-data. This then allows us to reorder the data for all traces to match the order output for the first trace per tile.

-- 
James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
                                  | Plurima gyrabant gymbolitare vabo;
  A Staden Package developer:     | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi.

-- 
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: srf_fix-1.0.tar.gz
Type: application/octet-stream
Size: 7969 bytes
Desc: not available
Url : http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080709/5296dfb4/srf_fix-1.0.tar.obj

From jkb at sanger.ac.uk Wed Jul 9 08:53:06 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Wed Jul 9 08:53:14 2008
Subject: [Ssrformat] Major illumina2srf bug
In-Reply-To: <20080709134426.GH27181@sanger.ac.uk>
References: <20080709134426.GH27181@sanger.ac.uk>
Message-ID: <20080709155306.GK27181@sanger.ac.uk>

On Wed, Jul 09, 2008 at 02:44:26PM +0100, James Bonfield wrote:
> You will need to build and install io_lib-1.11.3 too.
> At the time of writing this document it has not yet been released, but
> it will be imminently. If you urgently need to compile these tools
> see the io_lib CVS tree:

The above is no longer relevant, sorry for the delay. Io_lib-1.11.3 can now be downloaded from:

https://sourceforge.net/project/showfiles.php?group_id=100316&package_id=108243

Please let me know of any issues with this.

Thanks,

James

From jkb at sanger.ac.uk Thu Jul 17 02:25:59 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Thu Jul 17 02:26:09 2008
Subject: [Ssrformat] Unique read IDs
Message-ID: <20080717092559.GY5313@sanger.ac.uk>

What are centres doing regarding read names?

It's clear that the read names become important when attempting to index data.
Specifically, if a centre like EBI or NCBI is asked for a single individual trace by its original submitted read name, they need a way to pull it out. An index built on every single name in every single SRF file would be vast.

Read names consist of a prefix and a suffix, as detailed in the SRF spec. The *assumption* so far is that people will actually use this, so that in the worst case data warehouses would need to index one prefix per data block header.

A more useful assumption still is that the prefixes in the data block headers also share a prefix of their own. This is typically how illumina2srf works.

E.g. a Sanger read name may be "IL9_906:1:70:803:660". Of this, the "IL9_906:1:70:" part is in a data block header and "803:660" is in the individual trace data block itself. ALL data block headers in this example (1 lane) share "IL9_906:1:" in their prefix (that being the machine name, run ID and lane number). A full run SRF would have "IL9_906:" as a global prefix instead, as the lane part would vary.

So there are some important concepts here:

1) There is one single prefix which is global to the entire file. Every read in the file shares this same prefix.

2) All such global prefixes in data produced by the Sanger Institute should be unique. That is, a database could pair a single prefix with a single SRF file in order to rapidly work out which SRF file will contain any specific read name.

3) Our read names, although unique to us, may not be globally unique. The SRF specification (6.5.3.1) has information regarding this. My interpretation is that we should be prepending them with NSC.

4) In order for the data to be searchable in a practical manner, we may wish to dictate that there is a single globally unique name prefix for the entire SRF file. Currently we are under no obligation to ensure that the prefix strings in data block headers share a common prefix with each other, or indeed that an SRF file even makes use of the prefix.
It would even be permissible to replace all read names with md5sums and essentially make it impossible to figure out which name belongs in which SRF file.

So what are EBI and NCBI doing here? Are they requiring that all data submitted to them MUST start with either V or N (vendor name or sequencing centre name)?

James
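
The global-prefix idea from the "Unique read IDs" message above can be sketched roughly as follows. This is a hedged illustration only, not part of io_lib or the SRF spec: the function names, the SRF file name and the header prefixes other than "IL9_906:1:70:" are hypothetical, and it assumes ':' separates the fields of a read name.

```python
import os

def global_prefix(header_prefixes):
    """Longest field-aligned string shared by every data block header prefix."""
    common = os.path.commonprefix(list(header_prefixes))
    # commonprefix works per character, so trim back to the last ':' to
    # avoid cutting a field in half (e.g. "70" vs "71" sharing the "7").
    return common[: common.rfind(":") + 1]

def find_srf(read_name, index):
    """Return the SRF file whose global prefix starts read_name, else None."""
    matches = [p for p in index if read_name.startswith(p)]
    return index[max(matches, key=len)] if matches else None

# Hypothetical data block header prefixes for one lane of run IL9_906.
headers = ["IL9_906:1:70:", "IL9_906:1:71:", "IL9_906:1:100:"]
prefix = global_prefix(headers)

# One index entry per SRF file rather than one per read name.
index = {prefix: "IL9_906_lane1.srf"}  # hypothetical file name

print(prefix)                                   # IL9_906:1:
print(find_srf("IL9_906:1:70:803:660", index))  # IL9_906_lane1.srf
```

The point of the sketch is only that a single entry per file suffices for lookup when every read in the file shares one unique prefix.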