From asims at bcgsc.ca Mon Sep 1 17:47:53 2008
From: asims at bcgsc.ca (Asim Siddiqui)
Date: Mon Sep 1 17:48:17 2008
Subject: [Ssrformat] Questions in prep for SFF conversion
References: <211F72AAC0FAAA4D8C4C89FF326D43F599F57D@rnumsem04.nala.roche.com>
Message-ID: <86C6E520C12E52429ACBCB01546DF4D301679815@xchange1.phage.bcgsc.ca>

James,

Please see below.

Asim

________________________________

From: ssrformat-bounces@mail.bcgsc.ca on behalf of Knight, James
Sent: Thu 07/08/2008 10:50 AM
To: ssrformat@bcgsc.ca
Subject: [Ssrformat] Questions in prep for SFF conversion

I'm beginning the definition and implementation of the 454 SRF format and SFF-SRF conversion software, and I have several questions right now.

First, I'm looking through the two specs (SRF and ZTR) and the srf2illumina/srf2fastq code, and I can't seem to figure out where you explicitly tell the type for each read (454, Illumina, ABI, ...) when you are encoding the data block headers and data blocks. The code for srf2solexa.c and srf2fastq.c seems to assume that the SRF file they are being given is one that contains only those reads. The only part of the spec that mentions read types is the reservation of uniqueIdPrefix's for the various vendors, but these accession-related fields can be rewritten. If, for example, I'm getting some SRF file from the Broad Institute, where they have hidden all of the vendor-specific accnos, how do I tell what type or types of reads are in the file? What am I missing?

Asim> There is no explicit encoding of the platform within SRF - the id prefixes are there to avoid namespace clashes. As SRF is a generic container of reads, base calls from one platform should be the same as those from another. You may be able to deduce the platform from the base caller. Alternatively, if the file contains vendor-specific information, you may be able to deduce it from that. If the file doesn't contain vendor-specific information, you can infer that the file creator does not intend to make that info available.

Taking that one step farther, what are the guidelines for how to encode the read info into the file? Do the ZTR chunk types define what the data is going to be, and are they the only modes for storing read info (i.e., for all vendors, BASE chunks must be where the sequence is kept, BPOS chunks are the base-to-sample/flow indexes, ...), or is it more up to me to define the mapping between chunks and data fields?

Asim> There are some general rules here, e.g. BASE chunks are where the sequence should be stored; however, there is flexibility too. Please propose a mapping and we'll provide feedback on it. In general, as long as it makes sense, you won't have any difficulties here.

Second, James, does the addition of new data format types to the ZTR code result in a major release change to the ZTR code? As good as generic compression schemes are, I'm more of a proponent of content-aware compression, and would like to use some of the techniques that were in the SRA documentation and/or that I've thought of in the encoding of the flowgram data. Specifically, I'd like to include data formats that handle the (1) limited precision floating point to integer conversions and (2) integer to X bit conversions (i.e., storing each value in X bits, with an overflow bit value to signal higher integer values, so if you have an 8 bit encoding, values 0 to 254 are stored as one 8 bit value, values 255 to 509 are stored as two 8 bit values (255 plus 0-254), and so on).

If those two data formats were part of the spec, and the formats were combinable (which I thought you had said they were), then, for example, the flow-to-base values could be encoded using first a delta format converter, then the integer to 2 bit converter, and the flowgram values could be encoded as the limited precision converter followed (I was thinking) by an integer to 8 bit converter (and then we could throw on a Huffman converter, if it helps).
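
The 8 bit case of the overflow scheme described above can be sketched as follows. This is only an illustration of the arithmetic in the message (0 to 254 in one byte, 255 to 509 as 255 plus a second byte, and so on); it is written in Python for brevity, the function names are hypothetical, and it is not a ZTR data format or any part of io_lib.

```python
def encode_overflow8(values):
    """Encode non-negative integers, one or more bytes per value.

    The byte 255 means "add 255 and keep reading"; any byte in 0..254
    terminates the current value.
    """
    out = bytearray()
    for v in values:
        if v < 0:
            raise ValueError("only non-negative integers are supported")
        while v >= 255:
            out.append(255)  # overflow marker
            v -= 255
        out.append(v)        # terminating byte, always 0..254
    return bytes(out)


def decode_overflow8(data):
    """Invert encode_overflow8."""
    values = []
    acc = 0
    pending = False  # True while we are inside a run of 255 markers
    for b in data:
        if b == 255:
            acc += 255
            pending = True
        else:
            values.append(acc + b)
            acc = 0
            pending = False
    if pending:
        raise ValueError("stream ends inside an overflow run")
    return values


if __name__ == "__main__":
    sample = [0, 254, 255, 509, 510]
    packed = encode_overflow8(sample)
    print(list(packed))
    assert decode_overflow8(packed) == sample
```

A narrower width (such as the 2 bit variant mentioned for flow-to-base deltas) would follow the same pattern with the largest code reserved as the overflow marker, plus bit packing, which this sketch leaves out.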

Asim> There has been some contemplation of fixed precision floating point numbers, but this topic requires further discussion.

I'll be following up in the coming days with the definitions and code. The contents of the 454 read SRF entries will contain the same information as in the SFF files (unless there are objections), plus the region-based definitions of the reads (the converter code will have the tools to find and mark the key, 3' sequencing primer, paired-end linker, the two halves of paired-end reads, any multiplex IDs, ...).

Jim

From asims at bcgsc.ca Mon Sep 1 17:51:22 2008
From: asims at bcgsc.ca (Asim Siddiqui)
Date: Mon Sep 1 17:51:48 2008
Subject: [Ssrformat] next telecon
Message-ID: <86C6E520C12E52429ACBCB01546DF4D301679817@xchange1.phage.bcgsc.ca>

Hi all,

It has been a while since we had a telecon and there are a few items to touch on, e.g. ZTR 1.3 spec final comments, implementation issues, additional features.

Please register your availability on the following doodle site: http://www.doodle.ch/d4tmyq43ci94sx47

Asim

From jkb at sanger.ac.uk Mon Sep 8 06:18:43 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Mon Sep 8 06:18:51 2008
Subject: [Ssrformat] Questions in prep for SFF conversion
In-Reply-To: <86C6E520C12E52429ACBCB01546DF4D301679815@xchange1.phage.bcgsc.ca>
References: <211F72AAC0FAAA4D8C4C89FF326D43F599F57D@rnumsem04.nala.roche.com> <86C6E520C12E52429ACBCB01546DF4D301679815@xchange1.phage.bcgsc.ca>
Message-ID: <20080908131842.GV32763@sanger.ac.uk>

Hello Asim, Jim,

Sorry for the delay. I've been on holiday (getting soaked, bah to the weather!).

> First, I'm looking through the two specs (SRF and ZTR) and the
> srf2illumina/srf2fastq code, and I can't seem to figure out where
> you explicitly tell the type for each read (454, Illumina, ABI, ...)
> when you are encoding the data block headers and data blocks.

I don't think there is one currently. I guess it should belong in the TEXT header somewhere, although the data format itself shouldn't be reliant on a machine manufacturer.

> code for srf2solexa.c and srf2fastq.c seems to assume that the SRF
> file they are being given is one that contains only those reads.

For srf2solexa this is certainly designed that way. Its sole purpose was (and in theory still is) to be a debugging/test tool to verify that we can extract the data that we just put in. srf2fastq shouldn't have any machine specifics as it just extracts bases and qualities, which will exist on all platforms that produce such data. If we want it to support SOLiD then I guess we need to refine the fastq format (I don't think it's ever been formally documented though, nor am I sure who is responsible for it).

> If, for example, I'm getting some SRF file from the Broad Institute,
> where they have hidden all of the vendor-specific accnos, how do I
> tell what type or types of reads are in the file? What am I
> missing?

This is another minefield which IMO the EBI and NCBI need to either put their foot down on quickly, or otherwise state upfront that they will not honour original read names. I think it's somewhat over the top to store reads with the original supplied read names if the names themselves have been scrambled. A site encrypting names is, by definition, not trying to store any information in the name itself. Hence it's possibly justifiable to ignore it and use your own (this is the NCBI approach maybe?).

Conversely, if a site wants names kept, I believe it should be clearly stated in the submission specs that ALL reads in that SRF should start with a unique prefix.
This means that the archive can simply store one name prefix for the entire SRF file and any individual read queries are easily looked up. (Given name fubar, search the prefixes and determine that SRF A has prefix fu, then find fubar within it.) If they attempt to store the original cryptographically scrambled names then it means implementing an index holding every read name from every SRF file, which is a truly mammoth index and rapidly makes the archive a nightmare.

This isn't really an SRF thing but more of a submission requirement, but perhaps we could add it to SRF too - the notion of a common archive prefix.

> Taking that one step farther, what are the guidelines for how to
> encode the read info into the file. Do the ZTR chunk types define
> what the data is going to be and are the only modes for storing read
> info (i.e., for all vendors, BASE chunks must be where the sequence
> is kept, BPOS chunks are the base-to-sample/flow indexes, ...), or
> is it more up to me to define the mapping between chunks and data
> fields?
>
>
> Asim> There are some general rules here, e.g. BASE chunks are where
> the sequence should be stored, however, there is flexibility
> too. Please propose a mapping and we'll provide feedback on it. In
> general, as long as it makes sense, you won't have any difficulties
> here.

Seconded. Hopefully it should be kind of obvious. Sequences in BASE chunks, etc. You'll have to determine yourself which chunks belong in the DBH blocks and which belong in DB blocks, but that's also obvious - if it doesn't change, put it in the DBH.

> Second, James, does the addition of new data format types to the ZTR
> code result in a major release change to the ZTR code?

It will probably result in a minor version upgrade, i.e. from ZTR 1.3 to 1.4.

> in the encoding of the flowgram data. Specifically, I'd like to
> include data formats that handle the (1) limited precision floating
> point to integer conversions and

Floating or fixed point?
Currently 454 SFF data uses a fixed point system (multiplying by 100) rather than floating point (exponent plus mantissa). Agreed that it would be better to make this explicit rather than just having to know to divide by 100 when reading the results back.

> (2) integer to X bit conversions
> (i.e., storing each value in X bits, with an overflow bit value to
> signal higher integer values, so if you have an 8 bit encoding,
> values 0 to 254 are stored as one 8 bit value, values 255 to 509 are
> stored as two 8 bit values (255 plus 0-254), and so on).

This is sort of what the 16to8 and 32to8 transforms are designed for in ZTR. You store the data in full 16 or 32 bit form, run a 16 or 32-bit delta function if appropriate producing new 16 or 32-bit values, and then apply the 16to8 or 32to8 compression method to shrink it down to 8 bits per value, utilising an escape code as necessary. This is precisely how the ABI capillary data is stored and what ZTR was initially designed around.

> I'll be following up in the coming days with the definitions and
> code. The contents of the 454 read SRF entries will contain the
> same information as in the SFF files (unless there are objections),
> plus the region-based definitions of the reads (the converter code
> will have the tools to find and mark the key, 3' sequencing primer,
> paired-end linker, the two halves of paired-end reads, any multiplex
> IDs, ...).

Sounds sensible to me.

Cheers,

James

--
James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
                                  | Plurima gyrabant gymbolitare vabo;
  A Staden Package developer:     | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi.

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
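
The pipeline James outlines (fixed point scaling by 100, an optional delta pass, then shrinking 16-bit values to 8 bits with an escape code) can be sketched like this. It is an illustration only: the function names are hypothetical, the choice of -128 as the escape byte followed by a big-endian 16-bit value is an assumption made for this sketch, and the authoritative byte layout of the 16to8 transform is whatever the ZTR specification defines.

```python
import struct

ESCAPE = -128  # assumed escape code; see the ZTR spec for the real layout


def to_fixed_point(flows, scale=100):
    """Convert flowgram values to integers, e.g. 1.03 -> 103."""
    return [int(round(f * scale)) for f in flows]


def delta(values):
    """Replace each value by its difference from the previous one."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def undelta(deltas):
    """Invert delta() with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def shrink_16_to_8(values):
    """One signed byte when the value fits in -127..127, else escape + 16 bits."""
    out = bytearray()
    for v in values:
        if -127 <= v <= 127:
            out += struct.pack("b", v)
        else:
            out += struct.pack(">bh", ESCAPE, v)  # raises if v exceeds 16 bits
    return bytes(out)


def expand_8_to_16(data):
    """Invert shrink_16_to_8()."""
    values, i = [], 0
    while i < len(data):
        (b,) = struct.unpack_from("b", data, i)
        if b == ESCAPE:
            (v,) = struct.unpack_from(">h", data, i + 1)
            i += 3
        else:
            v = b
            i += 1
        values.append(v)
    return values


if __name__ == "__main__":
    flows = [1.03, 0.02, 2.10]
    fixed = to_fixed_point(flows)
    packed = shrink_16_to_8(delta(fixed))
    assert undelta(expand_8_to_16(packed)) == fixed
    print([v / 100 for v in fixed])  # dividing by 100 recovers the flow values
```

Whether a delta pass helps depends on the data, as the message says ("if appropriate"); it is included here only to show how the stages chain.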

From shumwaym at ncbi.nlm.nih.gov Mon Sep 8 09:27:47 2008
From: shumwaym at ncbi.nlm.nih.gov (Shumway, Martin (NIH/NLM/NCBI) [E])
Date: Mon Sep 8 09:27:55 2008
Subject: [Ssrformat] cannot get illumina2srf to work under OSX
Message-ID:

Hi,

I'm inquiring on behalf of a submitter trying to construct SRF files for Illumina runs on an OSX machine. Are there any known problems with this? We've tried many things, including paring down the dataset. The problem is that srf2fasta does not run on the output, and there are error messages produced during illumina2srf operation. Any debugging ideas would be very helpful.

Thanks!
martin

-----Original Message-----
From: Kristian Stevens [mailto:kastevens@ucdavis.edu]
Sent: Monday, September 08, 2008 12:20 PM
To: O'Sullivan, Christopher (NIH/NLM/NCBI) [C]
Cc: Charis M. Cardeno; Chuck Langley
Subject: Re: SRTA SUBMISSION

I think I've exhausted all variants. I'm using io_lib-1.11.3 and the README has pretty simple configure and compile instructions for all platforms. My only option now is to try this on a generic linux platform instead of OSX.

I've tried the following:

Making a small archive ( 10 tiles ).
Omitting the -mf option.
Concatenating the files.

An example command is:

illumina2srf -R -p -N 080128_HWI-EAS162_0002_FC2027UA:%l:%t: -n %x:%y -o ./FC2027UA_LANE5.srf s_5_011?_seq.txt

The error I get consistently:

>080128_HWI-EAS162_0002_FC2027UA:5:110:120:326
TTTTATTTTAGCATTATTGGCTACAAATAAGTATGA
Zero or greater than one BASE chunks found.
Zero or greater than one BASE chunks found.
>080128_HWI-EAS162_0002_FC2027UA:5:110:120:497
TGATCAAAAGCATAGATCCTCGAATTGCCAGTGGTA
>080128_HWI-EAS162_0002_FC2027UA:5:110:117:580
TAAATGTACATTGAAACATTATTTTATTTTATATAT
Zero or greater than one BASE chunks found.
Zero or greater than one BASE chunks found.
Zero or greater than one BASE chunks found.
Zero or greater than one BASE chunks found.
Zero or greater than one BASE chunks found.
Zero or greater than one BASE chunks found.

---
Martin Shumway, Staff Scientist
DHHS/NIH/NLM/NCBI
45 Center Drive
Bldg. 45/Room 6AS37D-53 MSC 6510
Bethesda, MD 20892
shumwaym@ncbi.nlm.nih.gov
tel: 301.402.4041
fax: 301.402.9651

From jkb at sanger.ac.uk Tue Sep 9 03:44:53 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Tue Sep 9 03:45:23 2008
Subject: [Ssrformat] cannot get illumina2srf to work under OSX
In-Reply-To:
References:
Message-ID: <20080909104453.GA32763@sanger.ac.uk>

Hello Martin,

> Are there any known problems with this ?

It's a new one to me.

> I'm inquiring on behalf of a submitter trying to construct SRF files
> for Illumina runs on an OSX machine.

Do you know what type of OSX machine? Intel or PPC? If PPC then it could be an endianness bug. I'm building it now to check (it used to work, but I rarely test on big-endian systems so who knows).

James

From shumwaym at ncbi.nlm.nih.gov Tue Sep 9 13:48:38 2008
From: shumwaym at ncbi.nlm.nih.gov (Shumway, Martin (NIH/NLM/NCBI) [E])
Date: Tue Sep 9 13:48:45 2008
Subject: [Ssrformat] cannot get illumina2srf to work under OSX
In-Reply-To: <20080909104453.GA32763@sanger.ac.uk>
References: <20080909104453.GA32763@sanger.ac.uk>
Message-ID:

James,

Here are the details: arch=i386, and see other outputs below.

Martin

i386

On Sep 9, 2008, at 1:39 PM, Shumway, Martin (NIH/NLM/NCBI) [E] wrote:

> Sorry, one more question:
>
> can you run "arch"
> on your system ?
>
> Thanks, Martin
>
> -----Original Message-----
> From: Kristian Stevens [mailto:kastevens@ucdavis.edu]
> Sent: Tuesday, September 09, 2008 4:38 PM
> To: Shumway, Martin (NIH/NLM/NCBI) [E]
> Cc: O'Sullivan, Christopher (NIH/NLM/NCBI) [C]; Charis M. Cardeno;
> Chuck
> Langley
> Subject: Re: SRTA SUBMISSION
>
> Please share.
>
> On Sep 9, 2008, at 1:33 PM, Shumway, Martin (NIH/NLM/NCBI) [E] wrote:
>
>> Hi Kristian,
>>
>> I'd like to propagate this bug report to the authors of io_lib.
>> Do you mind sharing
>>
>> - version of io_lib
> io_lib-1.11.3
>>
>> - runtime environment (output of uname -a on your system)
> Darwin i-dpgp.ucdavis.edu 9.2.0 Darwin Kernel Version 9.2.0: Tue Feb
> 5 16:13:22 PST 2008; root:xnu-1228.3.13~1/RELEASE_I386 i386
> aka
> OSX 10.5 Leopard Server
>>
>> - error messages or symptoms when running illumina2srf
> none
>>
>> - error messages or symptoms when running srf2fasta
> The error I get consistently:
>> 080128_HWI-EAS162_0002_FC2027UA:5:110:120:326
> TTTTATTTTAGCATTATTGGCTACAAATAAGTATGA
> Zero or greater than one BASE chunks found.
> Zero or greater than one BASE chunks found.
>> 080128_HWI-EAS162_0002_FC2027UA:5:110:120:497
> TGATCAAAAGCATAGATCCTCGAATTGCCAGTGGTA
>> 080128_HWI-EAS162_0002_FC2027UA:5:110:117:580
> TAAATGTACATTGAAACATTATTTTATTTTATATAT
> Zero or greater than one BASE chunks found.
> Zero or greater than one BASE chunks found.
> Zero or greater than one BASE chunks found.
> Zero or greater than one BASE chunks found.
> Zero or greater than one BASE chunks found.
> Zero or greater than one BASE chunks found.

From CRaczy at illumina.com Tue Sep 9 14:32:41 2008
From: CRaczy at illumina.com (Raczy, Come)
Date: Tue Sep 9 14:37:10 2008
Subject: [Ssrformat] cannot get illumina2srf to work under OSX
References: <20080909104453.GA32763@sanger.ac.uk>
Message-ID:

Hi Martin,

Can you confirm that you are getting the error when you are using srf2fasta on an archive created with illumina2srf, as indicated here:

>> - error messages or symptoms when running illumina2srf
> none
>>
>> - error messages or symptoms when running srf2fasta
> The error I get consistently:
>> 080128_HWI-EAS162_0002_FC2027UA:5:110:120:326
> TTTTATTTTAGCATTATTGGCTACAAATAAGTATGA
> Zero or greater than one BASE chunks found.

Did you try to use srf2illumina on the same archive?

Come

_______________________________________________
Ssrformat mailing list
Ssrformat@mail.bcgsc.ca
http://www.bcgsc.ca/mailman/listinfo/ssrformat

From jkb at sanger.ac.uk Wed Sep 10 01:14:09 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Wed Sep 10 01:14:16 2008
Subject: [Ssrformat] cannot get illumina2srf to work under OSX
In-Reply-To:
References:
Message-ID: <20080910081408.GD32763@sanger.ac.uk>

Hello Martin,

> I'm inquiring on behalf of a submitter trying to construct SRF files
> for Illumina runs on an OSX machine.

I have easy access to an old PPC iMac so I tested the 1.11.3 and latest CVS builds on that platform and it worked just fine. Running srf2fastq worked too, and indeed running both code versions on an AMD64 vs PPC gave binary identical results for the srf files and also fasta/fastq conversions.

I don't have ready access to an Intel Mac, but I'm sure there are lots around I could borrow time on if necessary.

It's possible it's a data-centric bug. Is it possible to shrink the data set and find a small amount (1 tile maybe) that produces an error?

James

From jkb at sanger.ac.uk Wed Sep 10 04:42:37 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Wed Sep 10 04:42:43 2008
Subject: [Ssrformat] cannot get illumina2srf to work under OSX
In-Reply-To:
References: <20080909104453.GA32763@sanger.ac.uk>
Message-ID: <20080910114237.GI32763@sanger.ac.uk>

On Tue, Sep 09, 2008 at 04:48:38PM -0400, Shumway, Martin (NIH/NLM/NCBI) [E] wrote:
> >> - version of io_lib
> > io_lib-1.11.3
> >>
> >> - runtime environment (output of uname -a on your system)
> > Darwin i-dpgp.ucdavis.edu 9.2.0 Darwin Kernel Version 9.2.0: Tue Feb
> > 5 16:13:22 PST 2008; root:xnu-1228.3.13~1/RELEASE_I386 i386
> > aka
> > OSX 10.5 Leopard Server

I've found the bug now. Specifically it is that Intel-based Macs will be producing incorrect SRF files, and maybe some other tools will break too. Thanks for identifying the error.

> >>
> >> - error messages or symptoms when running illumina2srf
> > none

No error messages, but actually this will have produced incorrect data.

The error was in my hacky os.h which detects machine endianness. I had the autoconf macros in there as well, but apparently they were being overridden by my own (broken) checks.
I swapped the order around a bit and it seems to now work correctly.

The cause of this dual nature for endianness checking comes from io_lib being both a stand-alone library and also used as part of the full Staden Package build environment. I really need to tidy this up at some stage, but as usual it's just finding the time to do it right.

The bug fix is a new io_lib/os.h - just rip out the system-specific bits and leave the autoconf check near the start to set SP_BIG_ENDIAN or SP_LITTLE_ENDIAN and it should work. I'll produce a new distribution shortly.

James

From shumwaym at ncbi.nlm.nih.gov Wed Sep 10 06:13:49 2008
From: shumwaym at ncbi.nlm.nih.gov (Shumway, Martin (NIH/NLM/NCBI) [E])
Date: Wed Sep 10 06:14:03 2008
Subject: [Ssrformat] cannot get illumina2srf to work under OSX
In-Reply-To:
References: <20080909104453.GA32763@sanger.ac.uk>
Message-ID:

Hi Come,

Yes, he did everything on the same archive file using the same build of the software on the same computer. So here are the facts:

arch: i386
io_lib 1.11.3
uname -a: Darwin i-dpgp.ucdavis.edu 9.2.0 Darwin Kernel Version 9.2.0: Tue Feb 5 16:13:22 PST 2008; root:xnu-1228.3.13~1/RELEASE_I386 i386 aka OSX 10.5 Leopard Server
illumina2srf errors: none
srf2fasta errors:
> The error I get consistently:
>> 080128_HWI-EAS162_0002_FC2027UA:5:110:120:326
> TTTTATTTTAGCATTATTGGCTACAAATAAGTATGA
> Zero or greater than one BASE chunks found.
> Zero or greater than one BASE chunks found.
>> 080128_HWI-EAS162_0002_FC2027UA:5:110:120:497
> TGATCAAAAGCATAGATCCTCGAATTGCCAGTGGTA
>> 080128_HWI-EAS162_0002_FC2027UA:5:110:117:580
> TAAATGTACATTGAAACATTATTTTATTTTATATAT
> Zero or greater than one BASE chunks found.
> Zero or greater than one BASE chunks found.
> Zero or greater than one BASE chunks found.
> Zero or greater than one BASE chunks found.
> Zero or greater than one BASE chunks found.
> Zero or greater than one BASE chunks found.

Thanks,
Martin

From jkb at sanger.ac.uk Thu Sep 11 08:45:07 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Thu Sep 11 08:45:38 2008
Subject: New io_lib release: (Was: [Ssrformat] cannot get illumina2srf to work under OSX)
In-Reply-To: <20080910081408.GD32763@sanger.ac.uk>
References: <20080910081408.GD32763@sanger.ac.uk>
Message-ID: <20080911154506.GM32763@sanger.ac.uk>

Hello all,

I've finally released the latest version of io_lib - version 1.11.4. See:

https://sourceforge.net/project/showfiles.php?group_id=100316&package_id=108243&release_id=625645

The list of changes follows.

Version 1.11.4 (11th September 2008)
--------------

* New "make check" build target to perform some automated testing.
  Currently limited to testing the SRF tools.

* Fixed machine endianness issues. Specifically this resolves known
  Intel MacOS-X problems.

* New SRF tools
  - srf_info: reports simple metrics on the contents of an SRF file.
  - srf_filter: slices and dices the SRF file to produce a new one
    with various types of data removed.

* illumina2srf
  - Minor float/int rounding change when storing int/nse/sig2 data.
  - Improved error detection such that it returns a failure code more
    often given a parsing issue.
  - Added -pf/pr parameters for storing Phasing files.
  - Reduced memory usage, especially on large numbers of clusters per tile.
    We may now produce multiple DBH blocks per tile. Also major
    reduction to memory when handling the .params files.
  - Added storage of 2nd .params file (firecrest).
  - Fixed bug in the automatic base-call version identification.
  - Fixed a bug with using -qf/qr when not providing all tiles (ie not
    starting from tile number 1).
  - Bug fix with storing the reverse matrix file in paired-end runs; a
    duplicate of the forward one was being used instead.

* General SRF
  - Improved error checking in srf_index_hash. It now spots duplicate
    reads and also has a -c option to check an existing SRF file
    without writing the index.
  - Fixed a memory leak in srf_next_ztr(), triggered in srf2fastq -C.

James

PS. Just a heads-up, it's likely the next main release, 1.12, will be
called staden_io_lib and maybe the library will be libstaden_read.a
instead of libread.a too. This is a change proposed by Debian and I
think already being used by some Linux vendors to avoid name clashes.

From JEmhoff at helicosbio.com Wed Sep 17 13:40:26 2008
From: JEmhoff at helicosbio.com (John Emhoff)
Date: Wed Sep 17 13:40:33 2008
Subject: [Ssrformat] Index block
Message-ID: <413BC2C30D426E45A512157E7C9BFB4301267348@wildcat.HBSC.local>

Howdy --

It appears to me that the current SRF tools generate files that are
ever-so-slightly out of spec. The spec indicates that the last eight
bytes of the file are to be the size of the index block, and should be
zero if the index block is not present.
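
A minimal Python sketch of the check this reading of the spec would allow (illustrative only, not io_lib code). The function name `read_trailing_index_size` is hypothetical, and the byte order is an assumption: the size is taken to be an unsigned 64-bit big-endian integer, which the thread itself does not state.

```python
import struct


def read_trailing_index_size(path):
    """Return the index-block size stored in the last eight bytes of a file.

    Assumption: the size is an unsigned 64-bit big-endian integer.
    Under the reading of the spec discussed here, 0 means "no index block".
    """
    with open(path, "rb") as fh:
        fh.seek(0, 2)                      # move to end of file
        if fh.tell() < 8:
            raise ValueError("file too short to hold the trailing size field")
        fh.seek(-8, 2)                     # position on the last eight bytes
        (size,) = struct.unpack(">Q", fh.read(8))
    return size
```

Note that this only gives a meaningful answer for files that actually carry the trailing field; a file written without it would yield whatever its final eight bytes happen to be.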
My interpretation is that the index block itself is optional, but not
these trailing eight bytes. I bring this up because this seems like the
only sure-fire mechanism to detect whether or not an SRF file has an
index block (other than traversing the entire file, of course).

-- John

From jkb at sanger.ac.uk Thu Sep 18 02:03:19 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Thu Sep 18 02:03:29 2008
Subject: [Ssrformat] Index block
In-Reply-To: <413BC2C30D426E45A512157E7C9BFB4301267348@wildcat.HBSC.local>
References: <413BC2C30D426E45A512157E7C9BFB4301267348@wildcat.HBSC.local>
Message-ID: <20080918090319.GY32763@sanger.ac.uk>

On Wed, Sep 17, 2008 at 04:40:26PM -0400, John Emhoff wrote:
> It appears to me that the current SRF tools generate files that are
> ever-so-slightly out of spec. The spec indicates that the last eight
> bytes of the file are to be the size of the index block, and should be
> zero if the index block is not present.
>
> My interpretation is that the index block itself is optional, but not
> these trailing eight bytes.

Thank you for reminding me of this. I had meant to fix this.

You are correct, although the simple fact is that the index system and
the solexa2srf (now illumina2srf) tool predate that addition in the
SRF spec by the best part of a year. That doesn't mean it's correct
though, so it's on the to-do list.

In my own code I actually treat zero as a block type with a format of
block_type + 7 bytes of zero. It seems a bit odd, but it means I can
gloss over them transparently, even when we've concatenated files
together using simply /bin/cat.

> I bring this up because this seems like the
> only sure-fire mechanism to detect whether or not an SRF file has an
> index block (other than traversing the entire file, of course).

In practice the chances of accidentally identifying the end of an SRF
file to be an index are minuscule. The index format ends with
"Ihsh1.01" followed by 8 bytes of length.
Seeking back that length should then yield "Ihsh1.01" and the length
again. That's 256 bits of information that has to be self-consistent,
which is vanishingly unlikely to happen by chance.

The main thing that having a mandatory zero index-size brings is ease
of detection I guess.

James

From jkb at sanger.ac.uk Thu Sep 18 02:29:57 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Thu Sep 18 02:30:04 2008
Subject: [Ssrformat] Index block
In-Reply-To: <413BC2C30D426E45A512157E7C9BFB4301267348@wildcat.HBSC.local>
References: <413BC2C30D426E45A512157E7C9BFB4301267348@wildcat.HBSC.local>
Message-ID: <20080918092956.GA32763@sanger.ac.uk>

On Wed, Sep 17, 2008 at 04:40:26PM -0400, John Emhoff wrote:
> The spec indicates that the last eight
> bytes of the file are to be the size of the index block, and should be
> zero if the index block is not present.

Actually rereading the spec I've found an oddity.

What it actually states is that the length block should be present
regardless of whether the index exists which means the length itself
should not be part of the index block specification, but it is. (The
Index block spec ends with "indexSize" which is also listed as
mandatory in the overall container structure.)

I suspect this is simply an oversight. Changing this would be an
incompatible change to the specification, but I suspect no one has
ever implemented it correctly.

Can we get this clarified?
I really don't want to start handling duplicate length fields as we
have existing SRF files already.

While we're at it, it would be preferable to adjust the definition of
Container so that the figure showing multiple containers allows for
index blocks and/or index block size to appear multiple times within a
file (we can just mandate that only the last one is utilised). It's a
minor issue, but it means concatenation works.

James

From JEmhoff at helicosbio.com Thu Sep 18 07:50:52 2008
From: JEmhoff at helicosbio.com (John Emhoff)
Date: Thu Sep 18 07:51:12 2008
Subject: [Ssrformat] Index block
In-Reply-To: <20080918092956.GA32763@sanger.ac.uk>
Message-ID: <413BC2C30D426E45A512157E7C9BFB430126734E@wildcat.HBSC.local>

> What it actually states is that the length block should be present
> regardless of whether the index exists which means the length itself
> should not be part of the index block specification, but it is. (The
> Index block spec ends with "indexSize" which is also listed as
> mandatory in the overall container structure.)
> I suspect this is simply an oversight. Changing this would be an
> incompatible change to the specification, but I suspect no one has
> ever implemented it correctly.

Just to make it clear for myself, the change you're suggesting is
essentially just to remove the "indexSize" field from the index block,
and keep the 8 byte length as the file suffix (which is an external
field from the index block). Is that right? Sounds good to me.
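
For reference, the self-consistency check described earlier in this thread (an index that ends with "Ihsh1.01" plus an 8-byte length, and that starts with the same 16 bytes when you seek back by that length) could be sketched in Python roughly as follows. This is illustrative only and not taken from io_lib. The function name `find_index` is hypothetical, it works on an in-memory bytes object for simplicity, and two details are assumptions rather than facts from the thread: that the length is an unsigned 64-bit big-endian integer, and that it counts the whole index block including its 16-byte header and 16-byte footer.

```python
import struct

MAGIC = b"Ihsh1.01"  # index magic plus version string quoted in the thread


def find_index(data):
    """Return the offset of a trailing hash index in `data`, or None.

    Assumptions: the length is an unsigned 64-bit big-endian integer and
    covers the entire index block, header and footer included.
    """
    if len(data) < 32 or data[-16:-8] != MAGIC:
        return None
    footer = data[-16:]
    (length,) = struct.unpack(">Q", footer[8:])
    # The block must at least hold its own header and footer,
    # and cannot be larger than the data it sits in.
    if length < 32 or length > len(data):
        return None
    start = len(data) - length
    # Seeking back by `length` must yield the same magic and length again.
    if data[start:start + 16] != footer:
        return None
    return start
```

A caller would treat `None` as "no index found" and fall back to a linear scan of the file.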
> While we're at it, it would be preferable to adjust the definition of
> Container so that the figure showing multiple containers allows for index
> blocks and/or index block size to appear multiple times within a file (we
> can just mandate that only the last one is utilised). It's a minor issue,
> but it means concatenation works.

This also sounds like a good idea.

-- John

From jkb at sanger.ac.uk Thu Sep 18 08:20:00 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Thu Sep 18 08:20:12 2008
Subject: [Ssrformat] Index block
In-Reply-To: <413BC2C30D426E45A512157E7C9BFB430126734E@wildcat.HBSC.local>
References: <20080918092956.GA32763@sanger.ac.uk>
 <413BC2C30D426E45A512157E7C9BFB430126734E@wildcat.HBSC.local>
Message-ID: <20080918151959.GD32763@sanger.ac.uk>

On Thu, Sep 18, 2008 at 10:50:52AM -0400, John Emhoff wrote:
> Just to make it clear for myself, the change you're suggesting is
> essentially just to remove the "indexSize" field from the index block, and
> keep the 8 byte length as the file suffix (which is an external field
> from the index block). Is that right? Sounds good to me.

Something equivalent to that. Either we can remove it from the
definition of the index block as suggested above or adjust the initial
overview table to indicate that the index block is mandatory, but in
the definition of the index block allow for it to be either the full
index or a null-index instead.

James

PS. I've just munged the srf_index_hash code so that most of the
indexing code is in the io_lib library part rather than the program
itself. So now it's far less work for other programs to create an
index, meaning I'll provide an option for illumina2srf to write its
own index anyway. (It'll write out the 8 zeros otherwise.)

From jkb at sanger.ac.uk Mon Sep 22 01:53:24 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Mon Sep 22 01:53:34 2008
Subject: [Ssrformat] Index block
In-Reply-To: <86C6E520C12E52429ACBCB01546DF4D301679819@xchange1.phage.bcgsc.ca>
References: <20080918151959.GD32763@sanger.ac.uk>
 <86C6E520C12E52429ACBCB01546DF4D301679819@xchange1.phage.bcgsc.ca>
Message-ID: <20080922085324.GI32763@sanger.ac.uk>

On Sun, Sep 21, 2008 at 07:23:33PM -0700, Asim Siddiqui wrote:
> This was corrected in the version of SRF released on June 19th
> (attached). I will update the website this week.

Ah thanks Asim. When I went to save the document you posted I realised
I already had it too! Just that my printed out copy was the old one
clearly - sorry for the confusion there.

The new one is fine regarding removing the index size from the end of
the index block definition. I still think though that maybe we should
be explicit about index blocks (or blank index block sizes) appearing
in the middle of a file - that they should be silently ignored. (Well
either that or we finally should get around to writing a proper
concatenation tool that strips out such things.)

I'm also still unsure about the multi-file format version of SRF. I do
not support this in io_lib and currently have no plans to either,
until I find a valid use for it. I think I asked before who needed it
and received no reply, but I'll ask again.

Does anyone make use of or see a use for the multi-file format of SRF,
where DBH, DB and CH are not necessarily contained within the same
file?
James

From asims at bcgsc.ca Sun Sep 21 19:23:33 2008
From: asims at bcgsc.ca (Asim Siddiqui)
Date: Fri Sep 26 09:30:57 2008
Subject: [Ssrformat] Index block
References: <20080918092956.GA32763@sanger.ac.uk><413BC2C30D426E45A512157E7C9BFB430126734E@wildcat.HBSC.local>
 <20080918151959.GD32763@sanger.ac.uk>
Message-ID: <86C6E520C12E52429ACBCB01546DF4D301679819@xchange1.phage.bcgsc.ca>

Skipped content of type multipart/alternative
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ShortSequenceFormat_v_1_3_2_June_19th_2008.doc
Type: application/octet-stream
Size: 497152 bytes
Desc: ShortSequenceFormat_v_1_3_2_June_19th_2008.doc
Url : http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080921/828daa5b/ShortSequenceFormat_v_1_3_2_June_19th_2008-0001.obj

From asims at bcgsc.ca Sun Sep 28 21:57:50 2008
From: asims at bcgsc.ca (Asim Siddiqui)
Date: Mon Sep 29 11:25:53 2008
Subject: [Ssrformat] v1.3.2
Message-ID: <86C6E520C12E52429ACBCB01546DF4D30167981E@xchange1.phage.bcgsc.ca>

Skipped content of type multipart/alternative
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SRF_v_1_3_2_June_19th_2008.pdf
Type: application/pdf
Size: 237392 bytes
Desc: SRF_v_1_3_2_June_19th_2008.pdf
Url : http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080928/2f0d1125/SRF_v_1_3_2_June_19th_2008-0001.pdf