From JEmhoff at helicosbio.com Mon Jun 2 13:04:30 2008
From: JEmhoff at helicosbio.com (John Emhoff)
Date: Mon Jun 2 13:04:37 2008
Subject: [Ssrformat] SRF XML block
Message-ID: <413BC2C30D426E45A512157E7C9BFB4301267198@wildcat.HBSC.local>

Hey James --

It looks like srf_read_xml and srf_write_xml in srf.c are inconsistent
with each other. srf_write_xml includes the block type and block length
fields in its length calculation (which I feel is consistent with the
SRF spec) while srf_read_xml does not. The effect of this is an
unreadable XML block.

Please find below a patch for srf_read_xml!

-- John

--- srf-orig.c 2008-06-02 15:51:19.000000000 -0400
+++ srf.c 2008-06-02 15:54:14.000000000 -0400
@@ -295,6 +295,8 @@
     if (0 != srf_read_uint32(srf, &xml->xml_len))
         return -1;
 
+    xml->xml_len -= 5;
+
     if (NULL == (xml->xml = (char *)realloc(xml->xml, xml->xml_len+1)))
         return -1;
     if (xml->xml_len != fread(xml->xml, 1, xml->xml_len, srf->fp))

From jkb at sanger.ac.uk Tue Jun 3 09:22:08 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Tue Jun 3 09:22:30 2008
Subject: [Ssrformat] Re: SRF XML block
In-Reply-To: <413BC2C30D426E45A512157E7C9BFB4301267198@wildcat.HBSC.local>
References: <413BC2C30D426E45A512157E7C9BFB4301267198@wildcat.HBSC.local>
Message-ID: <20080603162208.GH7772@sanger.ac.uk>

Hello John,

> --- srf-orig.c 2008-06-02 15:51:19.000000000 -0400
> +++ srf.c 2008-06-02 15:54:14.000000000 -0400
> @@ -295,6 +295,8 @@
>      if (0 != srf_read_uint32(srf, &xml->xml_len))
>          return -1;
>
> +    xml->xml_len -= 5;
> +
>      if (NULL == (xml->xml = (char *)realloc(xml->xml, xml->xml_len+1)))
>          return -1;
>      if (xml->xml_len != fread(xml->xml, 1, xml->xml_len, srf->fp))

Thanks for spotting this. I agree that the correct behaviour is to
interpret the length as the entire SRF block element rather than just
the length of the XML text. Table 6.6.1 in the SRF v1.3 spec states as
much too.

James

--
James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
                                  | Plurima gyrabant gymbolitare vabo;
A Staden Package developer:       | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi.

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

From asims at bcgsc.ca Wed Jun 4 10:45:00 2008
From: asims at bcgsc.ca (Asim Siddiqui)
Date: Wed Jun 4 10:45:20 2008
Subject: [Ssrformat] telecon postponed
Message-ID: <86C6E520C12E52429ACBCB01546DF4D3016797FC@xchange1.phage.bcgsc.ca>

Hi all,

We are due a telecon, but I'm going to delay it by two weeks.

James and I are working on a major revision for the ZTR spec and I'd
like to post it to the group for review ahead of a call.

Best Regards,

Asim
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080604/f765eca4/attachment.htm

From cgoina at jcvi.org Mon Jun 9 08:05:56 2008
From: cgoina at jcvi.org (Goina, Cristian)
Date: Mon Jun 9 08:06:03 2008
Subject: [Ssrformat] Mate pairs in SRF
Message-ID: <3EC2264E4927524DB3F1C0425831CCA51138BF@EXCHANGE.TIGR.ORG>

Hello everybody,

I am working on adding support for mated reads to the SOLiD to SRF tool
and I have some questions about the spec. The only posting related to
it that I saw was Mike Attili's initial proposal, and the documentation
is really vague on this issue.

1. Why is there a need for a text chunk with REGION_LIST data when all
these data can actually be put in the metadata of the REGN chunk? I
guess I just don't see what type of information could be put in there
that would apply to all reads grouped under the same data header
block instead of just the mated pair.

2. For SOLiD and maybe for SOLEXA too all reads have the same length -
would it be possible to have just one REGN chunk in the data block
header instead of one in each data block? The way I would like it to
work is to define the regions (REGN chunk) in the data block header
which could be overridden by a REGN chunk defined in the data block
itself. I know this adds code complexity but it could eliminate a lot
of redundancy, and when there are 100mil paired sequences in a file
that is close to 400M of saved space (1 boundary per pair).

Regards
Cristian Goina
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080609/03706545/attachment.htm

From jkb at sanger.ac.uk Mon Jun 9 08:39:32 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Mon Jun 9 08:39:39 2008
Subject: [Ssrformat] Mate pairs in SRF
In-Reply-To: <3EC2264E4927524DB3F1C0425831CCA51138BF@EXCHANGE.TIGR.ORG>
References: <3EC2264E4927524DB3F1C0425831CCA51138BF@EXCHANGE.TIGR.ORG>
Message-ID: <20080609153932.GA27919@sanger.ac.uk>

On Mon, Jun 09, 2008 at 11:05:56AM -0400, Goina, Cristian wrote:
> 1. Why is there a need for a text chunk with REGION_LIST data when all
> these data can actually be put in the metadata of the REGN chunk? I
> guess I just don't see what type of information could be put in there
> and that would apply to all reads grouped under the same data header
> block instead of just the mated pair.

I'm not convinced of the REGION_LIST part of the text chunk either. In
the ZTR documentation this contains a comment: "FIXME: Should this
simply be the meta-data associated with the REGN chunk?", so clearly
when I added it I was already somewhat unsure. My choice would be to
ditch the REGION_LIST part of the text chunk as the REGN meta-data is
the more natural location.

In my implementation in illumina2srf (aka solexa2srf) I ignored the
TEXT chunk. I should go back and fix the spec I guess.
At the time of writing v1.3 it was a work in progress with lots of
queries and uncertainties. My intention was to have a period of
discussion and consolidation, but I received zero replies to any of the
questions in there.

I'm working now with Asim on tidying this up though.

Does anyone have any thoughts on whether it would be preferable to have
ZTR v1.4 with some backwards compatible tidyups or to simply jump to
v2.0 with a few incompatibilities (eg endianness change in one location
for consistency's sake, removal of some defunct compression types, a
change for the default order we store qualities and trace data in,
etc). On one hand we get a cleaner spec with v2.0, but it also would
involve more work to support more radical changes.

My personal view is I'd like to tidy up and produce ZTR 2.0, but I
can't really commit the time to do a decent job on it so smaller
incremental changes are preferable.

> 2. For SOLiD and maybe for SOLEXA too all reads have the same length -
> would it be possible to have just one REGN chunk in the data block
> header instead of one in each data block? The way I would like it to
> work is to define the regions (REGN chunk) in the data block header
> which could be overridden by a REGN chunk defined in the data block
> itself. I know this adds code complexity but it could eliminate a lot of
> redundancy and when there are 100mil paired sequences in a file that is
> close to 400M of saved space (1 boundary per pair).

I hadn't thought about potentially it going in two places and one
overriding the other, but for illumina2srf the REGN chunk does already
go into the header component so takes up an insignificant amount of
space.

The SRF spec deliberately doesn't state what parts of a ZTR file should
belong in which segment (header vs body) as this unnecessarily
restricts clever tricks that we hadn't thought of ourselves.

James

--
James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
                                  | Plurima gyrabant gymbolitare vabo;
A Staden Package developer:       | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi.

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

From tim at discoverybio.com Mon Jun 9 09:17:23 2008
From: tim at discoverybio.com (tim@discoverybio.com)
Date: Mon Jun 9 09:18:04 2008
Subject: [Ssrformat] Mate pairs in SRF
Message-ID: <200806091617.m59GHO02020855@omr4.networksolutionsemail.com>

For SOLiD and presumably Solexa there is no guarantee that both ends
will be sequenced to the same length. They usually are, but machine or
timing issues can result in differences.

Tim Hunkapiller
206-898-0568
tim@discoverybio.com

-----Original Message-----
From: James Bonfield
Sent: Monday, June 09, 2008 8:39 AM
To: "Goina, Cristian"
Cc: ssrformat@bcgsc.ca
Subject: Re: [Ssrformat] Mate pairs in SRF

On Mon, Jun 09, 2008 at 11:05:56AM -0400, Goina, Cristian wrote:
> 1. Why is there a need for a text chunk with REGION_LIST data when all
> these data can actually be put in the metadata of the REGN chunk? I
> guess I just don't see what type of information could be put in there
> and that would apply to all reads grouped under the same data header
> block instead of just the mated pair.

I'm not convinced of the REGION_LIST part of the text chunk either. In
the ZTR documentation this contains a comment: "FIXME: Should this
simply be the meta-data associated with the REGN chunk?", so clearly
when I added it I was already somewhat unsure. My choice would be to
ditch the REGION_LIST part of the text chunk as the REGN meta-data is
the more natural location.

In my implementation in illumina2srf (aka solexa2srf) I ignored the
TEXT chunk. I should go back and fix the spec I guess.
At the time of writing v1.3 it was a work in progress with lots of
queries and uncertainties. My intention was to have a period of
discussion and consolidation, but I received zero replies to any of the
questions in there.

I'm working now with Asim on tidying this up though.

Does anyone have any thoughts on whether it would be preferable to have
ZTR v1.4 with some backwards compatible tidyups or to simply jump to
v2.0 with a few incompatibilities (eg endianness change in one location
for consistency's sake, removal of some defunct compression types, a
change for the default order we store qualities and trace data in,
etc). On one hand we get a cleaner spec with v2.0, but it also would
involve more work to support more radical changes.

My personal view is I'd like to tidy up and produce ZTR 2.0, but I
can't really commit the time to do a decent job on it so smaller
incremental changes are preferable.

> 2. For SOLiD and maybe for SOLEXA too all reads have the same length -
> would it be possible to have just one REGN chunk in the data block
> header instead of one in each data block? The way I would like it to
> work is to define the regions (REGN chunk) in the data block header
> which could be overridden by a REGN chunk defined in the data block
> itself. I know this adds code complexity but it could eliminate a lot of
> redundancy and when there are 100mil paired sequences in a file that is
> close to 400M of saved space (1 boundary per pair).

I hadn't thought about potentially it going in two places and one
overriding the other, but for illumina2srf the REGN chunk does already
go into the header component so takes up an insignificant amount of
space.

The SRF spec deliberately doesn't state what parts of a ZTR file should
belong in which segment (header vs body) as this unnecessarily
restricts clever tricks that we hadn't thought of ourselves.

James

--
James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
                                  | Plurima gyrabant gymbolitare vabo;
A Staden Package developer:       | Et Borogovorum mimzebant undique formae,

[The entire original message is not included]

From cgoina at jcvi.org Mon Jun 9 09:43:30 2008
From: cgoina at jcvi.org (Goina, Cristian)
Date: Mon Jun 9 09:43:36 2008
Subject: [Ssrformat] Mate pairs in SRF
In-Reply-To: <200806091617.m59GHO0w020854@omr4.networksolutionsemail.com>
References: <200806091617.m59GHO0w020854@omr4.networksolutionsemail.com>
Message-ID: <3EC2264E4927524DB3F1C0425831CCA51138C1@EXCHANGE.TIGR.ORG>

That really doesn't create any problems if all reads from an end have
the same length. For example if all forward reads are 35 and all
reverse reads are 25 it will work. The REGN per dataheader block will
create problems only if some forward reads are, let's say, 35 and
others are 25. From what I heard this is possible for 454 but I haven't
seen any data from SOLiD like that. Do you know if this is possible -
maybe Jonathan can chime in and give us his thoughts on this regarding
SOLiD?

But then if this scenario is possible for 454 the spec should provide
for a mechanism to allow mated sequences of different lengths in the
same SRF.

Cristian Goina

> -----Original Message-----
> From: tim@discoverybio.com [mailto:tim@discoverybio.com]
> Sent: Monday, June 09, 2008 12:17 PM
> To: jkb@sanger.ac.uk; Goina, Cristian
> Cc: ssrformat@bcgsc.ca
> Subject: RE: [Ssrformat] Mate pairs in SRF
>
> For SOLiD and presumably Solexa there is no guarantee that
> both ends will be sequenced to the same length. They usually
> are, but machine or timing issues can result in differences.
>
> Tim Hunkapiller
> 206-898-0568
> tim@discoverybio.com
>
>
> -----Original Message-----
> From: James Bonfield
> Sent: Monday, June 09, 2008 8:39 AM
> To: "Goina, Cristian"
> Cc: ssrformat@bcgsc.ca
> Subject: Re: [Ssrformat] Mate pairs in SRF
>
> On Mon, Jun 09, 2008 at 11:05:56AM -0400, Goina, Cristian wrote:
> > 1.
> > Why is there a need for a text chunk with REGION_LIST data when all
> > these data can actually be put in the metadata of the REGN chunk? I
> > guess I just don't see what type of information could be put in there
> > and that would apply to all reads grouped under the same data header
> > block instead of just the mated pair.
>
> I'm not convinced of the REGION_LIST part of the text chunk
> either. In the ZTR documentation this contains a comment:
> "FIXME: Should this simply be the meta-data associated with
> the REGN chunk?", so clearly when I added it I was already
> somewhat unsure. My choice would be to ditch the REGION_LIST
> part of the text chunk as the REGN meta-data is the more
> natural location.
>
> In my implementation in illumina2srf (aka solexa2srf) I
> ignored the TEXT chunk. I should go back and fix the spec I
> guess. At the time of writing v1.3 it was a work in
> progress with lots of queries and uncertainties. My intention
> was to have a period of discussion and consolidation, but I
> received zero replies to any of the questions in there.
>
> I'm working now with Asim on tidying this up though.
>
> Does anyone have any thoughts on whether it would be
> preferable to have ZTR v1.4 with some backwards compatible
> tidyups or to simply jump to v2.0 with a few
> incompatibilities (eg endianness change in one location for
> consistency's sake, removal of some defunct compression
> types, a change for the default order we store qualities and
> trace data in, etc). On one hand we get a cleaner spec with
> v2.0, but it also would involve more work to support more
> radical changes.
>
> My personal view is I'd like to tidy up and produce ZTR 2.0,
> but I can't really commit the time to do a decent job on it
> so smaller incremental changes are preferable.
>
> > 2. For SOLiD and maybe for SOLEXA too all reads have the same length -
> > would it be possible to have just one REGN chunk in the data block
> > header instead of one in each data block? The way I would like it to
> > work is to define the regions (REGN chunk) in the data block header
> > which could be overridden by a REGN chunk defined in the data block
> > itself. I know this adds code complexity but it could eliminate a lot
> > of redundancy and when there are 100mil paired sequences in a file
> > that is close to 400M of saved space (1 boundary per pair).
>
> I hadn't thought about potentially it going in two places and
> one overriding the other, but for illumina2srf the REGN chunk
> does already go into the header component so takes up an
> insignificant amount of space.
>
> The SRF spec deliberately doesn't state what parts of a ZTR
> file should belong in which segment (header vs body) as this
> unnecessarily restricts clever tricks that we hadn't thought
> of ourselves.
>
> James
>
> --
> James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
>                                   | Plurima gyrabant gymbolitare vabo;
> A Staden Package developer:       | Et Borogovorum mimzebant undique formae,
>
>
> [The entire original message is not included]
>
>

From jkb at sanger.ac.uk Mon Jun 9 09:59:57 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Mon Jun 9 10:00:05 2008
Subject: [Ssrformat] Mate pairs in SRF
In-Reply-To: <3EC2264E4927524DB3F1C0425831CCA51138C1@EXCHANGE.TIGR.ORG>
References: <200806091617.m59GHO0w020854@omr4.networksolutionsemail.com>
 <3EC2264E4927524DB3F1C0425831CCA51138C1@EXCHANGE.TIGR.ORG>
Message-ID: <20080609165957.GB27919@sanger.ac.uk>

On Mon, Jun 09, 2008 at 12:43:30PM -0400, Goina, Cristian wrote:
> That really doesn't create any problems if all reads from an end have
> the same length. For example if all forward reads are 35 and all reverse
> reads are 25 it will work.
> The REGN per dataheader block will create
> problems only if some forward reads are let's say 35 and others are 25.
> From what I heard this is possible for 454 but I haven't seen any data
> from SOLiD like that.

Agreed on solexa/solid - all samples within a single tile/panel should
have the same number of incorporation steps so the REGN values are
constant.

For 454 it probably depends on the protocols being used. I could
envisage a scenario where a fixed number of flows are performed from
one end and another set from the other. The number of base calls per
end will therefore vary each reading due to the nature of
pyrosequencing, but the number of flows is still constant. This is why
there is provision for specifying the units used in the REGN chunk to
be trace coordinates instead of base coordinates.

In reality though I'm not sure if there are any 454 protocols that work
this way. My understanding of the current paired end 454 protocols is
that the sequence is sliced/diced in the wet-lab stages to end up with
one piece of DNA containing both ends of the pair joined together with
an adapter between them. It's then handled as a single read as far as
the instrument is concerned. The two ends are then bioinformatically
determined, although whether that's before or after writing the SRF I
wouldn't know.

I think though there may be tagging protocols where specifying a
constant REGN for 454 makes sense. The most obvious of these is the
initial TACG tag added to the start of every real read (and something
else, TCAG?, added to all control samples spiked into the data set).
(It's still not possible to do this at the flow coordinate level though
as you sometimes end up with a fractional number of flows forming the
tag, eg if the sequence starts with a G.)

James

--
James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
                                  | Plurima gyrabant gymbolitare vabo;
A Staden Package developer:       | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi.

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

From asims at bcgsc.ca Mon Jun 9 16:59:46 2008
From: asims at bcgsc.ca (Asim Siddiqui)
Date: Mon Jun 9 17:00:10 2008
Subject: [Ssrformat] Mate pairs in SRF
References: <3EC2264E4927524DB3F1C0425831CCA51138BF@EXCHANGE.TIGR.ORG>
 <20080609153932.GA27919@sanger.ac.uk>
Message-ID: <86C6E520C12E52429ACBCB01546DF4D3016797FE@xchange1.phage.bcgsc.ca>

Hi Cristian,

The REGION_LIST as part of the text chunk has been dropped from V2.0 of
ZTR.

No problem with the REGN Block going into the header, but overrides are
not part of the ZTR spec. Can you clarify the situation where that is
required? Is it that you have certain reads in the panel for which only
one end is sequenced?

As Jon may have mentioned, I'm working with him to create a spec
defining the SRF/ZTR implementation spec for SOLiD. I think we are
pretty much in sync, but we should make sure.

Asim

________________________________

From: ssrformat-bounces@mail.bcgsc.ca on behalf of James Bonfield
Sent: Mon 09/06/2008 8:39 AM
To: Goina, Cristian
Cc: ssrformat@bcgsc.ca
Subject: Re: [Ssrformat] Mate pairs in SRF

On Mon, Jun 09, 2008 at 11:05:56AM -0400, Goina, Cristian wrote:
> 1. Why is there a need for a text chunk with REGION_LIST data when all
> these data can actually be put in the metadata of the REGN chunk? I
> guess I just don't see what type of information could be put in there
> and that would apply to all reads grouped under the same data header
> block instead of just the mated pair.

I'm not convinced of the REGION_LIST part of the text chunk either.
In the ZTR documentation this contains a comment: "FIXME: Should this
simply be the meta-data associated with the REGN chunk?", so clearly
when I added it I was already somewhat unsure. My choice would be to
ditch the REGION_LIST part of the text chunk as the REGN meta-data is
the more natural location.

In my implementation in illumina2srf (aka solexa2srf) I ignored the
TEXT chunk. I should go back and fix the spec I guess. At the time of
writing v1.3 it was a work in progress with lots of queries and
uncertainties. My intention was to have a period of discussion and
consolidation, but I received zero replies to any of the questions in
there.

I'm working now with Asim on tidying this up though.

Does anyone have any thoughts on whether it would be preferable to have
ZTR v1.4 with some backwards compatible tidyups or to simply jump to
v2.0 with a few incompatibilities (eg endianness change in one location
for consistency's sake, removal of some defunct compression types, a
change for the default order we store qualities and trace data in,
etc). On one hand we get a cleaner spec with v2.0, but it also would
involve more work to support more radical changes.

My personal view is I'd like to tidy up and produce ZTR 2.0, but I
can't really commit the time to do a decent job on it so smaller
incremental changes are preferable.

> 2. For SOLiD and maybe for SOLEXA too all reads have the same length -
> would it be possible to have just one REGN chunk in the data block
> header instead of one in each data block? The way I would like it to
> work is to define the regions (REGN chunk) in the data block header
> which could be overridden by a REGN chunk defined in the data block
> itself. I know this adds code complexity but it could eliminate a lot of
> redundancy and when there are 100mil paired sequences in a file that is
> close to 400M of saved space (1 boundary per pair).

I hadn't thought about potentially it going in two places and one
overriding the other, but for illumina2srf the REGN chunk does already
go into the header component so takes up an insignificant amount of
space.

The SRF spec deliberately doesn't state what parts of a ZTR file should
belong in which segment (header vs body) as this unnecessarily
restricts clever tricks that we hadn't thought of ourselves.

James

--
James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
                                  | Plurima gyrabant gymbolitare vabo;
A Staden Package developer:       | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi.

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
_______________________________________________
Ssrformat mailing list
Ssrformat@mail.bcgsc.ca
http://www.bcgsc.ca/mailman/listinfo/ssrformat
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080609/bc9bc053/attachment.htm

From cgoina at jcvi.org Tue Jun 10 05:02:57 2008
From: cgoina at jcvi.org (Goina, Cristian)
Date: Tue Jun 10 05:03:10 2008
Subject: [Ssrformat] Mate pairs in SRF
In-Reply-To: <86C6E520C12E52429ACBCB01546DF4D3016797FE@xchange1.phage.bcgsc.ca>
References: <3EC2264E4927524DB3F1C0425831CCA51138BF@EXCHANGE.TIGR.ORG>
 <20080609153932.GA27919@sanger.ac.uk>
 <86C6E520C12E52429ACBCB01546DF4D3016797FE@xchange1.phage.bcgsc.ca>
Message-ID: <3EC2264E4927524DB3F1C0425831CCA51138C4@EXCHANGE.TIGR.ORG>

Asim,

now that you mention it, I think it is possible that some reads will
not have mates.
I'm sure Jon can explain this a lot better but I think in SOLiD's case
F3 and R3 reads are generated at different times and maybe when the
last of the two primers is run the bead may have degraded so badly that
the read is not usable or maybe not even placed in the results file. I
will let you know if this is really the case with the data I am testing
against when I get to run something.

Moreover this has nothing to do with allowing overrides in ZTR - it is
more a question of which REGN block to apply for interpreting the SRF
data block, and the logic is not that complicated:

  if ZTR_REGN present in the DB use this one
  otherwise use ZTR_REGN from corresponding DBH

For now I think I will try this approach until I am convinced from the
data that, for R3_F3, all reads have mates - it's OK if they are in the
rejected file but they must be either in the good reads file or the
rejected reads file.

Cristian

________________________________

From: Asim Siddiqui [mailto:asims@bcgsc.ca]
Sent: Monday, June 09, 2008 8:00 PM
To: James Bonfield; Goina, Cristian
Cc: ssrformat@bcgsc.ca
Subject: RE: [Ssrformat] Mate pairs in SRF

Hi Cristian,

The REGION_LIST as part of the text chunk has been dropped from V2.0 of
ZTR.

No problem with the REGN Block going into the header, but overrides are
not part of the ZTR spec. Can you clarify the situation where that is
required? Is it that you have certain reads in the panel for which only
one end is sequenced?

As Jon may have mentioned, I'm working with him to create a spec
defining the SRF/ZTR implementation spec for SOLiD. I think we are
pretty much in sync, but we should make sure.

Asim

... rest of the message deleted
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080610/f7263792/attachment.htm

From cgoina at jcvi.org Wed Jun 11 11:44:44 2008
From: cgoina at jcvi.org (Goina, Cristian)
Date: Wed Jun 11 11:44:52 2008
Subject: [Ssrformat] About SRF parsing
In-Reply-To: <20080611151525.GG27919@sanger.ac.uk>
References: <484E4950.1090703@ebi.ac.uk>
 <3EC2264E4927524DB3F1C0425831CCA51138C3@EXCHANGE.TIGR.ORG>
 <484E82FC.8090703@ebi.ac.uk>
 <3EC2264E4927524DB3F1C0425831CCA51138C6@EXCHANGE.TIGR.ORG>
 <484EA336.5080403@ebi.ac.uk>
 <3EC2264E4927524DB3F1C0425831CCA51138C8@EXCHANGE.TIGR.ORG>
 <484FD8A9.1070301@ebi.ac.uk>
 <3EC2264E4927524DB3F1C0425831CCA51138C9@EXCHANGE.TIGR.ORG>
 <20080611151525.GG27919@sanger.ac.uk>
Message-ID: <3EC2264E4927524DB3F1C0425831CCA51138CC@EXCHANGE.TIGR.ORG>

This is a follow-up on a discussion Rajesh Radhakrishnan and I were
having about parsing an SRF file using the java tool; we asked James
for help in checking whether there is a problem with the SRF file
itself. I am posting it here only because there might be a bug in
io_lib.

James,

I think I know why io_lib is working with this bad SRF. Here are the
steps:

- read ZTR blob that belongs to SRF data header in memory in an mFILE
  object (I will use the term object even though it's just a pointer) -
  this will read all the bytes that belong to SRF DBH's ZTR which ends
  with SMP4\0\0

000019f0: 6e61 6c79 7369 733e 0a00 534d 5034 0000  nalysis>..SMP4..

- because extracting SMP4 chunk fails silently due to malloc problems
  mFILE's pointer is rolled back and what's left in it is:

000019f0: XXXX XXXX XXXX XXXX XXXX 534d 5034 0000  XXXXXXXXXXSMP4..
  (the Xs are already consumed so mFILE points to SMP4\0\0)

- continue reading from SRF's FILE object (note this is a FILE not an
  mFILE) the next datablock fields (type, read flags, read id)

00001a00: 5200 0006 9800 0735 3634 3a33 3033       R......564:303

- then read the corresponding ZTR chunks, but the chunks are read in
  bulk into a newly allocated buffer - and what it reads is:

00001a00:                                    000b  R......564:303..
00001a10: 4f46 4653 0032 3831 3837 0000 0002 494d  OFFS.28187....IM
00001a20: 813a a001 1e70 c0bf 8f4a d9c7 f1d3 ....

- then take the buffer and append it to the existing mFILE's buffer
  which ends with SMP4\0\0 and the result is

XXXX XXXX XXXX XXXX XXXX 534d 5034 000b  XXXXXXXXXXSMP4....
4f46 4653 0032 3831 3837 0000 0002 494d  OFFS.28187....IM
813a a001 1e70 c0bf 8f4a d9c7 f1d3 ....

- then continue from where it left off, which just happens to work
  correctly because the next chunk is the SMP4.

Now this explains the reading, and even though I cannot explain the
writing I bet something similar happened - the difference is that it
wrote a wrong number of bytes when it wrote the DBH's ZTR, so it wrote
6 additional bytes that should've been written with the DB's ZTR. So my
guess is that the bug is where it calculates the size of the DBH's ZTR
blob. I hope your findings are somewhat similar and the fix is not too
bad.
Maybe you already fixed the problem and all that needs to be done is to
regenerate the SRF.

Cristian Goina

From asims at bcgsc.ca Thu Jun 12 00:31:31 2008
From: asims at bcgsc.ca (Asim Siddiqui)
Date: Thu Jun 12 00:31:45 2008
Subject: [Ssrformat] Mate pairs in SRF
References: <3EC2264E4927524DB3F1C0425831CCA51138BF@EXCHANGE.TIGR.ORG>
 <20080609153932.GA27919@sanger.ac.uk>
 <86C6E520C12E52429ACBCB01546DF4D3016797FE@xchange1.phage.bcgsc.ca>
 <3EC2264E4927524DB3F1C0425831CCA51138C4@EXCHANGE.TIGR.ORG>
Message-ID: <86C6E520C12E52429ACBCB01546DF4D301679801@xchange1.phage.bcgsc.ca>

Cristian,

Jon and I spoke and we recommend padding the BASE (and CNF1 and SAMP)
chunk for the missing mate instead.

Padding is going to pop up anyway (see next paragraph) and with RLE
compression requires less space than an additional REGN block.

If the constraints on the quality of mate pairing reads are relaxed, it
is possible to have a shorter read than usual e.g. usual read length
for one end may be 35, but a particular read may only have 30 bp yet
still yield a good alignment. In this situation the SOLiD read will
contain a padding character ".". I suggest we use the same character in
SRF/ZTR - note that this is distinct from "N" which represents a
base/colour for which a call was attempted. "." represents a
base/colour for which a call was not attempted.

Asim

________________________________

From: Goina, Cristian [mailto:cgoina@jcvi.org]
Sent: Tue 10/06/2008 5:02 AM
To: Asim Siddiqui; James Bonfield
Cc: ssrformat@bcgsc.ca
Subject: RE: [Ssrformat] Mate pairs in SRF

Asim,

now that you mention it, I think it is possible that some reads will
not have mates. I'm sure Jon can explain this a lot better but I think
in SOLiD's case F3 and R3 reads are generated at different times and
maybe when the last of the two primers is run the bead may have
degraded so badly that the read is not usable or maybe not even placed
in the results file.
I will let you know if this is really the case with the data I am
testing against when I get to run something.

Moreover this has nothing to do with allowing overrides in ZTR - it is
more a question of which REGN block to apply for interpreting the SRF
data block, and the logic is not that complicated:

  if ZTR_REGN present in the DB use this one
  otherwise use ZTR_REGN from corresponding DBH

For now I think I will try this approach until I am convinced from the
data that, for R3_F3, all reads have mates - it's OK if they are in the
rejected file but they must be either in the good reads file or the
rejected reads file.

Cristian

________________________________

From: Asim Siddiqui [mailto:asims@bcgsc.ca]
Sent: Monday, June 09, 2008 8:00 PM
To: James Bonfield; Goina, Cristian
Cc: ssrformat@bcgsc.ca
Subject: RE: [Ssrformat] Mate pairs in SRF

Hi Cristian,

The REGION_LIST as part of the text chunk has been dropped from V2.0 of
ZTR.

No problem with the REGN Block going into the header, but overrides are
not part of the ZTR spec. Can you clarify the situation where that is
required? Is it that you have certain reads in the panel for which only
one end is sequenced?

As Jon may have mentioned, I'm working with him to create a spec
defining the SRF/ZTR implementation spec for SOLiD. I think we are
pretty much in sync, but we should make sure.

Asim

... rest of the message deleted
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080612/4357daf9/attachment.htm

From jkb at sanger.ac.uk Thu Jun 12 01:08:28 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Thu Jun 12 01:08:35 2008
Subject: [Ssrformat] Mate pairs in SRF
In-Reply-To: <86C6E520C12E52429ACBCB01546DF4D301679801@xchange1.phage.bcgsc.ca>
References: <3EC2264E4927524DB3F1C0425831CCA51138BF@EXCHANGE.TIGR.ORG>
 <20080609153932.GA27919@sanger.ac.uk>
 <86C6E520C12E52429ACBCB01546DF4D3016797FE@xchange1.phage.bcgsc.ca>
 <3EC2264E4927524DB3F1C0425831CCA51138C4@EXCHANGE.TIGR.ORG>
 <86C6E520C12E52429ACBCB01546DF4D301679801@xchange1.phage.bcgsc.ca>
Message-ID: <20080612080828.GH27919@sanger.ac.uk>

Hello Asim,

> Jon and I spoke and we recommend padding the BASE (and CNF1 and
> SAMP) chunk missing mate instead.

> Padding is going to pop up anyway (see next paragraph) and with RLE
> compression requires less space than an additional REGN block.

I had another idea last night. What percentage of sequences end up with
just one end? If it's just a handful then padding is maybe a sensible
solution.

However I thought perhaps we could generate two DBH blocks per panel
instead of one. The first consists of all reads with both ends present
and the second with the reads only having one end. This means we don't
need any fancy codes to indicate the end isn't present or add fake data
to pad it out. The overheads would be minimal too.

James

--
James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
                                  | Plurima gyrabant gymbolitare vabo;
A Staden Package developer:       | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi.

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
From jkb at sanger.ac.uk Thu Jun 12 01:54:42 2008 From: jkb at sanger.ac.uk (James Bonfield) Date: Thu Jun 12 01:54:48 2008 Subject: [Ssrformat] Re: About SRF parsing In-Reply-To: <3EC2264E4927524DB3F1C0425831CCA51138CC@EXCHANGE.TIGR.ORG> References: <484E4950.1090703@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C3@EXCHANGE.TIGR.ORG> <484E82FC.8090703@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C6@EXCHANGE.TIGR.ORG> <484EA336.5080403@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C8@EXCHANGE.TIGR.ORG> <484FD8A9.1070301@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C9@EXCHANGE.TIGR.ORG> <20080611151525.GG27919@sanger.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138CC@EXCHANGE.TIGR.ORG> Message-ID: <20080612085442.GI27919@sanger.ac.uk> Initially the thought of corrupted SRF files had me worrying, especially when I realised they were 1000genomes data being generated by my code. However to summarise the mail below I think that the files are infact correct and io_lib's behaviour is correct too. On Wed, Jun 11, 2008 at 02:44:44PM -0400, Goina, Cristian wrote: > - read ZTR blob that belongs to SRF data header in memory in an mFILE > object (I will use the term object even though it's just a pointer) - Well I'd argue a class is just a pointer to a block of memory too; in C++ you can even replace the word class with struct and it still compiles and runs as a class is basically a struct with some standard pointers to functions in it (constructor, etc). mFILE's really are a truely hideous hack; one I was both strangely proud of and yet slightly appalled of at the same time. :-) Anyway semantics aside, on to the real issue. > this will read all the bytes that belong to SRF DBH's ZTR which ends > with SMP4\0\0 > 000019f0: 6e61 6c79 7369 733e 0a00 534d 5034 0000 nalysis>..SMP4.. This is infact correct and intended. The SRF specification does not state that the DBH should end on an exact ZTR chunk boundary. 
About all it states is that when DBH and DB are pasted together the
final string should be a valid ZTR file. I wrote about this in the SRF
Rationale, but I'm not sure if this ever made it into the official SRF
specification. (Quite possibly we haven't had a new official release
since then.)

> - because extracting SMP4 chunk fails silently due to malloc problems
> mFILE's pointer is rolled back and what's left in it is:
>
> 000019f0: XXXX XXXX XXXX XXXX XXXX 534d 5034 0000 XXXXXXXXXXSMP4..
> (the Xs are already consumed so mFILE points to SMP4\0\0

Are you sure it actually fails with malloc problems? I can't see any in
io_lib's srf2fasta either by manually stepping through the appropriate
part of the code or in valgrind. I did just identify a memory leak
though where it fails to free the result of ztr_find_chunks() - oops, I
thought I'd searched for all of those before.

The naive approach is to not decode the DBH ztr components at all, paste
DBH + DB together, and then decode the ZTR as a whole. This works and is
simple to implement, but it's slow. It's how I initially did things
though to test the early components.

A faster method is to decode only the parts of the ZTR file that are
complete and to remember where we failed to read in a full ZTR header.
This is performed in partial_decode_ztr() in srf.c:

    /* Load chunks */
    pos = mftell(mf);
    while (chunk = ztr_read_chunk_hdr(mf)) {
        chunk->data = (char *)xmalloc(chunk->dlength);
        if (chunk->dlength != mfread(chunk->data, 1, chunk->dlength, mf))
            break;

        ztr->nchunks++;
        ztr->chunk = (ztr_chunk_t *)xrealloc(ztr->chunk, ztr->nchunks *
                                             sizeof(ztr_chunk_t));
        memcpy(&ztr->chunk[ztr->nchunks-1], chunk, sizeof(*chunk));
        xfree(chunk);
        pos = mftell(mf);
    }

    ...

    /* Ensure we exit at the start of a ztr CHUNK */
    mfseek(mf, pos, SEEK_SET);

Here 'pos' ends up at the SMP4 start point as it specifically rewinds
back to there when ztr_read_chunk_hdr() or the mfread of the chunk's
data fails.
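As a rough illustration of the "decode what is complete, remember where
we stopped" idea, here is a minimal sketch that works on a plain memory
buffer rather than an mFILE. It is not io_lib code: the function name
complete_chunk_prefix() is made up for this example, and the chunk
layout (4-byte type, 4-byte big-endian metadata length, metadata, 4-byte
big-endian data length, data) is an assumption taken from the hex dump
in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian 32-bit integer (assumed encoding of chunk lengths). */
static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Hypothetical helper, not part of io_lib.
 * Walks the chunks in buf[0..len) and returns the offset of the first
 * chunk that is not completely present, or len if every chunk is whole.
 * Assumed layout: type[4], mdlength[4], metadata, dlength[4], data.
 * The ZTR file header is assumed to have been consumed by the caller.
 */
size_t complete_chunk_prefix(const unsigned char *buf, size_t len,
                             size_t *nchunks)
{
    size_t pos = 0;     /* start of the chunk currently being examined */
    size_t count = 0;   /* number of complete chunks seen so far */

    assert(buf != NULL || len == 0);

    for (;;) {
        size_t p = pos;
        uint32_t mdlength, dlength;

        if (len - p < 8) break;          /* type or mdlength truncated */
        mdlength = be32(buf + p + 4);
        p += 8;
        if (len - p < mdlength) break;   /* metadata truncated */
        p += mdlength;
        if (len - p < 4) break;          /* dlength truncated */
        dlength = be32(buf + p);
        p += 4;
        if (len - p < dlength) break;    /* data truncated */
        p += dlength;

        pos = p;                         /* whole chunk seen: advance */
        count++;
    }

    if (nchunks)
        *nchunks = count;
    return pos;                          /* "rewind" point for the caller */
}
```

For a header blob that ends in SMP4\0\0, the returned offset points at
the 'S' of SMP4, which is where decoding should resume once the body
bytes have been appended.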
> - continue reading from SRF's FILE object (note this is a FILE not an
> mFILE) the next datablock fields (type, read flags, read id) ...
> - then take the buffer and append it to the existing mFILE's buffer
> which ends with SMP4\0\0 and the result is
> XXXX XXXX XXXX XXXX XXXX 534d 5034 000b XXXXXXXXXXSMP4....
> 4f46 4653 0032 3831 3837 0000 0002 494d OFFS.28187....IM
> 813a a001 1e70 c0bf 8f4a d9c7 f1d3 ....
>
> - then continue from where it's left which just happens to work
> correctly because the next chunk is the SMP4.

Correct, although as you hopefully now see it's not simply coincidence
that it happens to work.

> Now this explains the reading and even though I cannot explain the
> writing I bet something similar happened

The piece of writing code responsible for this sneaky byte saving is in
solexa2srf.c:

    mFILE *encode_ztr(ztr_t *ztr, int *footer, int no_hcodes) {
        mFILE *mf = mfcreate(NULL, 0);
        int pos, i;

        if (!no_hcodes)
            ztr_store_hcodes(ztr);
        reorder_ztr(ztr);
        ztr_mwrite_header(mf, &ztr->header);

        if (footer)
            *footer = 0;

        /* Write out chunks */
        for (i = 0; i < ztr->nchunks; i++) {
            pos = mftell(mf);
            ztr_mwrite_chunk(mf, &ztr->chunk[i]);
            if ((ztr->chunk[i].type == ZTR_TYPE_SMP4 ||
                 ztr->chunk[i].type == ZTR_TYPE_BASE) && footer && !*footer) {
                /* allows traces up to 64k */
                *footer = pos + 10 + ztr->chunk[i].mdlength;
            }
        }

        return mf;
    }

At one stage a version of this was using +6 as I realised that the meta
data varied in length (sometimes it has OFFS\011111\0 and other times it
may be OFFS\088888\0). I'm guessing that version produced the SRF file
we're looking at. I later backed that change out again to the code above
as it then dawned on me that I'd made a simple error and that between
tiles the meta-data may be different lengths, but it's within a tile
that matters and for that case it's always the same - hence the
meta-data can be put in the DBH too.
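To make the +6 versus +10+mdlength arithmetic concrete, here is a small
worked sketch. The helper names are invented for this example, and the
chunk header layout (4-byte type, 4-byte mdlength, metadata, 4-byte
dlength) is an assumption read off the hex dump quoted in this thread,
where the SMP4 metadata "OFFS\0" "28187\0" is 11 bytes long.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed field widths of a ZTR chunk header. */
enum { TYPE_LEN = 4, LEN_FIELD = 4 };

/*
 * Split point matching "pos + 10 + mdlength" in the quoted code:
 * type, mdlength, the metadata itself, and the two high bytes of
 * dlength all go into the DBH. Those two bytes are only constant if
 * every trace's data is under 64k, hence the comment in the original.
 */
size_t footer_with_metadata(size_t pos, size_t mdlength)
{
    return pos + TYPE_LEN + LEN_FIELD + mdlength + 2;
}

/* The "+6" variant: type plus the two high bytes of mdlength. */
size_t footer_short(size_t pos)
{
    return pos + TYPE_LEN + 2;
}

/* Number of chunk-header bytes each DB still has to carry itself. */
size_t header_bytes_in_body(size_t pos, size_t mdlength, size_t footer)
{
    size_t header_end = pos + TYPE_LEN + LEN_FIELD + mdlength + LEN_FIELD;

    assert(footer >= pos && footer <= header_end);
    return header_end - footer;
}
```

With mdlength = 11, the +6 split leaves 17 header bytes in every DB
(two bytes of mdlength, the metadata and dlength), which matches a DBH
ending in SMP4\0\0. The +10+mdlength split leaves only the two low
bytes of dlength.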
It's worth pointing out that this trick only saves the overhead of writing the ZTR chunk header for the first chunk in the DB "body" component. A more general approach of delayed data to detach ztr headers from data is being looked at for ZTR v2.0 (or maybe 1.4). This will help to reduce the overheads somewhat and may obviate the use of ending a DBH half way through a chunk header, although it's still not good to assume this to be the case. James PS. There's quite a lot of code in solexa2srf which I wrote purposely for turning solexa data into SRF, of course, but really should be moved from the progs directory into the library itself. Eg generic routines for reordering a ZTR file. One day maybe... -- James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova | Plurima gyrabant gymbolitare vabo; A Staden Package developer: | Et Borogovorum mimzebant undique formae, https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi. -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. 
From cgoina at jcvi.org Thu Jun 12 05:01:00 2008 From: cgoina at jcvi.org (Goina, Cristian) Date: Thu Jun 12 05:01:09 2008 Subject: [Ssrformat] RE: About SRF parsing In-Reply-To: <20080612085442.GI27919@sanger.ac.uk> References: <484E4950.1090703@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C3@EXCHANGE.TIGR.ORG> <484E82FC.8090703@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C6@EXCHANGE.TIGR.ORG> <484EA336.5080403@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C8@EXCHANGE.TIGR.ORG> <484FD8A9.1070301@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C9@EXCHANGE.TIGR.ORG> <20080611151525.GG27919@sanger.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138CC@EXCHANGE.TIGR.ORG> <20080612085442.GI27919@sanger.ac.uk> Message-ID: <3EC2264E4927524DB3F1C0425831CCA51138CD@EXCHANGE.TIGR.ORG> James I understand your points but I still think this file is not correct according to the spec. > > > this will read all the bytes that belong to SRF DBH's ZTR > which ends > > with SMP4\0\0 > > 000019f0: 6e61 6c79 7369 733e 0a00 534d 5034 0000 nalysis>..SMP4.. > > This is infact correct and intended. I agree > > The SRF specification does not state that the DBH should end > on an exact ZTR chunk boundary. About all it states is that > when DBH and DB are pasted together the final string should > be a valid ZTR file. I wrote about this in the SRF Rationale, > but I'm not sure if this ever made it into the official SRF > specification. (Quite possibly we haven't had a new official > release since then.) > It's true the spec doesn't say that the DBH header blob should be a valid ZTR but in some sense in the way the description of each block it is structured implies some linearity. I don't think it is right to spread ZTR blob bytes across consecutive blobs. Moreover the case here is that 6 bytes from DBH header blob belong to the next SRF block and this is definitely wrong. 
The spec doesn't say that some of the SRF data header blob bytes may be part of the next following SRF data blob either > > - because extracting SMP4 chunk fails silently due to > malloc problems > > mFILE's pointer is rolled back and what's left in it is: > > > > 000019f0: XXXX XXXX XXXX XXXX XXXX 534d 5034 0000 XXXXXXXXXXSMP4.. > > (the Xs are already consumed so mFILE points to SMP4\0\0 > > Are you sure it actually fails with malloc problems? I can't > see any in io_lib's srf2fasta either by manually stepping > through the appropriate part of the code or in valgrind. I > did just identify a memory leak though where it fails to free > the result of > ztr_find_chunks() - oops, I thought I'd searched for all of > those before. I was wrong I thought it failed because of memory problems because I saw it failed at an unusual point but I think it just reached the end of the corresponding virtual file (mFILE) > > The naive approach is not decode the DBH ztr components at > all and paste DBH + DB together and then decode the ZTR as a > whole. This works and is simple to implement, but it's slow. > It's how I initially did things though to test the early components. These files are way too big and concatenating in memory is not feasible. With smaller DB let's say this might work but somehow that non linear solution still doesn't appeal to me. I think it just asks for trouble. You're approach practically uses two file pointers one for the SRF blocks and one for ZTR blobs that are part of those SRF blocks. > > A faster method is to decode only the parts of the ZTR file > that are complete and to remember where we failed to read in > a full ZTR header. This is performed in partial_decode_ztr() in srf.c: Then why not separate all ZTR chunks completely from the SRF blocks and instead in the SRF blocks you only keep a reference (file offset) to the corresponding ZTR chunk or chunks. 
This kind of approach is similar to data normalization in relational databases - now we store the entire ZTR(s) - let's change it to store only a reference to those ZTR(s) and keep the ZTR(s) separate - this is a non linear solution that can still keep things clearly defined. (After I wrote this paragraph I realized that you are envisioning something similar in one of your last comment) > > Correct, although as you hopefully now see it's not simply > coincidence that it happens to work. This doesn't mean that I agree with the solution :-) - I like it it's an elegant solution for dealing with this kind of problems but for all practical purposes it only smells trouble - I am talking about spreading ZTR bytes from the datablock in the DBH. > > > Now this explains the reading and even though I cannot explain the > > writing I bet something similar happened > > The piece of writing code responsible for this sneaky byte > saving is in solexa2srf.c: > > mFILE *encode_ztr(ztr_t *ztr, int *footer, int no_hcodes) { > mFILE *mf = mfcreate(NULL, 0); > int pos, i; > > if (!no_hcodes) > ztr_store_hcodes(ztr); > reorder_ztr(ztr); > ztr_mwrite_header(mf, &ztr->header); > > if (footer) > *footer = 0; > > /* Write out chunks */ > for (i = 0; i < ztr->nchunks; i++) { > pos = mftell(mf); > ztr_mwrite_chunk(mf, &ztr->chunk[i]); > if ((ztr->chunk[i].type == ZTR_TYPE_SMP4 || > ztr->chunk[i].type == ZTR_TYPE_BASE) && footer > && !*footer) { > /* allows traces up to 64k */ > *footer = pos + 10 + ztr->chunk[i].mdlength; > } > } > > return mf; > } > > At one stage a version of this was using +6 as I realised > that the meta data varied in length (sometimes it has > OFFS\011111\0 and other times it may be OFFS\088888\0). I'm > guessing that version produced the SRF file we're looking at. 
> I later backed that change out again to the code above as it > then dawned on me that I'd made a simple error and that > between tiles the meta-data may be different lengths, but > it's within a tile that matters and for that case it's always > the same - hence the meta-data can be put in the DBH too. I'm not quite following this - has this problem actually been fixed already or it's still possible that some portion of the first's block metadata will be in the DBH > > It's worth pointing out that this trick only saves the > overhead of writing the ZTR chunk header for the first chunk > in the DB "body" > component. A more general approach of delayed data to detach > ztr headers from data is being looked at for ZTR v2.0 (or > maybe 1.4). This will help to reduce the overheads somewhat > and may obviate the use of ending a DBH half way through a > chunk header, although it's still not good to assume this to > be the case. James do you mean SRF v2.0 or ZTR v2.0 - I just realized that here you opt for the same thing of separating data (ZTR chunks) from headers (SRF block) > > James > > PS. There's quite a lot of code in solexa2srf which I wrote > purposely for turning solexa data into SRF, of course, but > really should be moved from the progs directory into the > library itself. Eg generic routines for reordering a ZTR > file. One day maybe... 
> Cristian From cgoina at jcvi.org Thu Jun 12 05:42:15 2008 From: cgoina at jcvi.org (Goina, Cristian) Date: Thu Jun 12 05:42:21 2008 Subject: [Ssrformat] Mate pairs in SRF In-Reply-To: <20080612080828.GH27919@sanger.ac.uk> References: <3EC2264E4927524DB3F1C0425831CCA51138BF@EXCHANGE.TIGR.ORG> <20080609153932.GA27919@sanger.ac.uk> <86C6E520C12E52429ACBCB01546DF4D3016797FE@xchange1.phage.bcgsc.ca> <3EC2264E4927524DB3F1C0425831CCA51138C4@EXCHANGE.TIGR.ORG> <86C6E520C12E52429ACBCB01546DF4D301679801@xchange1.phage.bcgsc.ca> <20080612080828.GH27919@sanger.ac.uk> Message-ID: <3EC2264E4927524DB3F1C0425831CCA51138CE@EXCHANGE.TIGR.ORG> Once I have something working for mate-pairs I will try this with one run here and I'll see what numbers I come up with Cristian > -----Original Message----- > From: James Bonfield [mailto:jkb@sanger.ac.uk] > Sent: Thursday, June 12, 2008 4:08 AM > To: Asim Siddiqui > Cc: Goina, Cristian; ssrformat@bcgsc.ca > Subject: Re: [Ssrformat] Mate pairs in SRF > > Hello Asim, > > > Jon and I spoke and we recommend padding the BASE (and CNF1 and > > SAMP) chunk missing mate instead. > > Padding is going to pop up anyway (see next paragraph) and with RLE > > compression requires less space than an additionl REGN block. > > I had another idea last night. What percentage of sequences > end up with just one end? If it's just a handful then padding > is maybe a sensible solution. > > However I thought perhaps we could generate two DBH blocks > per panel instead of one. The first consists of all reads > with both ends present and the second with the reads only > having one end. This means we don't need any fancy codes to > indicate the end isn't present or add fake data to pad it > out. The overheads would be minimal too. > > James > > -- > James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. 
Nunc > et Slythia Tova > | Plurima gyrabant gymbolitare vabo; > A Staden Package developer: | Et Borogovorum mimzebant > undique formae, > https://sf.net/projects/staden/ | Momiferique omnes > exgrabure Rathi. > > > -- > The Wellcome Trust Sanger Institute is operated by Genome Research > Limited, a charity registered in England with number 1021457 and a > company registered in England with number 2742969, whose registered > office is 215 Euston Road, London, NW1 2BE. > From jkb at sanger.ac.uk Thu Jun 12 06:38:24 2008 From: jkb at sanger.ac.uk (James Bonfield) Date: Thu Jun 12 06:38:29 2008 Subject: [Ssrformat] Re: About SRF parsing In-Reply-To: <3EC2264E4927524DB3F1C0425831CCA51138CD@EXCHANGE.TIGR.ORG> References: <484E82FC.8090703@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C6@EXCHANGE.TIGR.ORG> <484EA336.5080403@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C8@EXCHANGE.TIGR.ORG> <484FD8A9.1070301@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C9@EXCHANGE.TIGR.ORG> <20080611151525.GG27919@sanger.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138CC@EXCHANGE.TIGR.ORG> <20080612085442.GI27919@sanger.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138CD@EXCHANGE.TIGR.ORG> Message-ID: <20080612133823.GK27919@sanger.ac.uk> On Thu, Jun 12, 2008 at 08:01:00AM -0400, Goina, Cristian wrote: > It's true the spec doesn't say that the DBH header blob should be a > valid ZTR but in some sense in the way the description of each block it > is structured implies some linearity. Agreed on the implications. The spec doesn't explicitly allow for the strange case of splitting part way through a chunk header, but it's too easy to assume that means it disallows it. That wasn't the intention though and it was one of the main reasons I wrote the rationale section. > I don't think it is right to > spread ZTR blob bytes across consecutive blobs. Moreover the case here > is that 6 bytes from DBH header blob belong to the next SRF block and > this is definitely wrong. 
The spec doesn't say that some of the SRF data > header blob bytes may be part of the next following SRF data blob either If we wrote out each individual trace as a separate ZTR file on disk then we'd observe that, for example, the first 150 bytes of every file are the same. So it's a logical conclusion from there to saying we'll split our files into a header + body with the common header component (150 bytes) being written out once followed by all the bodies (file length - 150 bytes). That's at the byte level though. If you work in ZTR chunks then you may see that the ZTR header and first two chunks take you to file offset 140 say and the next chunk differs, despite the first 10 bytes of it being the same. It's not so efficient though and we're needlessly storing extra data over and over again. Furthermore I figured that as SRF is strictly a container format (like tar or zip) then it should not have explicit knowledge of the file format of the items held within it (ZTR) and instead we should just treat them as a stream of bytes. This is flexible as it means we could put SCF traces in there, ZTR v2, CTF, or whatever we end up with down the line. > I was wrong I thought it failed because of memory problems because I saw > it failed at an unusual point but I think it just reached the end of the > corresponding virtual file (mFILE) Correct, it hits a virtual EOF and returns in error. > > The naive approach is not decode the DBH ztr components at > > all and paste DBH + DB together and then decode the ZTR as a > > whole. This works and is simple to implement, but it's slow. > > It's how I initially did things though to test the early components. > > These files are way too big and concatenating in memory is not > feasible. With smaller DB let's say this might work but somehow that non > linear solution still doesn't appeal to me. I think it just asks for > trouble. 
You're approach practically uses two file pointers one for the
> SRF blocks and one for ZTR blobs that are part of those SRF blocks.

I don't follow this. We don't have one large DB but rather lots of tiny
DBs. I.e. the individual ZTR files are represented by a single DBH plus
any single DB - combined they're the same size as a single ZTR file
holding a single trace, which is tiny.

So logically speaking, to convert an entire SRF file to fasta we'd do:

    while (B = next SRF block)
        if B.type is Data Block Header
            H_blob = B.blob
        end if

        if B.type is Data Block (aka Body)
            ztr_file = H_blob + B.blob (string concatenation)
            decode ztr_file
            convert and print to fasta
        end if
    end while

I.e. the body is attached to the most recently seen header. We just
remember the last header we observed and repeatedly append differing
bodies to it to generate the complete ZTR file before decoding.

Now the inefficiencies with this are that we end up repeatedly decoding
the ZTR chunks that are held within the header blob, so that's where the
partial decoding functions come from. However that's simply an
implementation detail and doesn't impact on the file format at all.

> Then why not separate all ZTR chunks completely from the SRF blocks and
> instead in the SRF blocks you only keep a reference (file offset) to the
> corresponding ZTR chunk or chunks. This kind of approach is similar to
> data normalization in relational databases - now we store the entire
> ZTR(s) - let's change it to store only a reference to those ZTR(s) and
> keep the ZTR(s) separate - this is a non linear solution that can still
> keep things clearly defined.

Firstly we wanted a format amenable to streaming, so it could be
generated within a UNIX pipeline or you could be decoding it in parallel
with downloading it. This also helps to keep memory usage minimal.

Secondly having to access two different locations within a file is more
disk I/O and while transfer speeds are improving seek times can still be
very expensive.
By separating a trace into the common and variable sections we implicitly need two locations, but we can specify the order things occur in to remove the need for seeks (ie by caching the most recent DBH). For random access (eg looking up a trace by name in the SRF index) we naturally need to access two locations (header + body) but there's not much we can do about that. > This doesn't mean that I agree with the solution :-) - I like it it's an > elegant solution for dealing with this kind of problems but for all > practical purposes it only smells trouble - I am talking about spreading > ZTR bytes from the datablock in the DBH. Agreed it's maybe more logical to break the header and body apart at a ZTR chunk boundary, but I just couldn't pass up such an easy opportunity to save a few more bytes per trace given there's nothing in the SRF spec that has explicit knowledge of ZTR file format. It's still only a couple of percent saved though. > > At one stage a version of this was using +6 as I realised > > that the meta data varied in length (sometimes it has > > OFFS\011111\0 and other times it may be OFFS\088888\0). I'm > > guessing that version produced the SRF file we're looking at. > > I later backed that change out again to the code above as it > > then dawned on me that I'd made a simple error and that > > between tiles the meta-data may be different lengths, but > > it's within a tile that matters and for that case it's always > > the same - hence the meta-data can be put in the DBH too. > > I'm not quite following this - has this problem actually been fixed > already or it's still possible that some portion of the first's block > metadata will be in the DBH The "problem" still exists, but arguably more so now as an even larger portion of an incomplete ZTR chunk is in the data block header. 
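For anyone wanting to try the "cache the most recent DBH and append each
DB to it" loop, here is a minimal sketch in C. It is not io_lib code:
the block_type values, struct block, for_each_trace() and the
tally_trace() callback are all made up for illustration, and real SRF
block parsing (type bytes, lengths, read ids) is assumed to have been
done already by the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative block kinds; not the on-disk SRF type codes. */
enum block_type { BLOCK_DBH, BLOCK_DB, BLOCK_OTHER };

struct block {
    enum block_type type;
    const unsigned char *blob;   /* payload bytes, treated as opaque */
    size_t len;
};

typedef void (*trace_fn)(const unsigned char *ztr, size_t len, void *arg);

/*
 * Calls fn once per DB with the bytes of (most recent DBH + DB).
 * Returns the number of traces handed to fn, or -1 if a DB appears
 * before any DBH or if memory runs out.
 */
int for_each_trace(const struct block *blocks, size_t nblocks,
                   trace_fn fn, void *arg)
{
    const struct block *hdr = NULL;   /* cached header, no seeking needed */
    int ntraces = 0;
    size_t i;

    assert(fn != NULL);

    for (i = 0; i < nblocks; i++) {
        const struct block *b = &blocks[i];

        if (b->type == BLOCK_DBH) {
            hdr = b;
        } else if (b->type == BLOCK_DB) {
            unsigned char *ztr;

            if (hdr == NULL)
                return -1;
            ztr = malloc(hdr->len + b->len + 1);   /* +1 avoids malloc(0) */
            if (ztr == NULL)
                return -1;
            memcpy(ztr, hdr->blob, hdr->len);
            memcpy(ztr + hdr->len, b->blob, b->len);
            fn(ztr, hdr->len + b->len, arg);       /* e.g. decode + print */
            free(ztr);
            ntraces++;
        }
    }
    return ntraces;
}

/* Example callback: counts traces and keeps a copy of the last one. */
struct tally {
    size_t traces, bytes;
    unsigned char last[64];
    size_t last_len;
};

void tally_trace(const unsigned char *ztr, size_t len, void *arg)
{
    struct tally *t = arg;

    t->traces++;
    t->bytes += len;
    t->last_len = len < sizeof t->last ? len : sizeof t->last;
    memcpy(t->last, ztr, t->last_len);
}
```

Because the split point is never inspected, this loop works the same
whether the DBH ends on a chunk boundary or part way through a chunk
header. The cost is that the header's chunks are decoded again for every
trace.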
> James do you mean SRF v2.0 or ZTR v2.0 - I just realized that here you > opt for the same thing of separating data (ZTR chunks) from headers (SRF > block) We have ideas for ZTR v2, but I don't think this yet requires a major version bump of SRF. I'm still unsure if we even want to increase the major version number of ZTR yet either, but that depends on whether we break backwards compatibility. (There's plenty of ghastly cruft in there we should remove and a few inconsistencies to tidy up, which would entail breaking compatibility, but it's maybe easier to simply add our changes incrementally in a compatible fashion and just live with old cruft.) James -- James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova | Plurima gyrabant gymbolitare vabo; A Staden Package developer: | Et Borogovorum mimzebant undique formae, https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi. -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. 
From cgoina at jcvi.org Thu Jun 12 09:22:50 2008 From: cgoina at jcvi.org (Goina, Cristian) Date: Thu Jun 12 09:23:22 2008 Subject: [Ssrformat] RE: About SRF parsing In-Reply-To: <20080612133823.GK27919@sanger.ac.uk> References: <484E82FC.8090703@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C6@EXCHANGE.TIGR.ORG> <484EA336.5080403@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C8@EXCHANGE.TIGR.ORG> <484FD8A9.1070301@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C9@EXCHANGE.TIGR.ORG> <20080611151525.GG27919@sanger.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138CC@EXCHANGE.TIGR.ORG> <20080612085442.GI27919@sanger.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138CD@EXCHANGE.TIGR.ORG> <20080612133823.GK27919@sanger.ac.uk> Message-ID: <3EC2264E4927524DB3F1C0425831CCA51138CF@EXCHANGE.TIGR.ORG> I checked the documentation and it's true it does not say the blobs must be complete ZTR chunks - I assumed so whereas you took the other path - but then it should probably mention that the ZTR blob MAY span across the DBH blob and DB blob and mention the algorithm you mention in order to obtain valid data. > while (B = next SRF block) > if B.type is Data Block Header > H_blob = B.blob > end if > > if B.type is Data Block (aka Body) > ztr_file = H_blob + B.blob (string concatenation) > decode ztr_file > convert and print to fasta > end if > end while That can also be left up to the vendor. As you said it can save up to 1 to 2% and if you think it's worth I will try to adapt. I still don't like the solution because this little undocumented detail will soon be forgotten and it can create a lot of headache and a lot of debugging cycles for SRF tool developers who will be trying to figure out why their tools work with some SRFs but not others. 
My two cents Cristian From jkb at sanger.ac.uk Thu Jun 12 09:31:16 2008 From: jkb at sanger.ac.uk (James Bonfield) Date: Thu Jun 12 09:31:21 2008 Subject: [Ssrformat] Re: About SRF parsing In-Reply-To: <3EC2264E4927524DB3F1C0425831CCA51138CF@EXCHANGE.TIGR.ORG> References: <484EA336.5080403@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C8@EXCHANGE.TIGR.ORG> <484FD8A9.1070301@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C9@EXCHANGE.TIGR.ORG> <20080611151525.GG27919@sanger.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138CC@EXCHANGE.TIGR.ORG> <20080612085442.GI27919@sanger.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138CD@EXCHANGE.TIGR.ORG> <20080612133823.GK27919@sanger.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138CF@EXCHANGE.TIGR.ORG> Message-ID: <20080612163115.GM27919@sanger.ac.uk> On Thu, Jun 12, 2008 at 12:22:50PM -0400, Goina, Cristian wrote: > That can also be left up to the vendor. As you said it can save up to 1 > to 2% and if you think it's worth I will try to adapt. I still don't > like the solution because this little undocumented detail will soon be > forgotten and it can create a lot of headache and a lot of debugging > cycles for SRF tool developers who will be trying to figure out why > their tools work with some SRFs but not others. 1-2% isn't much I guess, but it seemed a freebie as essentially the way I'd written the code (somewhat fortuitously I'll admit) meant there was minimal changes required to squeeze out that little bit more. I just rechecked the SRF rationale and realised I hadn't explained as fully as I thought the ZTR chunk header issues. At the very least if we wish to keep the ability to split part way through a ZTR chunk then it should be explicitly mentioned that this may be the case so people are aware of the need to code for this happening. James -- James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. 
Nunc et Slythia Tova | Plurima gyrabant gymbolitare vabo; A Staden Package developer: | Et Borogovorum mimzebant undique formae, https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi. -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From rodarmer at ncbi.nlm.nih.gov Thu Jun 12 09:42:20 2008 From: rodarmer at ncbi.nlm.nih.gov (Kurt Rodarmer) Date: Thu Jun 12 09:43:23 2008 Subject: [Ssrformat] Re: About SRF parsing In-Reply-To: <20080612133823.GK27919@sanger.ac.uk> References: <484E82FC.8090703@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C6@EXCHANGE.TIGR.ORG> <484EA336.5080403@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C8@EXCHANGE.TIGR.ORG> <484FD8A9.1070301@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C9@EXCHANGE.TIGR.ORG> <20080611151525.GG27919@sanger.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138CC@EXCHANGE.TIGR.ORG> <20080612085442.GI27919@sanger.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138CD@EXCHANGE.TIGR.ORG> <20080612133823.GK27919@sanger.ac.uk> Message-ID: <7.0.0.16.1.20080612121610.01957578@ncbi.nlm.nih.gov> One issue is the splitting of a ZTR chunk. Any chunked format can have its chunks split by another transport layer, provided that the latter is transparent and reassembles its payload before the former is interpreted. There is an implication stemming from discussion and existing implementation that this is the relationship between SRF and ZTR. However the spec would seem to imply the opposite; at least it did to me, and apparently to Christian as well. The current io_lib implementation, furthermore, interprets ZTR chunks before SRF has been removed, and introduces the need of allowing for corrupt chunks within the parser. 
I am unaware of any chunked format that allows for "partial chunks", since the sole purpose of a chunk is to be/remain integral. This introduces the burden of maintaining state within the parser and delaying/accumulating errors in case they turn out to be intentionally introduced. The larger issue is that the spec does not yet specify expected behavior in this regard, and this should probably be addressed in advance of any coding changes. Kurt At 09:38 AM 6/12/2008, James Bonfield wrote: >On Thu, Jun 12, 2008 at 08:01:00AM -0400, Goina, Cristian wrote: > > It's true the spec doesn't say that the DBH header blob should be a > > valid ZTR but in some sense in the way the description of each block it > > is structured implies some linearity. > >Agreed on the implications. The spec doesn't explicitly allow for >the strange case of splitting part way through a chunk header, but >it's too easy to assume that means it disallows it. That wasn't the >intention though and it was one of the main reasons I wrote the >rationale section. > > > I don't think it is right to > > spread ZTR blob bytes across consecutive blobs. Moreover the case here > > is that 6 bytes from DBH header blob belong to the next SRF block and > > this is definitely wrong. The spec doesn't say that some of the SRF data > > header blob bytes may be part of the next following SRF data blob either > >If we wrote out each individual trace as a separate ZTR file on disk >then we'd observe that, for example, the first 150 bytes of every file >are the same. So it's a logical conclusion from there to saying we'll >split our files into a header + body with the common header component >(150 bytes) being written out once followed by all the bodies (file >length - 150 bytes). > >That's at the byte level though. If you work in ZTR chunks then you >may see that the ZTR header and first two chunks take you to file >offset 140 say and the next chunk differs, despite the first 10 bytes >of it being the same. 
It's not so efficient though and we're
>needlessly storing extra data over and over again.
>
>Furthermore I figured that as SRF is strictly a container format (like
>tar or zip) then it should not have explicit knowledge of the file
>format of the items held within it (ZTR) and instead we should just
>treat them as a stream of bytes. This is flexible as it means we could
>put SCF traces in there, ZTR v2, CTF, or whatever we end up with down
>the line.
>
> > I was wrong I thought it failed because of memory problems because I saw
> > it failed at an unusual point but I think it just reached the end of the
> > corresponding virtual file (mFILE)
>
>Correct, it hits a virtual EOF and returns in error.
>
> > > The naive approach is to not decode the DBH ztr components at
> > > all and paste DBH + DB together and then decode the ZTR as a
> > > whole. This works and is simple to implement, but it's slow.
> > > It's how I initially did things though to test the early components.
> >
> > These files are way too big and concatenating in memory is not
> > feasible. With smaller DB let's say this might work but somehow that non
> > linear solution still doesn't appeal to me. I think it just asks for
> > trouble. Your approach practically uses two file pointers one for the
> > SRF blocks and one for ZTR blobs that are part of those SRF blocks.
>
>I don't follow this. We don't have one large DB but rather lots of
>tiny DBs. Ie the individual ZTR files are represented by a single DBH
>plus any single DB - combined they're the same size as a single ZTR
>file holding a single trace, which is tiny.
>
>So logically speaking converting an entire SRF file to fasta we'd do:
>
>while (B = next SRF block)
>    if B.type is Data Block Header
>        H_blob = B.blob
>    end if
>
>    if B.type is Data Block (aka Body)
>        ztr_file = H_blob + B.blob (string concatenation)
>        decode ztr_file
>        convert and print to fasta
>    end if
>end while
>
>Ie the body is attached to the most recently seen header.
We just >remember the last header we observed and repeatedly append differing >bodies to it to generate the complete ZTR file before decoding. > >Now the inefficiencies with this are that we end up repeatedly >decoding the ZTR chunks that are head within the header blob, so >that's where the partial decoding functions come from. However that's >simply an implementation detail and doesn't impact on the file format >at all. > > > > Then why not separate all ZTR chunks completely from the SRF blocks and > > instead in the SRF blocks you only keep a reference (file offset) to the > > corresponding ZTR chunk or chunks. This kind of approach is similar to > > data normalization in relational databases - now we store the entire > > ZTR(s) - let's change it to store only a reference to those ZTR(s) and > > keep the ZTR(s) separate - this is a non linear solution that can still > > keep things clearly defined. > >Firstly we wanted a format amenable to streaming, so it could be >generated within a UNIX pipeline or you could be decoding it in >parallell with downloading it. This also helps to keep memory usage >minimal. > >Secondly having to access two different locations within a file is >more disk I/O and while transfer speeds are improving seek times can >still be very expensive. By separating a trace into the common and >variable sections we implicitly need two locations, but we can specify >the order things occur in to remove the need for seeks (ie by caching >the most recent DBH). For random access (eg looking up a trace by >name in the SRF index) we naturally need to access two locations >(header + body) but there's not much we can do about that. > > > This doesn't mean that I agree with the solution :-) - I like it it's an > > elegant solution for dealing with this kind of problems but for all > > practical purposes it only smells trouble - I am talking about spreading > > ZTR bytes from the datablock in the DBH. 
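
[Editorial sketch, not part of the original message.] The "partial decoding" idea discussed in this thread can be illustrated roughly as follows: decode every complete ZTR chunk in the Data Block Header blob once, keep the bytes of any trailing incomplete chunk, and finish the job when each Data Block body arrives. This is a minimal sketch and not io_lib's code. It assumes the usual ZTR chunk layout (4-byte type, 4-byte big-endian metadata length, metadata, 4-byte big-endian data length, data) and ignores the ZTR file header and magic number. The names try_chunk, decode_complete_chunks, make_chunk and CachedHeader are hypothetical.

```python
import struct


def try_chunk(buf, pos):
    """Return ((type, meta, data), end) or (None, pos) if the chunk is incomplete.

    Assumed layout: type[4], meta_len (big-endian uint32), meta, data_len
    (big-endian uint32), data.
    """
    if len(buf) - pos < 8:
        return None, pos
    ctype = buf[pos:pos + 4]
    (meta_len,) = struct.unpack_from(">I", buf, pos + 4)
    data_len_at = pos + 8 + meta_len
    if len(buf) < data_len_at + 4:
        return None, pos
    (data_len,) = struct.unpack_from(">I", buf, data_len_at)
    end = data_len_at + 4 + data_len
    if len(buf) < end:
        return None, pos
    return (ctype, buf[pos + 8:data_len_at], buf[data_len_at + 4:end]), end


def decode_complete_chunks(buf):
    """Decode all complete chunks; return (chunks, undecoded remainder)."""
    chunks, pos = [], 0
    while True:
        chunk, end = try_chunk(buf, pos)
        if chunk is None:
            return chunks, buf[pos:]
        chunks.append(chunk)
        pos = end


def make_chunk(ctype, meta, data):
    """Encode one chunk in the assumed layout (helper for the demo only)."""
    return (ctype + struct.pack(">I", len(meta)) + meta
            + struct.pack(">I", len(data)) + data)


class CachedHeader:
    """Chunks decoded once from a header blob, plus any incomplete tail."""

    def __init__(self, header_blob):
        self.chunks, self.pending = decode_complete_chunks(header_blob)

    def decode_trace(self, body_blob):
        # The delayed bytes are completed by the start of the body blob.
        more, leftover = decode_complete_chunks(self.pending + body_blob)
        if leftover:
            raise ValueError("trace ends part way through a chunk")
        return self.chunks + more


# Demo: the header/body junction falls 6 bytes into the second chunk.
c1 = make_chunk(b"TEXT", b"", b"a\0b\0")
c2 = make_chunk(b"BASE", b"", b"\0ACGT")
hdr = CachedHeader(c1 + c2[:6])
trace = hdr.decode_trace(c2[6:])
```

The incomplete tail is never treated as an error in its own right; it only becomes one if the body fails to complete it, which matches the "delay anything you can't handle until the body has been read in" description.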
> >Agreed it's maybe more logical to break the header and body apart at a >ZTR chunk boundary, but I just couldn't pass up such an easy >opportunity to save a few more bytes per trace given there's nothing >in the SRF spec that has explicit knowledge of ZTR file format. It's >still only a couple of percent saved though. > > > > At one stage a version of this was using +6 as I realised > > > that the meta data varied in length (sometimes it has > > > OFFS\011111\0 and other times it may be OFFS\088888\0). I'm > > > guessing that version produced the SRF file we're looking at. > > > I later backed that change out again to the code above as it > > > then dawned on me that I'd made a simple error and that > > > between tiles the meta-data may be different lengths, but > > > it's within a tile that matters and for that case it's always > > > the same - hence the meta-data can be put in the DBH too. > > > > I'm not quite following this - has this problem actually been fixed > > already or it's still possible that some portion of the first's block > > metadata will be in the DBH > >The "problem" still exists, but arguably more so now as an even larger >portion of an incomplete ZTR chunk is in the data block header. > > > James do you mean SRF v2.0 or ZTR v2.0 - I just realized that here you > > opt for the same thing of separating data (ZTR chunks) from headers (SRF > > block) > >We have ideas for ZTR v2, but I don't think this yet requires a major >version bump of SRF. I'm still unsure if we even want to increase the >major version number of ZTR yet either, but that depends on whether we >break backwards compatibility. (There's plenty of ghastly cruft in >there we should remove and a few inconsistencies to tidy up, which >would entail breaking compatibility, but it's maybe easier to simply >add our changes incrementally in a compatible fashion and just live >with old cruft.) > >James > >-- >James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. 
Nunc et Slythia Tova > | Plurima gyrabant gymbolitare vabo; > A Staden Package developer: | Et Borogovorum mimzebant undique formae, >https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi. > > >-- > The Wellcome Trust Sanger Institute is operated by Genome Research > Limited, a charity registered in England with number 1021457 and a > company registered in England with number 2742969, whose registered > office is 215 Euston Road, London, NW1 2BE. >_______________________________________________ >Ssrformat mailing list >Ssrformat@mail.bcgsc.ca >http://www.bcgsc.ca/mailman/listinfo/ssrformat From asims at bcgsc.ca Thu Jun 12 12:54:17 2008 From: asims at bcgsc.ca (Asim Siddiqui) Date: Thu Jun 12 12:56:54 2008 Subject: [Ssrformat] Mate pairs in SRF References: <3EC2264E4927524DB3F1C0425831CCA51138BF@EXCHANGE.TIGR.ORG> <20080609153932.GA27919@sanger.ac.uk> <86C6E520C12E52429ACBCB01546DF4D3016797FE@xchange1.phage.bcgsc.ca> <3EC2264E4927524DB3F1C0425831CCA51138C4@EXCHANGE.TIGR.ORG> <86C6E520C12E52429ACBCB01546DF4D301679801@xchange1.phage.bcgsc.ca> <20080612080828.GH27919@sanger.ac.uk> Message-ID: <86C6E520C12E52429ACBCB01546DF4D301679802@xchange1.phage.bcgsc.ca> I thought about that (2 DBH block per panel too), but since the data is all mixed up in the SOLiD files, its a pain to code. The spec supports both implementations. Asim ________________________________ From: James Bonfield [mailto:jkb@sanger.ac.uk] Sent: Thu 12/06/2008 1:08 AM To: Asim Siddiqui Cc: Goina, Cristian; ssrformat@bcgsc.ca Subject: Re: [Ssrformat] Mate pairs in SRF Hello Asim, > Jon and I spoke and we recommend padding the BASE (and CNF1 and > SAMP) chunk missing mate instead. > Padding is going to pop up anyway (see next paragraph) and with RLE > compression requires less space than an additionl REGN block. I had another idea last night. What percentage of sequences end up with just one end? If it's just a handful then padding is maybe a sensible solution. 
However I thought perhaps we could generate two DBH blocks per panel instead of one. The first consists of all reads with both ends present and the second with the reads only having one end. This means we don't need any fancy codes to indicate the end isn't present or add fake data to pad it out. The overheads would be minimal too. James -- James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova | Plurima gyrabant gymbolitare vabo; A Staden Package developer: | Et Borogovorum mimzebant undique formae, https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi. -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080612/1015354b/attachment.htm From jkb at sanger.ac.uk Fri Jun 13 01:54:25 2008 From: jkb at sanger.ac.uk (James Bonfield) Date: Fri Jun 13 01:54:35 2008 Subject: [Ssrformat] Re: About SRF parsing In-Reply-To: <7.0.0.16.1.20080612121610.01957578@ncbi.nlm.nih.gov> References: <484EA336.5080403@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C8@EXCHANGE.TIGR.ORG> <484FD8A9.1070301@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C9@EXCHANGE.TIGR.ORG> <20080611151525.GG27919@sanger.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138CC@EXCHANGE.TIGR.ORG> <20080612085442.GI27919@sanger.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138CD@EXCHANGE.TIGR.ORG> <20080612133823.GK27919@sanger.ac.uk> <7.0.0.16.1.20080612121610.01957578@ncbi.nlm.nih.gov> Message-ID: <20080613085425.GN27919@sanger.ac.uk> On Thu, Jun 12, 2008 at 12:42:20PM -0400, Kurt Rodarmer wrote: > The current io_lib implementation, furthermore, interprets ZTR chunks > before SRF has been removed, and introduces the need of allowing for > corrupt chunks within 
the parser. I am unaware of any chunked format > that allows for "partial chunks", since the sole purpose of a chunk > is to be/remain integral. This introduces the burden of maintaining > state within the parser and delaying/accumulating errors in case they > turn out to be intentionally introduced. Calling the data "corrupt" is somewhat misleading. It's not corrupt, it's just (currently) incomplete. All io_lib is doing is decoding the complete chunks and delaying the incomplete one. Granted it doesn't know a chunk is incomplete until it tries and fails to decode it, but the logic is pretty straight forward - delay anything you can't handle until the body has been read in. Furthermore the original implementation of io_lib didn't do this and worked exactly as described in your initial (unquoted) paragraph - treating SRF as a transport layer, reassembling the packets and then decoding the final result. If you prefer to implement the parser in this way then that's fine, although it's slow. It is inefficient as it leads to repeated decoding of the same data. Keeping with a network-style "packet" analogy, io_lib is effectively therefore implementing a caching system where by it's already decoded some of the chunks in this packet as they're the same as the ones in the previous packet. The saving is quite considerable. I've no idea if real network protocols employ the same tricks, but I would imagine so. The purest public view of what goes on is written up in nice tidy standards, but the actual nuts and bolts of fast implementations is most likely warty and ugly. > The larger issue is that the spec does not yet specify expected > behavior in this regard, and this should probably be addressed in > advance of any coding changes. 100% agree. 
If people feel the simplicity of mandating the header/body break occurs on a ZTR chunk boundary is worth the extra storage then fair enough - my own opinions however well known by now :-) James -- James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova | Plurima gyrabant gymbolitare vabo; A Staden Package developer: | Et Borogovorum mimzebant undique formae, https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi. -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From jkb at sanger.ac.uk Fri Jun 13 02:22:47 2008 From: jkb at sanger.ac.uk (James Bonfield) Date: Fri Jun 13 02:23:24 2008 Subject: [Ssrformat] Re: About SRF parsing In-Reply-To: <7.0.0.16.1.20080612121610.01957578@ncbi.nlm.nih.gov> References: <484EA336.5080403@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C8@EXCHANGE.TIGR.ORG> <484FD8A9.1070301@ebi.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138C9@EXCHANGE.TIGR.ORG> <20080611151525.GG27919@sanger.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138CC@EXCHANGE.TIGR.ORG> <20080612085442.GI27919@sanger.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138CD@EXCHANGE.TIGR.ORG> <20080612133823.GK27919@sanger.ac.uk> <7.0.0.16.1.20080612121610.01957578@ncbi.nlm.nih.gov> Message-ID: <20080613092247.GO27919@sanger.ac.uk> On Thu, Jun 12, 2008 at 12:42:20PM -0400, Kurt Rodarmer wrote: > The larger issue is that the spec does not yet specify expected > behavior in this regard, and this should probably be addressed in > advance of any coding changes. Having reread the spec it doesn't even seem to state that the header blob should be attached to the data block blob. I'll take a look at rewording this to make it clearer, and update the rationale too. James -- James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. 
Nunc et Slythia Tova
                                  | Plurima gyrabant gymbolitare vabo;
  A Staden Package developer:     | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi.

--
 The Wellcome Trust Sanger Institute is operated by Genome Research
 Limited, a charity registered in England with number 1021457 and a
 company registered in England with number 2742969, whose registered
 office is 215 Euston Road, London, NW1 2BE.

From jkb at sanger.ac.uk Fri Jun 13 04:39:35 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Fri Jun 13 04:39:42 2008
Subject: [Ssrformat] SRF v1.3.1 proposed changes
Message-ID: <20080613113935.GQ27919@sanger.ac.uk>

Hello all,

I've just re-read the SRF v1.3.1 specification in light of recent
discussions and have a few proposed changes to wording only. The format
is sound and doesn't need changing that I can see, so I propose we
produce v1.3.2 with descriptive changes only.

Suggestion 1: Background section

This includes the sentence "This document is preliminary and subject to
change". While it's clear it may change, I'd argue it's matured beyond
the preliminary stage and is now in production. My suggestion is to
delete the sentence.

Suggestion 2: 5 - Requirements

Oddly we have two RQ-3 requirements (initially pointed out by someone
else, but I forget whom). Suggestion: rename the first RQ-3 to be RQ-10
and shuffle it to the end of the list.

Suggestion 3: 6.1 - General

The description of "Data Blocks utilize the ZTR format (R-5) and
comprise a Read Header (RH) and a ZTR blob" is very misleading. I'd
start a new paragraph and state:

    The contents of the Data Block Header "headerBlob" and Data Block
    "dataBlob" when concatenated together yield a single trace file,
    currently only in ZTR format.

Suggestion 4: 6.1 - General

You have three figure 6.2s. We should relabel these to be 6.2, 6.3 and 6.4.
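
[Editorial sketch, not part of the original message.] The concatenation rule proposed in Suggestion 3 can be shown with a few lines of Python. This is only an illustration under stated assumptions: blocks are modelled as (kind, blob) tuples with the hypothetical tags "DBH" and "DB", and common_prefix_len and iter_traces are made-up helper names, not functions from io_lib or the specification. The split is done purely at the byte level, so the junction need not fall on a ZTR chunk boundary.

```python
def common_prefix_len(blobs):
    """Length of the longest byte prefix shared by every blob."""
    if not blobs:
        return 0
    shortest = min(len(b) for b in blobs)
    first = blobs[0]
    for i in range(shortest):
        if any(b[i] != first[i] for b in blobs):
            return i
    return shortest


def iter_traces(blocks):
    """Yield complete trace files from a stream of (kind, blob) blocks.

    Each Data Block body is appended to the most recently seen
    Data Block Header blob.
    """
    header = None
    for kind, blob in blocks:
        if kind == "DBH":
            header = blob
        elif kind == "DB":
            if header is None:
                raise ValueError("Data Block seen before any Data Block Header")
            yield header + blob


# Demo with stand-in bytes rather than real ZTR data.
traces = [b"ZTRhdrCHUNKAAAA", b"ZTRhdrCHUNKACCC", b"ZTRhdrCHUNKAGGG"]
n = common_prefix_len(traces)
blocks = [("DBH", traces[0][:n])] + [("DB", t[n:]) for t in traces]
rebuilt = list(iter_traces(blocks))
```

Only one header blob is held in memory at any time, which is what keeps the format streamable.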

Suggestion 5: 6.1 - General

Just above the middle fig6.2 there is the sentence "Note: If the DBH
blocks are in a separate file, they are not removed from the container
i.e. they are present in both files". Why?

Let us review why we provided support for splitting apart DBHs from the
main SRF file. It was for cases where we may wish to reorder the traces
within an SRF file such that the traces are in (exact or approximate)
order of assembly. This utilises streaming and/or disk caching to speed
up viewing of all traces within a specific display region of the
genome.

Doing so however breaks the implicit link of which DB belongs with
which DBH, hence the requirement for the index in this situation.
Secondly it means that the position of the DBHs in the main SRF file
becomes irrelevant. I still haven't quite understood why we want to
store DBHs therefore in another file (if anything I'd expect it to be
the reordered DBs that get put into the other file). If we do move DBHs
out, what is the purpose of still retaining them in the main SRF file
given that we will have needed to completely rewrite it to reorder the
DBs?

The only suggestion I have here is to try and figure out what problem
it is we're trying to address with this solution. I can't see it.

Suggestion 6: 6.1.1 - Strings

I know of cases where people implementing the specification initially
jumped to conclusions when they saw "char *" and interpreted it as a
nul-terminated C-style string. This is despite the section clearly
stating otherwise of course. Suggestion: use the word "string" in place
of "char *" in the 'Type' headings as the C syntax of "char *" is too
closely linked in people's minds with nul-terminated strings.

Suggestion 7: 6.5 - Data Block

Personally I'd prefer to rename this here and elsewhere to be Data
Block Body. I just dislike Data Block being a prefix of Data Block
Header as it may cause ambiguities - eg is Data Block Header just a
special type of Data Block?
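
[Editorial sketch, not part of the original message.] On Suggestion 6: the difference between a length-prefixed string and a nul-terminated one is easy to show in code. The sketch below assumes a one-byte length prefix followed by that many bytes and no terminator, which is my understanding of section 6.1.1 but is not quoted in this thread. The names read_pstring and write_pstring are hypothetical.

```python
import io


def write_pstring(stream, data):
    """Write a string as one length byte followed by the raw bytes."""
    if len(data) > 255:
        raise ValueError("string too long for a one-byte length prefix")
    stream.write(bytes([len(data)]) + data)


def read_pstring(stream):
    """Read one length-prefixed string; no nul terminator is expected."""
    prefix = stream.read(1)
    if len(prefix) != 1:
        raise EOFError("missing length byte")
    data = stream.read(prefix[0])
    if len(data) != prefix[0]:
        raise EOFError("string truncated")
    return data


# Demo: an embedded nul byte survives, since length comes from the prefix.
buf = io.BytesIO()
write_pstring(buf, b"run1\0lane2")
write_pstring(buf, b"")
buf.seek(0)
first = read_pstring(buf)
second = read_pstring(buf)
```

A reader that scanned for a nul byte instead would stop after "run1" and then misread everything that follows.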

Suggestion 8: 6.5.2 - ZTR Blob

This needs some substantial reworking. Right now it implies that the
dataBlob is a valid ZTR file (which it is not as it almost certainly
does not contain the ZTR header with the magic number) and there's no
description at all of how headerBlob comes in. I'd suggest:

    The dataBlob appended to headerBlob from the most recently
    preceding Data Block Header together produce the trace file in
    whatever format is specified in containerType in the Container
    Header[1]. Currently the only supported type is ZTR, although this
    may change in the future. Note that the junction between headerBlob
    and dataBlob may not necessarily occur on a ZTR chunk boundary.

    [1] Footnote: in cases of reordered files the index will need to be
    consulted to determine which Data Block Header is associated with
    this Data Block.

Suggestion 9: 6.6.1 - Hash Function

It's possible that the web page listed may vanish or the contents may
change. The source code for lookup3 is explicitly placed in the public
domain (which is one reason I chose it). Legally I doubt we can assume
the same is true for the descriptive content of the web page, but
there's not a decent description of the specific lookup3 algorithm
there anyway. Suggestion: write a pseudocode style description of the
lookup3 algorithm so we can include it in the specification itself.

Suggestion 10: Rationale.

I updated this somewhat after various queries about ZTR chunk
boundaries. Naturally it's open to amendment if the general consensus
is to add a requirement for SRF to enforce splitting headerBlob and
dataBlob on an exact chunk boundary (although ZTRv2.0 should make this
entire point moot), but for now it describes SRF as it is currently
used and/or abused. Rationale appended.

James

--
James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi.
Nunc et Slythia Tova | Plurima gyrabant gymbolitare vabo; A Staden Package developer: | Et Borogovorum mimzebant undique formae, https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi. -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. -------------- next part -------------- Rationale ========= This rationale describes some of the design decisions behind the SRF file format. In case of any errors in the rationale, for any conflicts between this section and the main SRF specification itself the SRF specification shall always take precedence. The new wave of sequencing machines sequence millions of DNA fragment at a time meaning that it is no longer viable to have one trace per file on disk. Historically several file formats have existed for storing multiple sequences and/or quality values in a single file (fasta, fastq). However the standard formats (SCF, ZTR, RCF) for the "trace" data itself - the raw measurements used by the base-callers - all support only one trace per file. Hence the introduction of a new trace archive file format: SRF. Basic elements -------------- There are two key elements to fulfill when storing these traces. 1) What format the actual trace data will take. 2) What format the archive will take that groups the traces together. The SRF format and hence this document is primarily concerned with the second of these two elements. An analogy makes this clearer: an earlier trace submission format to the NCBI and EBI trace archives is a tar file containing either gzipped SCF or ZTR traces. In that case "1)" above is SCF.gz or ZTR and "2)" above is the tar file format. Trace format ------------ The SRF container structure states which individual trace format shall be used. 
As of writing this only a single format is supported - ZTR, or more specifically ZTR version 1.3 and above. However provision has been made for additional formats to be added as required. ZTR was chosen because of its flexibility and the ability to extend it with additional types of data. However there is nothing specific to the rest of the SRF specification which could not be changed to support other future trace formats. See the ZTR specification for more information. The container format - why a new one? ------------------------------------- Evidently 'tar' was already in regular use as a container, so why not continue with it, or to use an alternative such as ZIP. Firstly our requirements state we need both a streamable format (RQ-2) and to support random access (RQ-3). This rules out tar due to it having no index, although an index could easily be added. Requirement RQ-3 states that data shall be stored in an efficient manner. In pursuit of this goal we determined that splitting individual trace files into multiple sections allows for a reduction in file size and this in turn rules out the majority of (if not all) other standard archive formats. Trace headers and body ---------------------- An important aspect of file size reduction comes from the observation that each individual trace file we wish to house within SRF may have some common data associated with it. Examples may include information on the program version numbers, matrix files used, the location of the forward and reverse fragments of a mate-pair, etc. The variable non-common component of the trace files will typically be sequence, quality values and the discrete "trace" amplitudes themselves. SRF exploits the hidden assumption that our trace file format allows us to move all the common invariant parts of a trace to the start of the byte stream that encodes that trace, which implies that a set of traces can be considered to consist of an invariant "common" header and a variable footer or body. 
Given this assumption SRF provides the ability to store the trace as a pair of octet streams (referred to as "blobs" in the specification) with the common portion residing in a "Data Block Header" and the remainder residing within a subsequent "Data Block". The expectation is that we have a few Data Block Headers and many Data Blocks. Both the Data Block Header and Data Block contain fixed structure meta-data (such as the name) as well as the octet stream for the trace file itself. The individual trace files may be regenerated by the concatenation of these two octet streams (Data Block Header followed by Data Block). Note that SRF does not dictate which components of a trace file (eg bases, trace samples, textual comments) belong in which octet stream, partly to keep flexibility and partly because we do not want to tie SRF down to being ZTR specific. This even implies that the boundary between the Data Block Header blob and the Data Block blob does not have to occur on a ZTR chunk boundary. An implementor of SRF may choose to store all data inside the Data Block "dataBlob" and none within the Data Block Header "headerBlob", but doing so is likely to be suboptimal use of storage space. Trace names ----------- Some concern was raised over the trace names becoming a significant portion of the SRF file size. To address this we utilise the Data Block Header and Data Block split once more to store a name prefix and a name suffix with the idea being that all traces within a single Data Block Header (such as from an Illumina Solexa "tile" or an ABI SOLiD "panel") will share a common name prefix. Typically this just leaves a small portion of the name to be stored per trace in the Data Block itself. Provision has also been made to store that suffix portion in a more compact binary format instead of just the printable ASCII component. Index ----- The SRF container as described above is a streamable format. 
We can sequentially read through the archive caching the most recently read Data Block Header in order to concatenate with each Data Block and construct the full binary trace in turn, along with the trace names. To satisfy RQ-3 however we need to have some way of jumping into a file at a specific point. We satisfy this requirement by providing the ability to attach an index to the SRF file, but note that this is not a mandatory feature. The basic construction of the index is an in-file hash table keyed on trace (or "read") name. Note that the trace names themselves could be a significant percentage of the file size if duplicated in the index, hence they are missing. This means looking up a key in the hash table may sometimes yield multiple trace matches. It is up to the SRF implementor to then read each potential trace match (even if it's the only one) in order to verify that the name matches the search term. Given that a trace file is housed in two separate locations on disk (the Data Block Header and the Data Block) we theoretically need two file offsets for each element in the index. However we can usually exploit the fact that we know a Data Block is associated to the preceding Data Block Header. So storing an array of Data Block Header offsets allows us to rapidly identify which Data Block Header to read for any given Data Block offset. The case where this implicit link between Data Block and Data Block Header is broken is where we wish to store these two elements in two distinct files. Block reordering ---------------- There may be a desire to reorder the traces within an SRF file. One such example is to order them to match the order of reads aligned within an assembly so that standard assembly operations (e.g. viewing a particular region or computing a consensus sequence) will access only a small region of the SRF file. 
The problem with this is that it is very unlikely that a set of sequences aligned within the same region of an assembly all share the same Data Block header; indeed it's likely they'll have random Data Block Headers. One solution is to split the two apart, with Data Block Headers in one file and Data Blocks in another, allowing for Data Blocks to be sorted by alignment position. However as noted above doing so breaks the implicit link between which Data Block and Data Block Header belong together. For this reason special support is made for this in the Index block by allowing for both file offsets to be stored along with the relevant file-names. From jkb at sanger.ac.uk Mon Jun 16 07:09:09 2008 From: jkb at sanger.ac.uk (James Bonfield) Date: Mon Jun 16 07:09:15 2008 Subject: [Ssrformat] multi-file format SRF Message-ID: <20080616140909.GY27919@sanger.ac.uk> Hello all, Does anybody here make use of the multi-file format of SRF? Please I'm looking for use cases (either actual or planned) as without them we're stumbling in the dark. I'm still not entirely sure of the purpose for the multi-file format. Let's think it through: right now we support having the index in its own file or attached to the container. I think that's maybe OK as the index is an optional component anyway. It doesn't really matter where it is. We also have support for DBHs being in their own file instead of or in addition to the main container. When this happens the index is a mandatory requirement. Now my understanding is that this was all brought about through the desire to be able to reorder SRF files such that the data is in, say, assembly order. It makes viewing faster and in some cases consensus calling too (eg 454). Is multiple files the best way to achieve it though? I've been trying to think of all the alternatives, so here's a quick brain dump of a few methods of handling DB reordering. A) As is described currently. 
We have the main srf container file with an XML block if present and
   DBs (and optionally duplicate copies of DBHs), plus a DBH file, plus
   an index. We reorder the DBs but keep the DBHs in their original
   ordering.

B) We have one single file. In it we just shuffle the DBHs to the start
   (first pass through) and then put all the DBs, in their new order,
   following those and ending it all with the index.

C) We have one single file with DBs always associated with the
   preceding DBHs - just like a standard SRF. However to achieve this
   we've attached all original DBH content back onto the DBs,
   reordered, and detached them again to form new DBH records. The
   final decoded trace content is the same, but the encoded form may
   differ.

I can see pros and cons for each.

Method A
--------

Pro: It's relatively simple.

Con: The index is now mandatory.

     It requires multiple files, meaning there's scope for cock-ups if
     we rename files. (Renaming files is possible, but hard as it
     requires editing of the index file.)

     The file is no longer streamable.

     Loading all traces into a trace archive may be complex. We'd need
     to consult the index and inevitably will end up jumping all over
     the place as iterating through DBHs means random access on DBs and
     conversely iterating through DBs means random access on DBHs. This
     is partially resolved by using two passes, the first of which
     loads all DBHs and the index into memory (re-sorted by DB
     position) and the second of which streams DBs through.

Method B
--------

Pro: One file solution.

     Relatively simple.

Con: The index is now mandatory.

     The file is no longer streamable.

     Loading all traces into a trace archive may be complex. (See A
     above)

Method C
--------

Pro: One file solution.

     Streamable still, hence fast for loading into a database or
     reading by an assembler.

     Simplifies the SRF specification as it no longer needs any
     multi-file support and we can claim reordering is simply someone
     else's job.

     Does not require the index.
Con: Complex to implement (not for SRF, but for whoever has the task of reordering the data). We could do this in multiple passes.

     1) Produce a new SRF file with one single empty DBH: all ZTR data is uncompressed and in the DBs (name prefixes too).
     2) Reorder DBs produced in step 1 as desired.
     3) Compress DBs in blocks (say 10,000) producing new DBHs once more every so often - as usual these include ZTR header, HUFF chunks, and so on.

     Possibly a small growth in output - eg not being able to use the name prefix in the DBH would add maybe 0.3 bytes per base.

     Recreating the original data after reordering is tricky. Would we want to? We still get the same decoded content, just not the same encoding.

Conclusion
----------

Given that both method B and C are possible to implement right now as a single SRF file without changes to the specification, is there still a need for multiple file support in SRF?

Both methods A and B break the implicit "DB follows DBH" rule, hence the requirement for an index so we do not lose the association.

Even in a multi-file scenario, the SRF specification allows for us to keep the DBHs present in the original file (stated in 27th Feb 2008 revision). Hence it is not possible to tell when reading an SRF file that has an index whether or not the file has been reordered. This in turn means that programs such as srf2fastq, srf2fasta, etc all need in theory to do all I/O via the index if one exists, just in case it happens to be a reordered file. This is rather cumbersome and will slow things down a lot.

So we need at least one of the following to be changed:

- Mandate that reordered files MUST have the index in a separate file from the container. It's somewhat arbitrary and seems like a hack.

- Add a flag to the container indicating that the data has been reordered and is no longer streamable. (Seems like a reasonable solution, but it requires changes to the file format.)

- Remove the multiple-file and reordering support completely.
  Any reordering performed is done outside of the SRF specification itself and as long as the result is still a valid streamable SRF file (ie method C above) then we're still good to go.

Comments anyone? My preference is for the third option as it simplifies the specification and, I think, requires zero code changes.

James

From cgoina at jcvi.org Wed Jun 18 13:13:07 2008
From: cgoina at jcvi.org (Goina, Cristian)
Date: Wed Jun 18 13:13:55 2008
Subject: [Ssrformat] new version for solid to srf in java
Message-ID: <3EC2264E4927524DB3F1C0425831CCA51138D6@EXCHANGE.TIGR.ORG>

Hello everybody,

I just checked in a new version for converting SOLiD data to SRF in Java. This new version has support for paired reads as well as support for partial chunks - the SOLiD to SRF implementation doesn't use partial chunks, but the code from the common module could be used to handle SRFs that use partial chunks.

Asim and James, I promised you some numbers regarding mate pairs - I can't really give you real numbers, but what I can tell you is that for the data I used for my testing the number of "mismatches" or reads without mates was relatively significant - close to 40%. The question is which is more efficient: to put a REGN chunk per data block body, or to pad. Keep in mind that you have to pad the scaled intensities and quality values as well.
I agree with you that two DBHs per panel is a pain - you have to build the data for the two DBHs in memory or in some temporary area and then put them in the SRF instead of streaming data to the SRF file as you read it. My approach for now is to have a REGN chunk in the DBH and, if the regions from the data block differ from the DBH, add a REGN in the DB itself with the actual boundaries.

Regards
Cristian Goina

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080618/a2172658/attachment.htm

From jkb at sanger.ac.uk Thu Jun 19 01:29:37 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Thu Jun 19 01:29:45 2008
Subject: [Ssrformat] Re: new version for solid to srf in java
In-Reply-To: <3EC2264E4927524DB3F1C0425831CCA51138D6@EXCHANGE.TIGR.ORG>
References: <3EC2264E4927524DB3F1C0425831CCA51138D6@EXCHANGE.TIGR.ORG>
Message-ID: <20080619082936.GK27919@sanger.ac.uk>

On Wed, Jun 18, 2008 at 04:13:07PM -0400, Goina, Cristian wrote:
> my testing the number of "mismatches" or reads without mates was
> relatively significant - close to 40%.

A quick question - are all these failures of one end, or is it a mixture of reads with either end failing? I'm just wondering if we could cheat and have REGN in the header for all cases but state in the ZTR specification that coordinates beyond the bounds of the actual data present are to be ignored. (Yes it's hacky and I'm not sure I like the idea.)

> I agree with you two DBH per panel is a pain - you have to build the
> data for the two DBH in memory or in some temporary area and then put
> them in the SRF instead of streaming data to the SRF file as you read
> it.

It sort of depends on how your existing code works. For illumina2srf I load an entire tile of data into memory anyway for the purpose of producing a tailored Huffman tree for all traces within it. The disk space saving was quite significant doing that.
How much memory are we talking about to store an entire panel's worth of data? Is it doable on an average machine or, maybe more pertinently, on the system supplied with the SOLiD instrument?

> My approach for now is to have a REGN chunk in the DBH and if the
> regions from the data block differ from the DBH add a REGN in the DB
> itself with the actual boundaries.

That saves 60% of the overhead I guess. I'm not sure we ever decreed what happens with multiple REGN chunks, but if there's only one single set of base calls then I'd assume we could state the most recent applies.

I have been thinking about how to handle traces with multiple sets of base calls. We already (maybe temporarily) are producing traces with two sets of confidence values: calibrated and uncalibrated. Plus obviously we have the ability to store multiple traces already. My conclusion was that in the simple case we can just keep things as they are, but to support more complex forms of ZTR we can add a GROUP or GROUP_BY meta-data key/value in chunks as a means to indicate which chunks are associated with each other.

James
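
The two REGN ideas in this message (a REGN in the DB overriding the one in the DBH because "the most recent applies", and the alternative of ignoring coordinates that run past the data actually present) can be sketched roughly as below. This is only an illustration: regions are modelled as hypothetical (start, end) pairs rather than the real REGN chunk encoding, and the function names are invented for the example, not taken from io_lib or the ZTR specification.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical representation: a region is a (start, end) pair, end exclusive.
Region = Tuple[int, int]


def effective_regions(dbh_regn: Sequence[Region],
                      db_regn: Optional[Sequence[Region]]) -> List[Region]:
    """'Most recent applies': a REGN in the DB overrides the one in the DBH."""
    return list(db_regn) if db_regn is not None else list(dbh_regn)


def clip_regions(regions: Sequence[Region], n_bases: int) -> List[Region]:
    """Alternative rule: ignore coordinates beyond the data actually present."""
    clipped = []
    for start, end in regions:
        if start >= n_bases:
            continue  # region lies wholly beyond the read, so drop it
        clipped.append((start, min(end, n_bases)))
    return clipped


# A DBH describing a 25+25 mate pair, and a DB holding only one 25-base tag.
dbh = [(0, 25), (25, 50)]
print(effective_regions(dbh, None))       # paired read: DBH regions apply
print(effective_regions(dbh, [(0, 25)]))  # unpaired read: DB REGN overrides
print(clip_regions(dbh, 25))              # same outcome via the clipping rule
```

Note that the clipping rule only gives the right answer when the missing tag is the last one, which is one reason the override approach may be preferable when either end can fail.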
From cgoina at jcvi.org Thu Jun 19 05:56:47 2008
From: cgoina at jcvi.org (Goina, Cristian)
Date: Thu Jun 19 05:56:54 2008
Subject: [Ssrformat] RE: new version for solid to srf in java
In-Reply-To: <20080619082936.GK27919@sanger.ac.uk>
References: <3EC2264E4927524DB3F1C0425831CCA51138D6@EXCHANGE.TIGR.ORG> <20080619082936.GK27919@sanger.ac.uk>
Message-ID: <3EC2264E4927524DB3F1C0425831CCA51138D8@EXCHANGE.TIGR.ORG>

James,

Jonathan would be in a better position to answer some of your questions but I'll give it a try too.

> -----Original Message-----
> From: James Bonfield [mailto:jkb@sanger.ac.uk]
> Sent: Thursday, June 19, 2008 4:30 AM
> To: Goina, Cristian
> Cc: ssrformat@bcgsc.ca; Asim Siddiqui
> Subject: Re: new version for solid to srf in java
>
> On Wed, Jun 18, 2008 at 04:13:07PM -0400, Goina, Cristian wrote:
> > my testing the number of "mismatches" or reads without mates was
> > relatively significant - close to 40%.
>
> A quick question - are all these failures of one end, or is it
> a mixture of reads with either end failing? I'm just
> wondering if we could cheat and have REGN in the header for
> all cases but state in the ZTR specification that coordinates
> beyond the bounds of the actual data present are to be
> ignored. (Yes it's hacky and I'm not sure I like the idea.)

The failures are not only at one end - it's more like a mixture of reads.

> > I agree with you two DBH per panel is a pain - you have to build the
> > data for the two DBH in memory or in some temporary area and then put
> > them in the SRF instead of streaming data to the SRF file as you read
> > it.
>
> It sort of depends on how your existing code works. For
> illumina2srf I load an entire tile of data into memory anyway
> for purpose of producing a tailored huffman tree for all
> traces within it. The disk space saving was quite significant
> doing that.
>
> How much memory are we talking about to store an entire
> panel's worth of data?
> Is it doable on an average machine or
> maybe more pertinently on the system supplied with the SOLiD
> instrument?

I don't know the exact number of sequences per panel, but I think the resolution of the camera is 4M pixels and it might be possible to have 4M sequences per panel - but again I may be wrong about this. I need to add more stats to the code to be able to answer these questions, and I cannot say I have analyzed sufficient data to have definitive answers.

> > My approach for now is to have a REGN chunk in the DBH and if the
> > regions from the data block differ from the DBH add a REGN in the DB
> > itself with the actual boundaries.
>
> That saves 60% of the overhead I guess. I'm not sure we ever
> decreed what happens with multiple REGN chunks, but if
> there's only one single set of base calls then I'd assume we
> could state the most recent applies.

The implementation as it is now is not ironclad. At some point, when I have the final SOLiD to SRF spec and if I have time, I'll try to bring it up to date - until then I preferred to leave things as they are.

> ... stuff deleted

Cristian

From asims at bcgsc.ca Thu Jun 19 13:59:53 2008
From: asims at bcgsc.ca (Asim Siddiqui)
Date: Thu Jun 19 14:00:08 2008
Subject: [Ssrformat] RE: new version for solid to srf in java
References: <3EC2264E4927524DB3F1C0425831CCA51138D6@EXCHANGE.TIGR.ORG> <20080619082936.GK27919@sanger.ac.uk> <3EC2264E4927524DB3F1C0425831CCA51138D8@EXCHANGE.TIGR.ORG>
Message-ID: <86C6E520C12E52429ACBCB01546DF4D301679804@xchange1.phage.bcgsc.ca>

I was surprised to calculate that even with XRLE2 compression it is cheaper to write an additional REGN block.

REGN block ~ 4 + 4 + 4 + 10 metadata = approx 22 bytes

Using XRLE2 compression, padding adds:
  for BASE:      3 bytes
  for CNF1:      3 bytes
  for each SAMP: 3 x 4 = 12 bytes; 4 x 12 = 48 bytes total

It is the 4 SAMP blocks that are the problem. If there is just one, then the situation is reversed.
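
The comparison above can be checked with a few lines of arithmetic. The byte counts are the estimates given in this message; the reading that the padding cost is BASE + CNF1 + 12 bytes per SAMP chunk is an assumption made for this sketch (it is consistent with the remark that a single SAMP chunk reverses the outcome), and the function names are illustrative only.

```python
# Estimates taken from the message above; the way they are combined is an
# assumption of this sketch, not something stated in the ZTR specification.
REGN_BYTES = 4 + 4 + 4 + 10                               # approx 22 bytes
XRLE2_PAD_BYTES = {"BASE": 3, "CNF1": 3, "SAMP": 3 * 4}   # per padded chunk


def padding_cost(n_samp_chunks: int) -> int:
    """Bytes added by padding one read: BASE + CNF1 + every SAMP chunk."""
    return (XRLE2_PAD_BYTES["BASE"]
            + XRLE2_PAD_BYTES["CNF1"]
            + n_samp_chunks * XRLE2_PAD_BYTES["SAMP"])


def cheaper_option(n_samp_chunks: int) -> str:
    """Return which approach costs fewer bytes for an unpaired read."""
    return "REGN" if REGN_BYTES < padding_cost(n_samp_chunks) else "pad"


for n in (4, 1):
    print(f"{n} SAMP chunk(s): REGN={REGN_BYTES} bytes, "
          f"padding={padding_cost(n)} bytes, cheaper={cheaper_option(n)}")
```

With four SAMP chunks the extra REGN block wins; with a single intensity chunk padding becomes the cheaper choice, which is the reversal the message describes.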
Jonathan, it might be worth revisiting the decision to use SAMP rather than SMP4 to store intensity data. Some of the compression schemes will have a better time compressing SMP4 as well.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080619/f14deb9e/attachment-0001.htm

From Jonathan.Manning at appliedbiosystems.com Thu Jun 19 14:49:15 2008
From: Jonathan.Manning at appliedbiosystems.com (Jonathan M Manning)
Date: Thu Jun 19 14:49:25 2008
Subject: [Ssrformat] RE: new version for solid to srf in java
In-Reply-To: <3EC2264E4927524DB3F1C0425831CCA51138D8@EXCHANGE.TIGR.ORG>
Message-ID:

I've attempted to respond to these questions below. Original questions may have been trimmed.

ssrformat-bounces@mail.bcgsc.ca wrote on 06/19/2008 08:56:47 AM:

> James,
>
> Jonathan would be in a better position to answer some of your questions
> but I'll give it a try too.
> > -----Original Message-----
> > From: James Bonfield [mailto:jkb@sanger.ac.uk]
> > Sent: Thursday, June 19, 2008 4:30 AM
> > To: Goina, Cristian
> > Cc: ssrformat@bcgsc.ca; Asim Siddiqui
> > Subject: Re: new version for solid to srf in java
> >
> > On Wed, Jun 18, 2008 at 04:13:07PM -0400, Goina, Cristian wrote:
> > > my testing the number of "mismatches" or reads without mates was
> > > relatively significant - close to 40%.

The existing pipeline is tuned for generating the highest quality single tag data. There is fairly aggressive filtering going on that artificially limits the pairing rate. A read is excluded from the output unless the instrument was able to make a call for *every* position in the read.

More permissive filter settings produce much higher pairing rates. Increasing the number of uncallable bases allowed will increase coverage and pairing rate, but may introduce more noise into applications such as SNP discovery. This is one reason the default settings are so strict.

That said, if existing datasets have 40% reads without mates, special handling of the unpaired reads in mate pair data would provide significant space savings. I do think it will become less of a factor going forward as the filtering is made less strict or deferred downstream to where additional context (mate pair quality) is available.

> > A quick question - are all these failures of one end, or is it
> > a mixture of reads with either end failing? I'm just
> > wondering if we could cheat and have REGN in the header for
> > all cases but state in the ZTR specification that coordinates
> > beyond the bounds of the actual data present are to be
> > ignored. (Yes it's hacky and I'm not sure I like the idea.)
>
> The failures are not only at one end - it's more like a mixture of
> reads.

The missing tag of the pair can be from either tag and is usually fairly balanced, F3:R3. Library construction methods and order of sequencing can bias this ratio towards one tag or the other.
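
To illustrate how the filter strictness described above interacts with the pairing rate, here is a toy sketch. Everything in it is hypothetical: the bead names, the reads, the use of "." to mark an uncalled position, and the `max_uncalled` parameter are invented for the example and are not the actual SOLiD pipeline settings or data.

```python
from typing import Dict, Tuple


def passes_filter(read: str, max_uncalled: int = 0) -> bool:
    """Keep a read only if it has at most max_uncalled uncalled positions.

    Uncalled positions are marked '.' here (an assumption of this sketch).
    The default of 0 mimics the strict rule: every position must be called.
    """
    return read.count(".") <= max_uncalled


def pairing_rate(beads: Dict[str, Tuple[str, str]], max_uncalled: int = 0) -> float:
    """Fraction of beads with at least one surviving tag that keep both tags."""
    kept = paired = 0
    for f3, r3 in beads.values():
        ok_f3 = passes_filter(f3, max_uncalled)
        ok_r3 = passes_filter(r3, max_uncalled)
        if ok_f3 or ok_r3:
            kept += 1
            if ok_f3 and ok_r3:
                paired += 1
    return paired / kept if kept else 0.0


# Toy (F3, R3) colour-space reads; not real instrument output.
beads = {
    "bead_1": ("0123", "3210"),
    "bead_2": ("01.3", "3210"),
    "bead_3": ("0123", "32.."),
    "bead_4": ("0.23", "3.10"),
}

print(pairing_rate(beads, max_uncalled=0))  # strict filter
print(pairing_rate(beads, max_uncalled=1))  # one uncalled position allowed
```

On this toy input the permissive setting keeps more beads and pairs a larger share of them, which is the qualitative effect described in the message; the actual figures for real runs will differ.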
> > > I agree with you two DBH per panel is a pain - you have to build the
> > > data for the two DBH in memory or in some temporary area and then put
> > > them in the SRF instead of streaming data to the SRF file as you read
> > > it.
> >
> > It sort of depends on how your existing code works. For
> > illumina2srf I load an entire tile of data into memory anyway
> > for purpose of producing a tailored huffman tree for all
> > traces within it. The disk space saving was quite significant
> > doing that.
> >
> > How much memory are we talking about to store an entire
> > panel's worth of data? Is it doable on an average machine or
> > maybe more pertinently on the system supplied with the SOLiD
> > instrument?
>
> I don't know the exact number of sequences per panel but I think the
> resolution of the camera is 4M pixels and it might be possible to have
> 4M sequences per panel - but again I may be wrong about this. I need to
> add more stats to the code to be able to answer these questions and I
> cannot say I analyzed sufficient data to have definitive answers for
> these questions.

Due to several physical/optical limitations, the practical limit is well below 4M. Given the memory available on a single node of the SOLiD cluster (8G), I'd guess you'd easily fit a full panel into memory. With the potential optimizations gained by processing a panel/tile at a time, it would seem to make sense to do so.

> > > My approach for now is to have a REGN chunk in the DBH and if the
> > > regions from the data block differ from the DBH add a REGN in the DB
> > > itself with the actual boundaries.
> >
> > That saves 60% of the overhead I guess. I'm not sure we ever
> > decreed what happens with multiple REGN chunks, but if
> > there's only one single set of base calls then I'd assume we
> > could state the most recent applies.

Again, given a higher pairing rate gained by less stringent filtering this would have a smaller impact.
(Less overhead, anyway, more total data.) The relative merits of the various methods for storing mixed pairs (override REGN, padding read, separate DBH for paired/unpaired) are something that could be discussed further. The compression achieved will vary for given read lengths and pairing rates.

> The implementation as it is now is not ironclad. At some point when I
> have the final SOLiD to SRF spec if I have time I'll try to bring it up
> to date - until then I preferred to leave things as they are.

We hope to release a final version of the SOLiD to SRF spec soon. It will be sent to the mailing list when it's officially available.

~Jonathan

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080619/12834d03/attachment.htm

From asims at bcgsc.ca Wed Jun 25 17:48:07 2008
From: asims at bcgsc.ca (Asim Siddiqui)
Date: Wed Jun 25 17:55:01 2008
Subject: [Ssrformat] updated SRF and ZTR specs attached
Message-ID: <86C6E520C12E52429ACBCB01546DF4D301679808@xchange1.phage.bcgsc.ca>

Skipped content of type multipart/alternative

-------------- next part --------------
A non-text attachment was scrubbed...
Name: SRF_v_1_3_2_June_19th_2008.pdf
Type: application/pdf
Size: 237392 bytes
Desc: SRF_v_1_3_2_June_19th_2008.pdf
Url : http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080625/b657835d/SRF_v_1_3_2_June_19th_2008-0001.pdf

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ZTR V1.4 Draft 25th June.pdf
Type: application/pdf
Size: 409639 bytes
Desc: ZTR V1.4 Draft 25th June.pdf
Url : http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080625/b657835d/ZTRV1.4Draft25thJune-0001.pdf