From ctoma at broad.mit.edu Thu May 1 14:58:46 2008
From: ctoma at broad.mit.edu (Camil Toma)
Date: Thu May 1 14:58:55 2008
Subject: [Ssrformat] Read header readFlags byte
In-Reply-To: <002d01c8a7bd$13161150$394233f0$@com>
References: <3EC2264E4927524DB3F1C0425831CCA5113831@EXCHANGE.TIGR.ORG>
	<86C6E520C12E52429ACBCB01546DF4D3016797C0@xchange1.phage.bcgsc.ca>
	<000601c8a758$eecb2330$cc616990$@com>
	<86C6E520C12E52429ACBCB01546DF4D3016797C2@xchange1.phage.bcgsc.ca>
	<002d01c8a7bd$13161150$394233f0$@com>
Message-ID: <481A3D16.1050201@broad.mit.edu>

Let me know if this sounds reasonable. The motivation for this is that
the Broad Institute wants to submit 1000 genomes SRFs containing all
solexa reads regardless of chastity filter, whereas the SRA requires
that if we submit bad reads they should at least be marked as such.
Not sure if any of the other centers are interested in this
functionality or if they just filter out the reads.

With regard to the readFlags byte in the read header, the SRF spec
states:

Field        Description
--------------------------------------------------------------------------------
readFlags    Each bit may be set to indicate a status for the read.
             The bits are set to 1 if the flag is true.
             Bit 1:    Is the read bad?
             Bit 2:    Has the read been withdrawn?
             Bits 3-5: reserved
             Bits 6-8: user definable

It would be clearer if the actual bit numbering convention used were
specified. I suggest "MSB 0" numbering, where bit 0 is the most
significant bit. Do we want to change this to say something like this?

Field        Description
--------------------------------------------------------------------------------
readFlags    Each bit may be set to indicate a status for the read.
             The bits are set to 1 if the flag is true using "MSB 0"
             numbering:
             Bit 0:    Is the read bad?
             Bit 1:    Has the read been withdrawn?
             Bits 2-4: reserved
             Bits 5-7: user definable

For example, if we want to mark a read as bad, the bits will be
1000 0000.
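To make the numbering question concrete, here is a minimal Python sketch of how the "bad read" flag maps onto an 8-bit readFlags byte under "MSB 0" numbering and, for contrast, under "LSB 0" numbering. It is illustrative only: the function and constant names are hypothetical and are not taken from the SRF spec or from io_lib.

```python
# Illustrative only: these names are hypothetical, not part of the SRF spec or io_lib.

def mask_msb0(bit: int) -> int:
    """Return the byte mask for `bit` when bit 0 is the most significant bit."""
    if not 0 <= bit <= 7:
        raise ValueError("bit must be in 0..7")
    return 0x80 >> bit


def mask_lsb0(bit: int) -> int:
    """Return the byte mask for `bit` when bit 0 is the least significant bit."""
    if not 0 <= bit <= 7:
        raise ValueError("bit must be in 0..7")
    return 1 << bit


BAD_BIT = 0        # "Is the read bad?"
WITHDRAWN_BIT = 1  # "Has the read been withdrawn?"

if __name__ == "__main__":
    for name, mask in (("MSB 0", mask_msb0), ("LSB 0", mask_lsb0)):
        flags = mask(BAD_BIT)
        # Show the same logical flag as a bit pattern under each convention.
        print(f"{name}: bad read -> {flags:08b} (0x{flags:02x})")
```

Under "MSB 0" the bad-read flag is the pattern 1000 0000 given above; under "LSB 0" the same logical flag would be 0000 0001, which is why the convention has to be stated in the spec.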
I'd like to add some functionality to the solexa2srf program to support
creating SRF files from solexa data where the bad reads are included in
the SRF and the readFlags byte is set to the correct value to specify
this. Currently solexa2srf only supports filtering of reads based on
the chastity value, which is specified via the "-c float" option. I'd
like to add a "-C float" option to mean "Don't filter reads that fail
the chastity test, mark those reads as bad". -c and -C will be mutually
exclusive options.

I'd also like to add -C to the srf2solexa program to allow for
filtering of reads on decompression back to native solexa.

So, for example, one could create an SRF with the option "-C 0.60",
resulting in an SRF that contains all solexa reads, except that those
that fall below the 0.60 chastity threshold will be marked as bad in
their read header. One could then invoke:

1) srf2solexa my_data.srf
   This will cause all the reads to be uncompressed, regardless of the
   readFlags setting.

2) srf2solexa -C my_data.srf
   This will only uncompress those reads that are marked good. That is,
   bit 0 is set to value 0.

Thanks,
-camil

From asims at bcgsc.ca Thu May 1 21:09:49 2008
From: asims at bcgsc.ca (Asim Siddiqui)
Date: Thu May 1 21:09:59 2008
Subject: [Ssrformat] Read header readFlags byte
References: <3EC2264E4927524DB3F1C0425831CCA5113831@EXCHANGE.TIGR.ORG>
	<86C6E520C12E52429ACBCB01546DF4D3016797C0@xchange1.phage.bcgsc.ca>
	<000601c8a758$eecb2330$cc616990$@com>
	<86C6E520C12E52429ACBCB01546DF4D3016797C2@xchange1.phage.bcgsc.ca>
	<002d01c8a7bd$13161150$394233f0$@com>
	<481A3D16.1050201@broad.mit.edu>
Message-ID: <86C6E520C12E52429ACBCB01546DF4D3016797CC@xchange1.phage.bcgsc.ca>

Camil.

Thanks - I'll add text to the spec to clarify that bit 0 is the MSB.

(Your suggested change to solexa2srf looks fine to me, though James
owns that code.)

Asim

_______________________________________________
Ssrformat mailing list
Ssrformat@mail.bcgsc.ca
http://www.bcgsc.ca/mailman/listinfo/ssrformat
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080501/e16645e7/attachment-0001.htm

From jkb at sanger.ac.uk Fri May 2 01:23:11 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Fri May 2 01:23:18 2008
Subject: [Ssrformat] Read header readFlags byte
In-Reply-To: <481A3D16.1050201@broad.mit.edu>
References: <3EC2264E4927524DB3F1C0425831CCA5113831@EXCHANGE.TIGR.ORG>
	<86C6E520C12E52429ACBCB01546DF4D3016797C0@xchange1.phage.bcgsc.ca>
	<000601c8a758$eecb2330$cc616990$@com>
	<86C6E520C12E52429ACBCB01546DF4D3016797C2@xchange1.phage.bcgsc.ca>
	<002d01c8a7bd$13161150$394233f0$@com>
	<481A3D16.1050201@broad.mit.edu>
Message-ID: <20080502082311.GA4199@sanger.ac.uk>

On Thu, May 01, 2008 at 05:58:46PM -0400, Camil Toma wrote:
> It would be more clear if the actual bit numbering convention used is
> specified. I suggest most significant bit 0 numbering. Do we want to
> change this to say something like this?

Agreed it needs clarifying, if only because my own interpretation was
bit 0 meaning the least significant bit!

Two quick google searches for "bit 0 LSB" and "bit 0 MSB" (hardly
scientific!) show 8140 and 1810 matches respectively. I can see the
logic either way, but my preference is for bit 0 => LSB. It's not
something I'd lose sleep over though :-)

> I'd like to add some functionality to the solexa2srf program to support
> creating SRF files from solexa data where the bad reads are included in the
> SRF and the readFlags byte is set to the correct value to specify this.
> Currently the solexa2srf only supports filtering of reads based on the
> chastity value, which is specified via the "-c float" option. I'd like to
> add a "-C float" option to mean "Don't filter reads that fail the chastity
> test, mark those reads as bad". -c and -C will be exclusive options.

Agreed. The same topic had come up here in fact.

Is Come Raczy of Illumina on this list? I'm CCing this to him just in
case.

Illumina are looking to take solexa2srf and make it an in-house
supported tool (renamed to illumina2srf), but via joint development in
the io_lib portion of the Staden Package CVS tree at SourceForge. I'm
more than happy with this scenario as I only wrote the thing in the
first place out of local needs and this means I don't have to support
it :-). I need to do a bit of quick merging first though.

If you want to make the changes yourself, Camil, we can get those
folded in too.

> I'd also like to add -C to the srf2solexa program to allow for filtering of
> reads on decompression back to native solexa.

Sounds like a plan.

James

--
James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
                                  | Plurima gyrabant gymbolitare vabo;
  A Staden Package developer:     | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi.

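The difference between the existing "-c" behaviour and the proposed "-C" behaviour can be sketched as follows. This is a hypothetical Python illustration, not the solexa2srf or srf2solexa code: the Read class, the function names, and the choice of the least significant bit for the bad-read flag are all assumptions made for the example. A read exactly at the threshold is treated as passing, since only reads that "fall below" it are described as bad.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

# Assumption: bit 0 is the LSB. Under "MSB 0" numbering this would be 0x80.
BAD_READ_MASK = 0x01


@dataclass
class Read:
    """Hypothetical stand-in for a read with its chastity value and readFlags byte."""
    name: str
    chastity: float
    flags: int = 0


def filter_reads(reads: Iterable[Read], threshold: float) -> Iterator[Read]:
    """'-c' style: drop reads whose chastity is below the threshold."""
    for read in reads:
        if read.chastity >= threshold:
            yield read


def mark_reads(reads: Iterable[Read], threshold: float) -> Iterator[Read]:
    """'-C' style: keep every read, but flag those below the threshold as bad."""
    for read in reads:
        if read.chastity < threshold:
            read.flags |= BAD_READ_MASK
        yield read


def good_reads(reads: Iterable[Read]) -> Iterator[Read]:
    """srf2solexa '-C' style: emit only reads whose bad-read bit is clear."""
    return (read for read in reads if not read.flags & BAD_READ_MASK)


if __name__ == "__main__":
    reads = [Read("r1", 0.95), Read("r2", 0.40)]
    marked = list(mark_reads(reads, 0.60))
    print([(r.name, r.flags) for r in marked])
    print([r.name for r in good_reads(marked)])
```

With marking, the decision to discard low-chastity reads is deferred to whoever unpacks the file, which is what allows one SRF to serve both the "all reads" and the "good reads only" use cases.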
--
 The Wellcome Trust Sanger Institute is operated by Genome Research
 Limited, a charity registered in England with number 1021457 and a
 company registered in England with number 2742969, whose registered
 office is 215 Euston Road, London, NW1 2BE.

From ctoma at broad.mit.edu Fri May 2 05:48:47 2008
From: ctoma at broad.mit.edu (Camil Toma)
Date: Fri May 2 05:48:55 2008
Subject: [Ssrformat] Read header readFlags byte
In-Reply-To: <20080502082311.GA4199@sanger.ac.uk>
References: <3EC2264E4927524DB3F1C0425831CCA5113831@EXCHANGE.TIGR.ORG>
	<86C6E520C12E52429ACBCB01546DF4D3016797C0@xchange1.phage.bcgsc.ca>
	<000601c8a758$eecb2330$cc616990$@com>
	<86C6E520C12E52429ACBCB01546DF4D3016797C2@xchange1.phage.bcgsc.ca>
	<002d01c8a7bd$13161150$394233f0$@com>
	<481A3D16.1050201@broad.mit.edu>
	<20080502082311.GA4199@sanger.ac.uk>
Message-ID: <481B0DAF.3050800@broad.mit.edu>

I have no problem with using LSB 0 ordering. I'll make the changes to
solexa2srf/srf2solexa here and email the diffs when they are ready.

-camil

From ISingh at jcvi.org Fri May 2 08:23:03 2008
From: ISingh at jcvi.org (Singh, Indresh)
Date: Fri May 2 08:23:11 2008
Subject: [Ssrformat] SOLID SRF implementation
In-Reply-To: <86C6E520C12E52429ACBCB01546DF4D3016797CC@xchange1.phage.bcgsc.ca>
References: <3EC2264E4927524DB3F1C0425831CCA5113831@EXCHANGE.TIGR.ORG>
	<86C6E520C12E52429ACBCB01546DF4D3016797C0@xchange1.phage.bcgsc.ca>
	<000601c8a758$eecb2330$cc616990$@com>
	<86C6E520C12E52429ACBCB01546DF4D3016797C2@xchange1.phage.bcgsc.ca>
	<002d01c8a7bd$13161150$394233f0$@com>
	<481A3D16.1050201@broad.mit.edu>
	<86C6E520C12E52429ACBCB01546DF4D3016797CC@xchange1.phage.bcgsc.ca>
Message-ID:

Hi Asim,

It was nice talking to you at BIO-IT.

We are very close to finishing the SOLID SRF implementation and we
would like to submit the source code to SourceForge, but the question
is where it should go, as all the code for SOLID SRF is in Java.

Some options are:

1. Create a different tree (/cvsroot/srf/java) in the existing CVS
   (cvsroot/srf) for the Java implementation.

2. Create a different SourceForge project. I don't like this
   personally, as I would like to keep all the SRF implementations in
   one project.

Regards,
Indresh

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080502/2082ff7b/attachment.htm

From asims at bcgsc.ca Fri May 2 09:13:37 2008
From: asims at bcgsc.ca (Asim Siddiqui)
Date: Fri May 2 09:13:48 2008
Subject: [Ssrformat] SOLID SRF implementation
References: <3EC2264E4927524DB3F1C0425831CCA5113831@EXCHANGE.TIGR.ORG>
	<86C6E520C12E52429ACBCB01546DF4D3016797C0@xchange1.phage.bcgsc.ca>
	<000601c8a758$eecb2330$cc616990$@com>
	<86C6E520C12E52429ACBCB01546DF4D3016797C2@xchange1.phage.bcgsc.ca>
	<002d01c8a7bd$13161150$394233f0$@com>
	<481A3D16.1050201@broad.mit.edu>
	<86C6E520C12E52429ACBCB01546DF4D3016797CC@xchange1.phage.bcgsc.ca>
Message-ID: <86C6E520C12E52429ACBCB01546DF4D3016797CD@xchange1.phage.bcgsc.ca>

Hi Indresh,

I agree that it makes sense to create a different set of directories
in the existing project rather than start a new project. I can provide
you with developer access to the repositories.

One request: can you provide a description of your use of ZTR blocks
for inclusion in the spec?

James (or Come), if I could get the same for Illumina that would be
great.

Asim

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080502/961c3db8/attachment-0001.htm

From cgoina at jcvi.org Fri May 2 09:17:00 2008
From: cgoina at jcvi.org (Goina, Cristian)
Date: Fri May 2 09:17:27 2008
Subject: [Ssrformat] SOLID SRF implementation
In-Reply-To: <86C6E520C12E52429ACBCB01546DF4D3016797CD@xchange1.phage.bcgsc.ca>
References: <3EC2264E4927524DB3F1C0425831CCA5113831@EXCHANGE.TIGR.ORG>
	<86C6E520C12E52429ACBCB01546DF4D3016797C0@xchange1.phage.bcgsc.ca>
	<000601c8a758$eecb2330$cc616990$@com>
	<86C6E520C12E52429ACBCB01546DF4D3016797C2@xchange1.phage.bcgsc.ca>
	<002d01c8a7bd$13161150$394233f0$@com>
	<481A3D16.1050201@broad.mit.edu>
	<86C6E520C12E52429ACBCB01546DF4D3016797CC@xchange1.phage.bcgsc.ca>
	<86C6E520C12E52429ACBCB01546DF4D3016797CD@xchange1.phage.bcgsc.ca>
Message-ID: <3EC2264E4927524DB3F1C0425831CCA5113842@EXCHANGE.TIGR.ORG>

Asim,

I'm trying to put together a somewhat more formal document than the
one I submitted to the group; it will be submitted with the code as
well.

--cristian

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080502/b9aa9261/attachment.htm

From ISingh at jcvi.org Fri May 2 09:19:09 2008
From: ISingh at jcvi.org (Singh, Indresh)
Date: Fri May 2 09:19:16 2008
Subject: [Ssrformat] SOLID SRF implementation
In-Reply-To: <86C6E520C12E52429ACBCB01546DF4D3016797CD@xchange1.phage.bcgsc.ca>
References: <3EC2264E4927524DB3F1C0425831CCA5113831@EXCHANGE.TIGR.ORG>
	<86C6E520C12E52429ACBCB01546DF4D3016797C0@xchange1.phage.bcgsc.ca>
	<000601c8a758$eecb2330$cc616990$@com>
	<86C6E520C12E52429ACBCB01546DF4D3016797C2@xchange1.phage.bcgsc.ca>
	<002d01c8a7bd$13161150$394233f0$@com>
	<481A3D16.1050201@broad.mit.edu>
	<86C6E520C12E52429ACBCB01546DF4D3016797CC@xchange1.phage.bcgsc.ca>
	<86C6E520C12E52429ACBCB01546DF4D3016797CD@xchange1.phage.bcgsc.ca>
Message-ID:

Cristian is working on SOLID SRF, so please add him to the developers
list. He is formalizing the SOLID SRF document and should have it
ready for review soon.

Regards,
Indresh

We are very close to finishing the SOLID SRF implementation and we would like to submit the source code in sourceforge, but the question is where it should be submitted as all the code for SOLID SRF is in Java. Some options are 1. Create a different tree (/cvsroot/srf/java) in existing CVS (cvsroot/srf) for Java implementation. 2. Create different sourceforge project. I don't like this personally as I would like to keep all the SRF implementation in one project. Regards, Indresh From: ssrformat-bounces@mail.bcgsc.ca [mailto:ssrformat-bounces@mail.bcgsc.ca] On Behalf Of Asim Siddiqui Sent: Friday, May 02, 2008 12:10 AM To: ctoma@broad.mit.edu; ssrformat@bcgsc.ca Subject: RE: [Ssrformat] Read header readFlags byte Camil. Thanks - I'll add text to the spec to clarify that bit 0 is the MSB. (Your suggested change to solexa2srf looks fine to me, though James owns that code.) Asim ________________________________ From: ssrformat-bounces@mail.bcgsc.ca on behalf of Camil Toma Sent: Thu 01/05/2008 2:58 PM To: ssrformat@bcgsc.ca Subject: [Ssrformat] Read header readFlags byte Let me know if this sounds reasonable. The motivation for this is that the Broad Institute wants to submit 1000 genomes SRFs containing all solexa reads regardless of chastity filter, whereas the SRA requires that if we submit bad reads they should at least marked as such. Not sure if any of the other centers are interested in this functionality or if they just filter out the reads. With regards to the readFlags byte in the read header, the SRF spec states: Field Description ------------------------------------------------------------------------ -------- readFlags Each bit may be set to indicate a status for the read. The bits are set to 1 if the flag is true. Bit 1: Is the read bad? Bit 2: Has the read been withdrawn? Bits 3-5: reserved Bits 6-8: user definable It would be more clear if the actual bit numbering convention used is specified. I suggest most significant bit 0 numbering. 
Do we want to change this to say something like this? Field Description ------------------------------------------------------------------------ -------- readFlags Each bit may be set to indicate a status for the read. The bits are set to 1 if the flag is true using "MSB 0" numbering: Bit 0: Is the read bad? Bit 1: Has the read been withdrawn? Bits 2-4: reserved Bits 5-7: user definable For example, if we want to mark a read as bad, the bits will be 1000 0000. I'd like to add some functionality to the solexa2srf program to support creating SRF files from solexa data where the bad reads are included in the SRF and the readFlags byte is set to the correct value to specify this. Currently the solexa2srf only supports filtering of reads based on the chastity value, which is specified via the "-c float" option. I'd like to add a "-C float" option to mean "Don't filter reads that fail the chastity test, mark those reads as bad". -c and -C will be exclusive options. I'd also like to add -C to the srf2solexa program to allow for filtering of reads on decompression back to native solexa. So, for example, one could create an SRF with the option "-C 0.60", resulting in an SRF that will contain all solexa reads, only that those that fall below the 0.60 chastity threshold will be marked as bad in their read header. One could then invoke: 1) srf2solexa my_data.srf This will cause all the reads to be uncompressed, regardless of the readFlags setting. 2) srf2solexa -C my_data.srf This will only uncompress those reads that are marked good. That is, bit 0 is set to value 0. Thanks, -camil _______________________________________________ Ssrformat mailing list Ssrformat@mail.bcgsc.ca http://www.bcgsc.ca/mailman/listinfo/ssrformat -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080502/511b6c7c/attachment-0001.htm

From asims at bcgsc.ca Fri May 2 09:30:50 2008
From: asims at bcgsc.ca (Asim Siddiqui)
Date: Fri May 2 09:35:14 2008
Subject: [Ssrformat] SOLID SRF implementation
References: <3EC2264E4927524DB3F1C0425831CCA5113831@EXCHANGE.TIGR.ORG> <86C6E520C12E52429ACBCB01546DF4D3016797C0@xchange1.phage.bcgsc.ca> <000601c8a758$eecb2330$cc616990$@com> <86C6E520C12E52429ACBCB01546DF4D3016797C2@xchange1.phage.bcgsc.ca><002d01c8a7bd$13161150$394233f0$@com><481A3D16.1050201@broad.mit.edu> <86C6E520C12E52429ACBCB01546DF4D3016797CC@xchange1.phage.bcgsc.ca> <86C6E520C12E52429ACBCB01546DF4D3016797CD@xchange1.phage.bcgsc.ca>
Message-ID: <86C6E520C12E52429ACBCB01546DF4D3016797D3@xchange1.phage.bcgsc.ca>

Sounds good - BTW how are you handling ZTR? Presumably this means that there will be code in SRF that is a ZTR implementation?

On a related note, in a rewrite of the SRF C++ code, I'm creating a "base" directory that doesn't have any dependencies on ZTR. The only one I'm unable to remove is os.h, which handles endianness. I've been tempted to copy this code to SRF. Any thoughts on that?

Asim

________________________________
From: Singh, Indresh [mailto:ISingh@jcvi.org]
Sent: Fri 02/05/2008 9:19 AM
To: Asim Siddiqui; ssrformat@bcgsc.ca
Cc: Goina, Cristian; Jonathan.Manning@appliedbiosystems.com
Subject: RE: [Ssrformat] SOLID SRF implementation

Cristian is working on SOLID SRF so please add him to the developers list. He is formalizing the SOLID SRF document and should have it ready for review soon.

Regards,
Indresh

From: Asim Siddiqui [mailto:asims@bcgsc.ca]
Sent: Friday, May 02, 2008 12:14 PM
To: Singh, Indresh; ssrformat@bcgsc.ca
Cc: Goina, Cristian; Jonathan.Manning@appliedbiosystems.com
Subject: RE: [Ssrformat] SOLID SRF implementation

Hi Indresh,

I think it makes sense as well to create a different set of directories in the existing project rather than start a new project. I can provide you with developer access to the repositories.

One request: can you provide a description of your use of ZTR blocks for inclusion in the spec? James (or Come), if I could get the same for Illumina that would be great.

Asim

From cgoina at jcvi.org Fri May 2 09:52:22 2008
From: cgoina at jcvi.org (Goina, Cristian)
Date: Fri May 2 09:52:29 2008
Subject: [Ssrformat] SOLID SRF implementation
In-Reply-To: <86C6E520C12E52429ACBCB01546DF4D3016797D3@xchange1.phage.bcgsc.ca>
References: <3EC2264E4927524DB3F1C0425831CCA5113831@EXCHANGE.TIGR.ORG> <86C6E520C12E52429ACBCB01546DF4D3016797C0@xchange1.phage.bcgsc.ca> <000601c8a758$eecb2330$cc616990$@com> <86C6E520C12E52429ACBCB01546DF4D3016797C2@xchange1.phage.bcgsc.ca><002d01c8a7bd$13161150$394233f0$@com><481A3D16.1050201@broad.mit.edu> <86C6E520C12E52429ACBCB01546DF4D3016797CC@xchange1.phage.bcgsc.ca> <86C6E520C12E52429ACBCB01546DF4D3016797CD@xchange1.phage.bcgsc.ca> <86C6E520C12E52429ACBCB01546DF4D3016797D3@xchange1.phage.bcgsc.ca>
Message-ID: <3EC2264E4927524DB3F1C0425831CCA5113846@EXCHANGE.TIGR.ORG>

You are correct. I pretty much have a full implementation of ZTR in Java. By the time James sent me the pointer to the existing implementation I already had it, and that implementation doesn't have support for STHUFF anyway, so I chose to add support for STHUFF to our code.

Handling endianness in C or C++ can be done using a simple macro. I don't see any problem in taking os.h, but you may still want to add autoconf support and generate the dependencies correctly using configure, as James did with io_lib.
Or you can implement your code for both BE and LE architectures and pick the right branch at runtime.

Cristian
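As a rough illustration of the two approaches mentioned above (a byte-swap macro versus code that works on both BE and LE hosts), here is a minimal sketch. It is in Python rather than C/C++ purely for brevity, the helper names are invented for this example and do not come from io_lib or the SRF code, and it assumes the on-disk integers are big-endian.

```python
import struct
import sys


def host_is_little_endian() -> bool:
    """Detect the host byte order at runtime (hypothetical helper)."""
    # Pack 1 as a native-order 16-bit integer and inspect the first byte.
    return struct.pack("=H", 1)[0] == 1


def swap_uint32(value: int) -> int:
    """Reverse the byte order of a 32-bit value, as a swap macro would."""
    return int.from_bytes(value.to_bytes(4, "big"), "little")


def read_be_uint32(buf: bytes, offset: int = 0) -> int:
    """Decode a 4-byte unsigned integer, assumed big-endian on disk.

    The value is assembled byte by byte with shifts, so the result is the
    same on BE and LE hosts and no per-architecture branch is needed.
    """
    b0, b1, b2, b3 = buf[offset:offset + 4]
    return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3


if __name__ == "__main__":
    print("host byte order:", sys.byteorder)
    print(hex(read_be_uint32(bytes([0x12, 0x34, 0x56, 0x78]))))
```

Decoding with explicit shifts sidesteps the runtime branch altogether, since the result does not depend on the host byte order; the swap helper is only needed if values are first read in native order.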
From jkb at sanger.ac.uk Fri May 2 09:55:00 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Fri May 2 09:55:07 2008
Subject: [Ssrformat] SOLID SRF implementation
In-Reply-To: <86C6E520C12E52429ACBCB01546DF4D3016797D3@xchange1.phage.bcgsc.ca>
References: <3EC2264E4927524DB3F1C0425831CCA5113831@EXCHANGE.TIGR.ORG> <86C6E520C12E52429ACBCB01546DF4D3016797C0@xchange1.phage.bcgsc.ca> <000601c8a758$eecb2330$cc616990$@com> <86C6E520C12E52429ACBCB01546DF4D3016797CC@xchange1.phage.bcgsc.ca> <86C6E520C12E52429ACBCB01546DF4D3016797CD@xchange1.phage.bcgsc.ca> <86C6E520C12E52429ACBCB01546DF4D3016797D3@xchange1.phage.bcgsc.ca>
Message-ID: <20080502165459.GE4199@sanger.ac.uk>

On Fri, May 02, 2008 at 09:30:50AM -0700, Asim Siddiqui wrote:
> On a related note, in a rewrite of the SRF C++ code, I'm creating a
> "base" directory that doesn't have any dependencies on ZTR. The only
> one I'm unable to remove is os.h which handles endianness. I've been
> tempted to copy this code to SRF. Any thoughts on that?

Io_lib is seriously legacy code, so I can understand your desire to rid yourself of too many links. (There are parts of it over 15 years old - the ABI reading code has dates in the source dating back to 1990, and it likely is derived from earlier bits too.)

I think ideally the SRF code should be agnostic to the data type held within it, at least at the core level. A second layer built on top of it for the more useful tools (srf2fasta, etc) obviously would still need knowledge of ZTR.
James

--
James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
                                  | Plurima gyrabant gymbolitare vabo;
  A Staden Package developer:     | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi.

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

From jkb at sanger.ac.uk Fri May 2 09:59:23 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Fri May 2 09:59:29 2008
Subject: [Ssrformat] SOLID SRF implementation
In-Reply-To: <3EC2264E4927524DB3F1C0425831CCA5113846@EXCHANGE.TIGR.ORG>
References: <3EC2264E4927524DB3F1C0425831CCA5113831@EXCHANGE.TIGR.ORG> <86C6E520C12E52429ACBCB01546DF4D3016797C0@xchange1.phage.bcgsc.ca> <000601c8a758$eecb2330$cc616990$@com> <86C6E520C12E52429ACBCB01546DF4D3016797CC@xchange1.phage.bcgsc.ca> <86C6E520C12E52429ACBCB01546DF4D3016797CD@xchange1.phage.bcgsc.ca> <86C6E520C12E52429ACBCB01546DF4D3016797D3@xchange1.phage.bcgsc.ca> <3EC2264E4927524DB3F1C0425831CCA5113846@EXCHANGE.TIGR.ORG>
Message-ID: <20080502165923.GF4199@sanger.ac.uk>

On Fri, May 02, 2008 at 12:52:22PM -0400, Goina, Cristian wrote:
> As for handling the endianness in c or c++ can be done using a simple
> macro. I don't see any problem in taking the os.h but you may still want
> to add the autoconf support and generate the dependencies correctly
> using configure as James did with io_lib. Or you can implement your code
> both for BE and LE architectures and decide at runtime the right branch.

It's worth pointing out that I don't really use autoconf properly for io_lib. The code is just one directory of a much larger package (that doesn't use autoconf, ironically for portability reasons), so many of the proper autoconf #defines for endianness etc aren't consulted by io_lib's C code. I know it's messy.

These days autoconf seems to work much better on Windows under MinGW/MSYS, so I can probably convert the entire Staden Package to using it, which would help things. Although that means using libtool for the shared libraries, and I've never had anything but woe down that route. Indeed I removed it from io_lib in the end as libtool wouldn't run properly on many of the systems I needed to support.

Agh, when did auto* stop being a means to write portable code? Sorry, rant over :-)

James

From cgoina at jcvi.org Fri May 2 14:02:42 2008
From: cgoina at jcvi.org (Goina, Cristian)
Date: Fri May 2 14:02:53 2008
Subject: [Ssrformat] SOLID SRF implementation
In-Reply-To:
References: <3EC2264E4927524DB3F1C0425831CCA5113831@EXCHANGE.TIGR.ORG> <86C6E520C12E52429ACBCB01546DF4D3016797C0@xchange1.phage.bcgsc.ca> <000601c8a758$eecb2330$cc616990$@com> <86C6E520C12E52429ACBCB01546DF4D3016797C2@xchange1.phage.bcgsc.ca><002d01c8a7bd$13161150$394233f0$@com><481A3D16.1050201@broad.mit.edu> <86C6E520C12E52429ACBCB01546DF4D3016797CC@xchange1.phage.bcgsc.ca> <86C6E520C12E52429ACBCB01546DF4D3016797CD@xchange1.phage.bcgsc.ca>
Message-ID: <3EC2264E4927524DB3F1C0425831CCA5113848@EXCHANGE.TIGR.ORG>

Asim,

I've been looking at the SRF sourceforge project site and trying to decide where to put this code.
Currently there is

/
-srf
  |-docs
  |-src
  |-test

I was thinking of adding my Java implementation at the same level as srf, and then later on docs could be moved to the same level as well

/
-srf
  |-docs
  |-src
  |-test
-srf-java
  |-src
  |-lib
  |-testdata

or under srf we can create a folder for c++ and move everything that is currently in srf into the c++ folder, except the docs and the testdata. The testdata can also contain folders with samples specific to different sequencers such as solexa, 454, solid

/
-srf
  |-c++
    |-src
    |-test
  |-java
    |-src
    |-lib
  |-fortran (when an implementation is available - actually I think I found a fortran implementation already somewhere on the web)
  |-perl (when an implementation is available)
  |-python (when an implementation is available)
  |-docs
  |-testdata
    |-454-samples
    |-solexa-samples
    |-solid-samples

My preference is for the latter, but I can live with either one. Or if you have a different idea, just let me know where we should put this code.

Thanks
Cristian
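The last layout above can be written down as a small script that creates the empty skeleton. This is only a sketch: the nesting is one reading of the tree as posted, the fortran/perl/python entries are left out because no implementation exists yet, and `create_tree` and `PROPOSED_LAYOUT` are made-up names, not part of any SRF tooling.

```python
import tempfile
from pathlib import Path

# One reading of the proposed layout (an assumption, see the note above).
# Each key is a directory name; each value holds its subdirectories.
PROPOSED_LAYOUT = {
    "srf": {
        "c++": {"src": {}, "test": {}},
        "java": {"src": {}, "lib": {}},
        "docs": {},
        "testdata": {
            "454-samples": {},
            "solexa-samples": {},
            "solid-samples": {},
        },
    }
}


def create_tree(root: Path, layout: dict) -> list:
    """Create the nested directories under root and return their paths."""
    created = []
    for name, children in layout.items():
        path = root / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
        created.extend(create_tree(path, children))
    return created


def demo() -> list:
    """Build the skeleton in a throwaway directory and list what was made."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        paths = create_tree(root, PROPOSED_LAYOUT)
        return sorted(p.relative_to(root).as_posix() for p in paths)


if __name__ == "__main__":
    for line in demo():
        print(line)
```

Keeping docs and testdata outside the per-language folders means every implementation can share the same sample files.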
Asim ________________________________ From: Singh, Indresh [mailto:ISingh@jcvi.org] Sent: Fri 02/05/2008 8:23 AM To: Asim Siddiqui; ssrformat@bcgsc.ca Cc: Goina, Cristian; Jonathan.Manning@appliedbiosystems.com Subject: RE: [Ssrformat] SOLID SRF implementation Hi Asim, It was nice talking to you at BIO-IT. We are very close to finishing the SOLID SRF implementation and we would like to submit the source code in sourceforge, but the question is where it should be submitted as all the code for SOLID SRF is in Java. Some options are 1. Create a different tree (/cvsroot/srf/java) in existing CVS (cvsroot/srf) for Java implementation. 2. Create different sourceforge project. I don't like this personally as I would like to keep all the SRF implementation in one project. Regards, Indresh From: ssrformat-bounces@mail.bcgsc.ca [mailto:ssrformat-bounces@mail.bcgsc.ca] On Behalf Of Asim Siddiqui Sent: Friday, May 02, 2008 12:10 AM To: ctoma@broad.mit.edu; ssrformat@bcgsc.ca Subject: RE: [Ssrformat] Read header readFlags byte Camil. Thanks - I'll add text to the spec to clarify that bit 0 is the MSB. (Your suggested change to solexa2srf looks fine to me, though James owns that code.) Asim ________________________________ From: ssrformat-bounces@mail.bcgsc.ca on behalf of Camil Toma Sent: Thu 01/05/2008 2:58 PM To: ssrformat@bcgsc.ca Subject: [Ssrformat] Read header readFlags byte Let me know if this sounds reasonable. The motivation for this is that the Broad Institute wants to submit 1000 genomes SRFs containing all solexa reads regardless of chastity filter, whereas the SRA requires that if we submit bad reads they should at least marked as such. Not sure if any of the other centers are interested in this functionality or if they just filter out the reads. 
With regards to the readFlags byte in the read header, the SRF spec states: Field Description ------------------------------------------------------------------------ -------- readFlags Each bit may be set to indicate a status for the read. The bits are set to 1 if the flag is true. Bit 1: Is the read bad? Bit 2: Has the read been withdrawn? Bits 3-5: reserved Bits 6-8: user definable It would be more clear if the actual bit numbering convention used is specified. I suggest most significant bit 0 numbering. Do we want to change this to say something like this? Field Description ------------------------------------------------------------------------ -------- readFlags Each bit may be set to indicate a status for the read. The bits are set to 1 if the flag is true using "MSB 0" numbering: Bit 0: Is the read bad? Bit 1: Has the read been withdrawn? Bits 2-4: reserved Bits 5-7: user definable For example, if we want to mark a read as bad, the bits will be 1000 0000. I'd like to add some functionality to the solexa2srf program to support creating SRF files from solexa data where the bad reads are included in the SRF and the readFlags byte is set to the correct value to specify this. Currently the solexa2srf only supports filtering of reads based on the chastity value, which is specified via the "-c float" option. I'd like to add a "-C float" option to mean "Don't filter reads that fail the chastity test, mark those reads as bad". -c and -C will be exclusive options. I'd also like to add -C to the srf2solexa program to allow for filtering of reads on decompression back to native solexa. So, for example, one could create an SRF with the option "-C 0.60", resulting in an SRF that will contain all solexa reads, only that those that fall below the 0.60 chastity threshold will be marked as bad in their read header. One could then invoke: 1) srf2solexa my_data.srf This will cause all the reads to be uncompressed, regardless of the readFlags setting. 
2) srf2solexa -C my_data.srf This will only uncompress those reads that are marked good. That is, bit 0 is set to value 0. Thanks, -camil _______________________________________________ Ssrformat mailing list Ssrformat@mail.bcgsc.ca http://www.bcgsc.ca/mailman/listinfo/ssrformat -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080502/a0aeccfb/attachment-0001.htm From asims at bcgsc.ca Fri May 2 14:27:24 2008 From: asims at bcgsc.ca (Asim Siddiqui) Date: Fri May 2 14:31:52 2008 Subject: [Ssrformat] SOLID SRF implementation References: <3EC2264E4927524DB3F1C0425831CCA5113831@EXCHANGE.TIGR.ORG> <86C6E520C12E52429ACBCB01546DF4D3016797C0@xchange1.phage.bcgsc.ca> <000601c8a758$eecb2330$cc616990$@com> <86C6E520C12E52429ACBCB01546DF4D3016797C2@xchange1.phage.bcgsc.ca><002d01c8a7bd$13161150$394233f0$@com><481A3D16.1050201@broad.mit.edu> <86C6E520C12E52429ACBCB01546DF4D3016797CC@xchange1.phage.bcgsc.ca> <86C6E520C12E52429ACBCB01546DF4D3016797CD@xchange1.phage.bcgsc.ca> <3EC2264E4927524DB3F1C0425831CCA5113848@EXCHANGE.TIGR.ORG> Message-ID: <86C6E520C12E52429ACBCB01546DF4D3016797D6@xchange1.phage.bcgsc.ca> I prefer the latter option as well, so let's go with that. The top level directories are modules, so we'll need to create two new modules (C++ and java) and deprecate/move/delete the old one. I am revising the c++ code as well and hope to have something up on the site around Wednesday. BTW the OS license for code is the APACHE 2.0 license. Asim ________________________________ From: Goina, Cristian [mailto:cgoina@jcvi.org] Sent: Fri 02/05/2008 2:02 PM To: Singh, Indresh; Asim Siddiqui; ssrformat@bcgsc.ca Cc: Jonathan.Manning@appliedbiosystems.com Subject: RE: [Ssrformat] SOLID SRF implementation Asim, I've been looking at the SRF sourceforge project site and trying to decide where to put this code. 
Currently there is / -srf |-docs |-src |-test I was thinking to add my java implementation at the same level with srf and then later on docs could be moved at the same level as well / -srf |-docs |-src |-test -srf-java |-src |-lib |-testdata or under srf we can create a folder for c++ and move everything that currently is in srf in the c++ folder except the docs and the testdata. The testdata can also contain folders with samples specific to different sequencers such as solexa, 454, solid / -srf |-c++ |-src |-test |-java |-src |-lib |-fortran (when an implementation is available - actually I think I found a fortran implementation already somewhere on the web) |-perl (when an implementation is available) |-python (when an implementation is available) |-docs |-testdata |-454-samples |-solexa-samples |-solid-samples My preference is for the latter, but I can live with either one. Or if you have a different idea just let me know where should we put this code Thanks Cristian ________________________________ From: Singh, Indresh Sent: Friday, May 02, 2008 12:19 PM To: Asim Siddiqui; ssrformat@bcgsc.ca Cc: Goina, Cristian; Jonathan.Manning@appliedbiosystems.com Subject: RE: [Ssrformat] SOLID SRF implementation Christian is working on SOLID SRF so please add him on the developers list. He is formalizing the SOLID SRF document and he should have that soon for review. Regards, Indresh From: Asim Siddiqui [mailto:asims@bcgsc.ca] Sent: Friday, May 02, 2008 12:14 PM To: Singh, Indresh; ssrformat@bcgsc.ca Cc: Goina, Cristian; Jonathan.Manning@appliedbiosystems.com Subject: RE: [Ssrformat] SOLID SRF implementation Hi Indresh, I think it makes sense to create a different set of directories in the existing project rather than start a new project as well. I can provide you with developer access to the repositories. One request: can you provide a description of your use of ZTR blocks for inclusion in the spec? James (or Come), if I cold get the same for Illumina that would be great. 
Asim

________________________________

From: Singh, Indresh [mailto:ISingh@jcvi.org]
Sent: Fri 02/05/2008 8:23 AM
To: Asim Siddiqui; ssrformat@bcgsc.ca
Cc: Goina, Cristian; Jonathan.Manning@appliedbiosystems.com
Subject: RE: [Ssrformat] SOLID SRF implementation

Hi Asim,

It was nice talking to you at BIO-IT. We are very close to finishing the SOLID SRF implementation and we would like to submit the source code in sourceforge, but the question is where it should be submitted as all the code for SOLID SRF is in Java. Some options are

1. Create a different tree (/cvsroot/srf/java) in existing CVS (cvsroot/srf) for Java implementation.
2. Create different sourceforge project. I don't like this personally as I would like to keep all the SRF implementation in one project.

Regards,
Indresh

_______________________________________________
Ssrformat mailing list
Ssrformat@mail.bcgsc.ca
http://www.bcgsc.ca/mailman/listinfo/ssrformat
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080502/e6054d8a/attachment-0001.htm

From cgoina at jcvi.org Fri May 2 14:58:43 2008
From: cgoina at jcvi.org (Goina, Cristian)
Date: Fri May 2 14:58:53 2008
Subject: [Ssrformat] SOLID SRF implementation
In-Reply-To: <86C6E520C12E52429ACBCB01546DF4D3016797D6@xchange1.phage.bcgsc.ca>
References: <3EC2264E4927524DB3F1C0425831CCA5113831@EXCHANGE.TIGR.ORG> <86C6E520C12E52429ACBCB01546DF4D3016797C0@xchange1.phage.bcgsc.ca> <000601c8a758$eecb2330$cc616990$@com> <86C6E520C12E52429ACBCB01546DF4D3016797C2@xchange1.phage.bcgsc.ca> <002d01c8a7bd$13161150$394233f0$@com> <481A3D16.1050201@broad.mit.edu> <86C6E520C12E52429ACBCB01546DF4D3016797CC@xchange1.phage.bcgsc.ca> <86C6E520C12E52429ACBCB01546DF4D3016797CD@xchange1.phage.bcgsc.ca> <3EC2264E4927524DB3F1C0425831CCA5113848@EXCHANGE.TIGR.ORG> <86C6E520C12E52429ACBCB01546DF4D3016797D6@xchange1.phage.bcgsc.ca>
Message-ID: <3EC2264E4927524DB3F1C0425831CCA511384C@EXCHANGE.TIGR.ORG>

If the license is APACHE I think you should be fine.

--cg

________________________________

From: Asim Siddiqui [mailto:asims@bcgsc.ca]
Sent: Friday, May 02, 2008 5:27 PM
To: Goina, Cristian; Singh, Indresh; ssrformat@bcgsc.ca
Cc: Jonathan.Manning@appliedbiosystems.com
Subject: RE: [Ssrformat] SOLID SRF implementation

I prefer the latter option as well, so let's go with that. The top level directories are modules, so we'll need to create two new modules (C++ and java) and deprecate/move/delete the old one. I am revising the c++ code as well and hope to have something up on the site around Wednesday.
BTW the OS license for code is the APACHE 2.0 license.

Asim

________________________________

From: Goina, Cristian [mailto:cgoina@jcvi.org]
Sent: Fri 02/05/2008 2:02 PM
To: Singh, Indresh; Asim Siddiqui; ssrformat@bcgsc.ca
Cc: Jonathan.Manning@appliedbiosystems.com
Subject: RE: [Ssrformat] SOLID SRF implementation

Asim,

I've been looking at the SRF sourceforge project site and trying to decide where to put this code. Currently there is

/
-srf
  |-docs
  |-src
  |-test

I was thinking to add my java implementation at the same level as srf, and then later on docs could be moved to the same level as well

/
-srf
  |-docs
  |-src
  |-test
-srf-java
  |-src
  |-lib
  |-testdata

or under srf we can create a folder for c++ and move everything that currently is in srf into the c++ folder, except the docs and the testdata. The testdata can also contain folders with samples specific to different sequencers such as solexa, 454, solid

/
-srf
  |-c++
    |-src
    |-test
  |-java
    |-src
    |-lib
  |-fortran (when an implementation is available - actually I think I found a fortran implementation already somewhere on the web)
  |-perl (when an implementation is available)
  |-python (when an implementation is available)
  |-docs
  |-testdata
    |-454-samples
    |-solexa-samples
    |-solid-samples

My preference is for the latter, but I can live with either one. Or if you have a different idea, just let me know where we should put this code.

Thanks
Cristian

________________________________

From: Singh, Indresh
Sent: Friday, May 02, 2008 12:19 PM
To: Asim Siddiqui; ssrformat@bcgsc.ca
Cc: Goina, Cristian; Jonathan.Manning@appliedbiosystems.com
Subject: RE: [Ssrformat] SOLID SRF implementation

Cristian is working on SOLID SRF so please add him to the developers list. He is formalizing the SOLID SRF document and he should have that soon for review.

Regards,
Indresh

From: Asim Siddiqui [mailto:asims@bcgsc.ca]
Sent: Friday, May 02, 2008 12:14 PM
To: Singh, Indresh; ssrformat@bcgsc.ca
Cc: Goina, Cristian; Jonathan.Manning@appliedbiosystems.com
Subject: RE: [Ssrformat] SOLID SRF implementation

Hi Indresh,

I think it makes sense to create a different set of directories in the existing project rather than start a new project as well. I can provide you with developer access to the repositories.

One request: can you provide a description of your use of ZTR blocks for inclusion in the spec? James (or Come), if I could get the same for Illumina that would be great.

Asim

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080502/3ccfe2e5/attachment-0001.htm

From asims at bcgsc.ca Wed May 7 06:02:11 2008
From: asims at bcgsc.ca (Asim Siddiqui)
Date: Wed May 7 06:02:40 2008
Subject: [Ssrformat] SRF/SRA meeing at CSHL
Message-ID: <86C6E520C12E52429ACBCB01546DF4D3016797E5@xchange1.phage.bcgsc.ca>

We will meet at 1pm Thursday outside of the auditorium. I'm not expecting a lot of people so it should be easy to find a quiet spot to go from there.

Asim

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080507/74e54294/attachment.htm

From rodarmer at ncbi.nlm.nih.gov Wed May 7 06:25:50 2008
From: rodarmer at ncbi.nlm.nih.gov (Kurt Rodarmer)
Date: Wed May 7 06:26:01 2008
Subject: [Ssrformat] SRF/SRA meeing at CSHL
In-Reply-To: <86C6E520C12E52429ACBCB01546DF4D3016797E5@xchange1.phage.bcgsc.ca>
References: <86C6E520C12E52429ACBCB01546DF4D3016797E5@xchange1.phage.bcgsc.ca>
Message-ID:

See you there!

Kurt

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080507/f6d66328/attachment.htm

From aleksey at ncbi.nlm.nih.gov Wed May 7 08:56:53 2008
From: aleksey at ncbi.nlm.nih.gov (Alekseyev, Vladimir (NIH/NLM/NCBI) [C])
Date: Wed May 7 08:57:03 2008
Subject: [Ssrformat] SRF/SRA meeing at CSHL
In-Reply-To: <86C6E520C12E52429ACBCB01546DF4D3016797E5@xchange1.phage.bcgsc.ca>
References: <86C6E520C12E52429ACBCB01546DF4D3016797E5@xchange1.phage.bcgsc.ca>
Message-ID: <7B6F170840CA6C4DA63EE0C8A7BB43ECCFBDAE@NIHCESMLBX15.nih.gov>

Hi Asim,

Sounds great. We'll see you there.

Vladimir

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080507/1b65fa85/attachment.htm

From jkb at sanger.ac.uk Thu May 8 10:27:50 2008
From: jkb at sanger.ac.uk (James Bonfield)
Date: Thu May 8 10:28:04 2008
Subject: [Ssrformat] SRF change proposal: contaminant bit marker
Message-ID: <20080508172749.GI4577@sanger.ac.uk>

Hello all,

There's been some local (Sanger) discussions regarding what to do with data that appears to be contaminants. The issue being that we'd like to mark a trace as being a likely contaminant, but we may not want to reject it completely.

Do we produce a second SRF file or do we just mark them as poor? However, reusing the "Is the read bad?" bit isn't ideal, as being a contaminant is an orthogonal feature (we can have good and bad quality contaminant reads).

Initially I was for simply rejecting data that's labelled as contamination, but there are strong arguments for maybe keeping this for now. One issue is that filtering by low quality sometimes throws away data that is real and matches, so we have an error rate in the useful vs unuseful filtering. However that error is randomly distributed and has the impact of evenly reducing the depth by a small degree. Classifying a sequence as real or contaminant also has an error rate, but it's highly systematic and will have a dramatic impact on the depth in some regions and zero impact elsewhere. Hence throwing the data away may not be wise.

The issue of whether or not we actually submit data labelled as contaminant is another topic really. There are ethical cases to handle in the example of a malaria sample containing some human DNA contamination. However, for local alignments and storage, holding a single bit to mark a sequence as a contaminant is reasonable.

So the proposal:

Add "Bit 3: sequence is a contaminant" to the readFlags description in section 6.5.1.

Comments?

James

PS. I realise there's nothing in the SRF description that handles multiple reference sequences and any of the alignment issues, so a more generic extension is to indicate which ref. sequence this fragment appears to align against. However that gets too far down the alignment format discussion and is something best left for elsewhere. The proposal is simply to have another 'error' flag which may or may not be used to filter data in some way.

--
James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
                                  | Plurima gyrabant gymbolitare vabo;
  A Staden Package developer:     | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi.

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

From asims at bcgsc.ca Fri May 9 06:46:10 2008
From: asims at bcgsc.ca (Asim Siddiqui)
Date: Fri May 9 06:46:23 2008
Subject: [Ssrformat] SRF change proposal: contaminant bit marker
References: <20080508172749.GI4577@sanger.ac.uk>
Message-ID: <86C6E520C12E52429ACBCB01546DF4D3016797E6@xchange1.phage.bcgsc.ca>

James,

The proposal is a valid use of the readFlags. I think it is ok not to report why the read was marked as a contaminant in the same way that we don't indicate why a read is marked as bad. The decision is up to the data generator.

In the absence of another opinion, I'll add this to the spec.

Asim

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080509/81cb7b5f/attachment.htm

From shumwaym at ncbi.nlm.nih.gov Fri May 9 12:10:23 2008
From: shumwaym at ncbi.nlm.nih.gov (Shumway, Martin (NIH/NLM/NCBI) [E])
Date: Fri May 9 12:10:35 2008
Subject: [Trace] [Ssrformat] SRF change proposal: contaminant bit marker
In-Reply-To: <20080508172749.GI4577@sanger.ac.uk>
References: <20080508172749.GI4577@sanger.ac.uk>
Message-ID:

I agree with everything said here, and just think of all the scientific discoveries made from the trash bin.

I would like to propose a more extensive readFlags definition in such a way that Centers can add considerable value to their datasets, without preventing the wholesale archival of run data (except in the case where the entire lane or run is scrapped due to lot failure), and that permits retrieval and reprocessing by future users.

I suggest the following "quality classes", perhaps others might suggest additional or different ones:

bit 1 : process trash (the Center determined that the spot is not of high quality and should not be used in applications). I think all process failure detection such as purity and chastity filtering could be represented by this flag, as these will shift from platform to platform and there is not a lot to be gained by being more specific here.
bit 2 : tracking trash (the read should not have been included in this dataset)
bit 3 : contaminant trash
bit 4 : duplicate trash
bit 5 : failed mate (the spot did not produce a valid pair of mated reads and should be treated instead as a fragment)
bit 6 : short (the read's good quality range is too short to be alignable)
bit 7 : illegible (the spot's bar code sequence could not be determined)
bit 8 : control or calibration read

Thus most users will elect to not use a read if readFlags > 0.

The idea is to allow for the filtering of reads to eliminate unusable and non-random data from the dataset (such that this can be done from analyzing the sequencing data itself). In the SRA, we'd like to be able to store readFlags as a "column" that can be utilized in filtering operations; by default the Center prescribed filtering would be applied to the dataset.

martin
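The readFlags proposals in this thread can be illustrated with a minimal Python sketch. It assumes the spec's 1-based "Bit 1" is the most significant bit of the byte, which is how the earlier clarification (bit 0 is the MSB, a bad read is 1000 0000) reads; under that assumption the proposed "Bit 3" contaminant flag would be 0010 0000. The function and constant names are hypothetical and not taken from any SRF library.

```python
# Hypothetical helper names; only the bit meanings come from the thread.

def flag_mask(spec_bit):
    """Mask for a 1-based spec bit, assuming spec bit 1 is the most significant bit."""
    if not 1 <= spec_bit <= 8:
        raise ValueError("readFlags is a single byte: spec bits run from 1 to 8")
    return 0x80 >> (spec_bit - 1)


BAD = flag_mask(1)          # 0b1000_0000, "Is the read bad?"
WITHDRAWN = flag_mask(2)    # 0b0100_0000, "Has the read been withdrawn?"
CONTAMINANT = flag_mask(3)  # 0b0010_0000, the proposed contaminant bit


def is_flagged(read_flags, mask):
    """True if the given flag bit is set in the readFlags byte."""
    return (read_flags & mask) != 0


def is_usable(read_flags):
    """Default filter suggested in the thread: skip a read if readFlags > 0."""
    return read_flags == 0
```

Because each flag is an independent bit, a read can be marked as a contaminant without also being marked bad, which is the orthogonality argued for in the proposal.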
_______________________________________________
Trace mailing list
Trace@ncbi.nlm.nih.gov
http://www.ncbi.nlm.nih.gov/mailman/listinfo/trace

From asims at bcgsc.ca Sat May 10 12:09:45 2008
From: asims at bcgsc.ca (Asim Siddiqui)
Date: Sat May 10 12:10:00 2008
Subject: [Trace] [Ssrformat] SRF change proposal: contaminant bit marker
References: <20080508172749.GI4577@sanger.ac.uk>
Message-ID: <86C6E520C12E52429ACBCB01546DF4D3016797E7@xchange1.phage.bcgsc.ca>

Hi Martin,

I think this is a good idea.

I'd like to keep the use of the existing readFlags limited to uses that apply to all reads, and bits 1-3 and 8 fall into this category. Bits 5 and 7 are application specific and we should handle those in a different manner.

Bit 6 has a meaning that will change as the reference is changed (arguably this is true of bit 3 as well), so is the submitter required to provide information on the applied filtering? For bit 6, I would suggest that all reads are submitted and their alignability should be determined downstream.

Could you expand on what you mean by bit 4, duplicate read?

Thanks,

Asim

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080510/543b9c37/attachment.htm

From aleksey at ncbi.nlm.nih.gov Mon May 12 11:15:28 2008
From: aleksey at ncbi.nlm.nih.gov (Alekseyev, Vladimir (NIH/NLM/NCBI) [C])
Date: Mon May 12 11:15:39 2008
Subject: [Ssrformat] NCBI Short read format
Message-ID: <7B6F170840CA6C4DA63EE0C8A7BB43ECECD750@NIHCESMLBX15.nih.gov>

Last Thursday, May 8, we presented the NCBI Short Read Format at the CSHL meeting. Many thanks to Asim for his gracious invitation and for organizing this. There were representatives from ABI, BCGSC, EBI, WUSTL. We presented the salient details of the format, focusing on the items below, applications and a road map for near term development.

The presentation and accompanying documents are available at:

http://www.valexllc.com/d/Short_Read_Archive_Format.pdf
http://www.valexllc.com/d/Short_Read_Archive_solutions.pps

We are preparing the initial public release of NCBI Short Read Format, including the open format description, a C API, and supporting tools for conversion between existing public formats. As explained in the presentation, this near term release is NCBI's currently operating internal standard generalized for external environments.
We will keep everyone posted on our progress and expected release dates.

Here are some features of this approach:

- column based. Column is not a type. Arbitrary columns can be added upon post processing or at any time;
- structural integrity. Upon processing the unnecessary columns can be removed to minimize the storage;
- extensively indexed (direct and projection indexing);
- compact;
- fast. The records can be accessed by an integer ID or by name;
- flexible;
- accessible across multiple volumes;
- plug and play. Submissions become readily accessible upon uploads;

Thanks again,
Vladimir & Kurt & NCBI Trace team

From cgoina at jcvi.org Mon May 19 13:56:29 2008
From: cgoina at jcvi.org (Goina, Cristian)
Date: Mon May 19 15:05:44 2008
Subject: [Ssrformat] SOLID SRF implementation
In-Reply-To: <86C6E520C12E52429ACBCB01546DF4D3016797E1@xchange1.phage.bcgsc.ca>
References: <86C6E520C12E52429ACBCB01546DF4D3016797E1@xchange1.phage.bcgsc.ca>
Message-ID: <3EC2264E4927524DB3F1C0425831CCA5113879@EXCHANGE.TIGR.ORG>

Asim,

I just checked in the srf-java code on sourceforge. Under the srf directory I created an srf_java directory that contains the java implementation, which in turn has two modules - one that contains common srf code (common-srf) and one that contains solid specific code (solid-srf). The idea was that if another vendor specific java implementation is created then it could be easily added in a new module under srf_java/. Also under solid-srf/docs there is a word document that explains a little bit the usage of the current ZTR chunks for SOLiD data.

For now I left the user defined chunk for color coded sequences instead of going with the BASE chunk - I'd rather wait until we clearly specify the metadata parameter that defines the base calls encoding - and the bases are still coded using integer values 0,1,2,3 (plus 0xf for unknowns) instead of the corresponding ASCII values for '0','1','2','3','.'.
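To make the two candidate encodings concrete, here is a small Python sketch of the mapping between the integer color calls (0, 1, 2, 3, with 0xf for an unknown) and their ASCII counterparts ('0', '1', '2', '3', '.'). All names are hypothetical and are not part of the srf_java code; the last helper only counts how many bits a fixed-width code would need for a given number of distinct symbols.

```python
# Hypothetical names; the symbol values are the ones quoted in the message.
INT_TO_CHAR = {0: "0", 1: "1", 2: "2", 3: "3", 0xF: "."}
CHAR_TO_INT = {char: value for value, char in INT_TO_CHAR.items()}


def colors_to_ascii(calls):
    """Convert integer color calls to the ASCII form."""
    return "".join(INT_TO_CHAR[call] for call in calls)


def ascii_to_colors(text):
    """Convert the ASCII form back to integer color calls."""
    return [CHAR_TO_INT[char] for char in text]


def fixed_width_bits(symbol_count):
    """Bits per call needed by a fixed-width code for symbol_count distinct symbols."""
    return max(1, (symbol_count - 1).bit_length())
```

With four symbols a fixed-width code fits in two bits per call, while adding the unknown symbol makes five and pushes a fixed-width code to three bits.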
If you think that the char values should be preferred over the integers I have no problem with that, but I've been thinking about and adding the last base of the primer to the sequence may actually increase the size of the chunk. I implemented Huffman compression for basecall chunks instead of using only a nibble per basecall and the more possible codes we add for the chunk bytes the less efficient the compression. I really don't see how only two bits per base call could be used because it is possible to have unknown transitions in the read's sequence and if we decide we want those as well then two bits is definitely not enough to code five values (0,1,2,3,.) - and that is without storing the last base from the primer. Based on the discussions I had with Jonathan Manning I no longer use user defined ZTR chunks for scaled intensities but it should be noted that SMP4 chunks used for that are not quite orthodox either since the spec mentions that SMP4 contain 16 bit integers not 32 bit IEEE 754 floating point numbers. I'm in the process of pulling some E-coli data from our archive and once I have that I will add a directory with a small data sample (don't worry it won't be the entire run - just about 1000 reads from the run) Regards Cristian Goina ________________________________ From: Asim Siddiqui [mailto:asims@bcgsc.ca] Sent: Tuesday, May 06, 2008 10:25 AM To: Goina, Cristian Subject: RE: [Ssrformat] SOLID SRF implementation Hi Cristian, How are things going? I'm going to meet Jonanthan Manning at CSHL later this week - if there is anything that I need to discuss with him relating to SRF let me know. Asim ________________________________ From: Goina, Cristian [mailto:cgoina@jcvi.org] Sent: Fri 02/05/2008 2:58 PM To: Asim Siddiqui; Singh, Indresh; ssrformat@bcgsc.ca Cc: Jonathan.Manning@appliedbiosystems.com Subject: RE: [Ssrformat] SOLID SRF implementation If the license is APACHE I think you should be fine. 
--cg

________________________________
From: Asim Siddiqui [mailto:asims@bcgsc.ca]
Sent: Friday, May 02, 2008 5:27 PM
To: Goina, Cristian; Singh, Indresh; ssrformat@bcgsc.ca
Cc: Jonathan.Manning@appliedbiosystems.com
Subject: RE: [Ssrformat] SOLID SRF implementation

I prefer the latter option as well, so let's go with that. The top level directories are modules, so we'll need to create two new modules (C++ and java) and deprecate/move/delete the old one. I am revising the c++ code as well and hope to have something up on the site around Wednesday.

BTW the OS license for code is the APACHE 2.0 license.

Asim

________________________________
From: Goina, Cristian [mailto:cgoina@jcvi.org]
Sent: Fri 02/05/2008 2:02 PM
To: Singh, Indresh; Asim Siddiqui; ssrformat@bcgsc.ca
Cc: Jonathan.Manning@appliedbiosystems.com
Subject: RE: [Ssrformat] SOLID SRF implementation

Asim,

I've been looking at the SRF sourceforge project site and trying to decide where to put this code. Currently there is

/
-srf
  |-docs
  |-src
  |-test

I was thinking to add my java implementation at the same level with srf and then later on docs could be moved at the same level as well

/
-srf
  |-docs
  |-src
  |-test
-srf-java
  |-src
  |-lib
  |-testdata

or under srf we can create a folder for c++ and move everything that currently is in srf in the c++ folder except the docs and the testdata. The testdata can also contain folders with samples specific to different sequencers such as solexa, 454, solid

/
-srf
  |-c++
    |-src
    |-test
  |-java
    |-src
    |-lib
  |-fortran (when an implementation is available - actually I think I found a fortran implementation already somewhere on the web)
  |-perl (when an implementation is available)
  |-python (when an implementation is available)
  |-docs
  |-testdata
    |-454-samples
    |-solexa-samples
    |-solid-samples

My preference is for the latter, but I can live with either one.
Or if you have a different idea just let me know where we should put this code.

Thanks
Cristian

________________________________
From: Singh, Indresh
Sent: Friday, May 02, 2008 12:19 PM
To: Asim Siddiqui; ssrformat@bcgsc.ca
Cc: Goina, Cristian; Jonathan.Manning@appliedbiosystems.com
Subject: RE: [Ssrformat] SOLID SRF implementation

Cristian is working on SOLID SRF so please add him to the developers list. He is formalizing the SOLID SRF document and he should have that soon for review.

Regards,
Indresh

From: Asim Siddiqui [mailto:asims@bcgsc.ca]
Sent: Friday, May 02, 2008 12:14 PM
To: Singh, Indresh; ssrformat@bcgsc.ca
Cc: Goina, Cristian; Jonathan.Manning@appliedbiosystems.com
Subject: RE: [Ssrformat] SOLID SRF implementation

Hi Indresh,

I think it makes sense to create a different set of directories in the existing project rather than start a new project as well. I can provide you with developer access to the repositories.

One request: can you provide a description of your use of ZTR blocks for inclusion in the spec? James (or Come), if I could get the same for Illumina that would be great.

Asim

________________________________
From: Singh, Indresh [mailto:ISingh@jcvi.org]
Sent: Fri 02/05/2008 8:23 AM
To: Asim Siddiqui; ssrformat@bcgsc.ca
Cc: Goina, Cristian; Jonathan.Manning@appliedbiosystems.com
Subject: RE: [Ssrformat] SOLID SRF implementation

Hi Asim,

It was nice talking to you at BIO-IT. We are very close to finishing the SOLID SRF implementation and we would like to submit the source code in sourceforge, but the question is where it should be submitted, as all the code for SOLID SRF is in Java. Some options are

1. Create a different tree (/cvsroot/srf/java) in existing CVS (cvsroot/srf) for Java implementation.
2. Create different sourceforge project. I don't like this personally as I would like to keep all the SRF implementation in one project.
Regards,
Indresh

From: ssrformat-bounces@mail.bcgsc.ca [mailto:ssrformat-bounces@mail.bcgsc.ca] On Behalf Of Asim Siddiqui
Sent: Friday, May 02, 2008 12:10 AM
To: ctoma@broad.mit.edu; ssrformat@bcgsc.ca
Subject: RE: [Ssrformat] Read header readFlags byte

Camil.

Thanks - I'll add text to the spec to clarify that bit 0 is the MSB.

(Your suggested change to solexa2srf looks fine to me, though James owns that code.)

Asim

________________________________
From: ssrformat-bounces@mail.bcgsc.ca on behalf of Camil Toma
Sent: Thu 01/05/2008 2:58 PM
To: ssrformat@bcgsc.ca
Subject: [Ssrformat] Read header readFlags byte

Let me know if this sounds reasonable. The motivation for this is that the Broad Institute wants to submit 1000 genomes SRFs containing all solexa reads regardless of chastity filter, whereas the SRA requires that if we submit bad reads they should at least be marked as such. Not sure if any of the other centers are interested in this functionality or if they just filter out the reads.

With regards to the readFlags byte in the read header, the SRF spec states:

Field       Description
--------------------------------------------------------------------------------
readFlags   Each bit may be set to indicate a status for the read. The bits
            are set to 1 if the flag is true.
            Bit 1: Is the read bad?
            Bit 2: Has the read been withdrawn?
            Bits 3-5: reserved
            Bits 6-8: user definable

It would be clearer if the actual bit numbering convention used were specified. I suggest most significant bit 0 numbering. Do we want to change this to say something like this?

Field       Description
--------------------------------------------------------------------------------
readFlags   Each bit may be set to indicate a status for the read. The bits
            are set to 1 if the flag is true using "MSB 0" numbering:
            Bit 0: Is the read bad?
            Bit 1: Has the read been withdrawn?
            Bits 2-4: reserved
            Bits 5-7: user definable

For example, if we want to mark a read as bad, the bits will be 1000 0000.
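The "MSB 0" layout proposed above can be sketched as C bit masks. This is only an illustration of the numbering convention: the macro, constant, and function names below are hypothetical and are not taken from the SRF spec or from io_lib.

```c
#include <assert.h>
#include <stdint.h>

/* "MSB 0" numbering: bit n of a byte corresponds to the mask 0x80 >> n. */
#define MSB0_MASK(n) ((uint8_t)(0x80u >> (n)))

/* Illustrative names only; the spec text quoted above defines the bits,
 * not these identifiers. */
enum {
    READ_FLAG_BAD       = 0x80, /* bit 0: is the read bad?             */
    READ_FLAG_WITHDRAWN = 0x40, /* bit 1: has the read been withdrawn? */
    READ_FLAG_RESERVED  = 0x38, /* bits 2-4: reserved                  */
    READ_FLAG_USER      = 0x07  /* bits 5-7: user definable            */
};

/* Return flags with the "bad read" bit set. */
uint8_t mark_bad(uint8_t flags)
{
    return (uint8_t)(flags | READ_FLAG_BAD);
}

/* Non-zero if the "bad read" bit is set. */
int is_bad(uint8_t flags)
{
    return (flags & READ_FLAG_BAD) != 0;
}

/* Non-zero if the "withdrawn" bit is set. */
int is_withdrawn(uint8_t flags)
{
    return (flags & READ_FLAG_WITHDRAWN) != 0;
}
```

Under this convention a read marked bad and nothing else has readFlags equal to 0x80, which matches the 1000 0000 example.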
I'd like to add some functionality to the solexa2srf program to support creating SRF files from solexa data where the bad reads are included in the SRF and the readFlags byte is set to the correct value to specify this. Currently solexa2srf only supports filtering of reads based on the chastity value, which is specified via the "-c float" option. I'd like to add a "-C float" option to mean "Don't filter reads that fail the chastity test, mark those reads as bad". -c and -C will be exclusive options.

I'd also like to add -C to the srf2solexa program to allow for filtering of reads on decompression back to native solexa. So, for example, one could create an SRF with the option "-C 0.60", resulting in an SRF that will contain all solexa reads, except that those that fall below the 0.60 chastity threshold will be marked as bad in their read header. One could then invoke:

1) srf2solexa my_data.srf
   This will cause all the reads to be uncompressed, regardless of the readFlags setting.

2) srf2solexa -C my_data.srf
   This will only uncompress those reads that are marked good. That is, bit 0 is set to value 0.

Thanks,
-camil

_______________________________________________
Ssrformat mailing list
Ssrformat@mail.bcgsc.ca
http://www.bcgsc.ca/mailman/listinfo/ssrformat

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.bcgsc.ca/pipermail/ssrformat/attachments/20080519/6b249026/attachment-0001.htm

From JEmhoff at helicosbio.com Tue May 20 13:15:11 2008
From: JEmhoff at helicosbio.com (John Emhoff)
Date: Tue May 20 13:15:18 2008
Subject: [Ssrformat] iolib memory leak and patch
Message-ID: <413BC2C30D426E45A512157E7C9BFB4301267178@wildcat.HBSC.local>

James,

It looks like construct_trace_name in srf.c leaks memory each time it's called: there is no matching block_destroy to the block_create at the top. It's a pretty simple fix -- here's a patch!
-- John

--- srf.c 2008-05-20 15:59:25.000000000 -0400
+++ patched_srf.c 2008-05-20 16:01:22.000000000 -0400
@@ -789,7 +789,11 @@
     if (out_pos < name_len-1) \
         name[out_pos++] = (c); \
     else \
-        return name_len;
+    { \
+        block_destroy(blk, 1);\
+        return name_len; \
+    }
+
 
 int construct_trace_name(char *fmt, unsigned char *suffix, char *name, int name_len) {
@@ -939,7 +943,7 @@
     }
 
     emit('\0');
-
+    block_destroy(blk,1);
     return out_pos;
 }
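The patch frees the block on both exit paths (the early return inside the emit macro and the normal return at the end). The same idea can be sketched with a single cleanup label, so that any exit added later is also covered. This is a self-contained illustration under stated assumptions, not the io_lib code: scratch_t, scratch_create, scratch_destroy, copy_name, and the live_scratch counter are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical scratch buffer standing in for the block used in srf.c. */
typedef struct {
    char  *data;
    size_t len;
} scratch_t;

/* Number of scratch buffers created but not yet destroyed. */
static int live_scratch = 0;

static scratch_t *scratch_create(size_t len)
{
    scratch_t *s = malloc(sizeof *s);
    if (!s)
        return NULL;
    s->data = calloc(len ? len : 1, 1);
    if (!s->data) {
        free(s);
        return NULL;
    }
    s->len = len;
    live_scratch++;
    return s;
}

static void scratch_destroy(scratch_t *s)
{
    if (!s)
        return;
    free(s->data);
    free(s);
    live_scratch--;
}

/*
 * Copy src (including its terminator) into name via a scratch buffer.
 * Returns the number of bytes written, name_len if the output ran out
 * of room, or -1 if the scratch buffer could not be allocated.
 */
int copy_name(const char *src, char *name, int name_len)
{
    int out_pos = 0;
    int ret;
    scratch_t *s = scratch_create(strlen(src) + 1);

    if (!s)
        return -1;
    memcpy(s->data, src, s->len);

    for (size_t i = 0; i < s->len; i++) {
        if (out_pos >= name_len - 1) {
            ret = name_len;      /* early exit: still goes through cleanup */
            goto cleanup;
        }
        name[out_pos++] = s->data[i];
    }
    ret = out_pos;

cleanup:
    scratch_destroy(s);          /* the only place the buffer is released */
    return ret;
}
```

Whether a single cleanup label or a destroy call at each return is preferable is a style choice; the patch above takes the second route, which keeps the change small.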