From james.knight at roche.com Thu Aug 7 10:50:40 2008
From: james.knight at roche.com (Knight, James)
Date: Thu Aug 7 10:51:14 2008
Subject: [Ssrformat] Questions in prep for SFF conversion
Message-ID: <211F72AAC0FAAA4D8C4C89FF326D43F599F57D@rnumsem04.nala.roche.com>

I'm beginning the definition and implementation of the 454 SRF format and SFF-SRF conversion software, and I have several questions right now.

First, I'm looking through the two specs (SRF and ZTR) and the srf2illumina/srf2fastq code, and I can't figure out where you explicitly state the type of each read (454, Illumina, ABI, ...) when you are encoding the data block headers and data blocks. The code for srf2solexa.c and srf2fastq.c seems to assume that the SRF file it is given contains only those reads. The only part of the spec that mentions read types is the reservation of uniqueIdPrefix values for the various vendors, but these accession-related fields can be rewritten. If, for example, I'm getting some SRF file from the Broad Institute, where they have hidden all of the vendor-specific accnos, how do I tell what type or types of reads are in the file? What am I missing?

Taking that one step further, what are the guidelines for how to encode the read info into the file? Do the ZTR chunk types define what the data is going to be, and are they the only modes for storing read info (i.e., for all vendors, BASE chunks must be where the sequence is kept, BPOS chunks are the base-to-sample/flow indexes, ...), or is it more up to me to define the mapping between chunks and data fields?

Second, James, does the addition of new data format types to the ZTR code result in a major release change to the ZTR code? As good as generic compression schemes are, I'm more of a proponent of content-aware compression, and would like to use some of the techniques that were in the SRA documentation and/or that I've thought of in the encoding of the flowgram data.
Specifically, I'd like to include data formats that handle (1) limited precision floating point to integer conversions and (2) integer to X bit conversions (i.e., storing each value in X bits, with an overflow bit value to signal higher integer values, so if you have an 8 bit encoding, values 0 to 254 are stored as one 8 bit value, values 255 to 509 are stored as two 8 bit values (255 plus 0-254), and so on). If those two data formats were part of the spec, and the formats were combinable (which I thought you had said they were), then, for example, the flow-to-base values could be encoded using first a delta format converter, then the integer to 2 bit converter, and the flowgram values could be encoded using the limited precision converter followed (I was thinking) by an integer to 8 bit converter (and then we could throw on a Huffman converter, if it helps). I'll be following up in the coming days with the definitions and code.

The 454 read SRF entries will contain the same information as the SFF files (unless there are objections), plus the region-based definitions of the reads (the converter code will have the tools to find and mark the key, 3' sequencing primer, paired-end linker, the two halves of paired-end reads, any multiplex IDs, ...).

Jim

From james.knight at roche.com Mon Aug 11 11:28:13 2008
From: james.knight at roche.com (Knight, James)
Date: Mon Sep 1 17:35:20 2008
Subject: [Ssrformat] 454 Read Information
Message-ID: <211F72AAC0FAAA4D8C4C89FF326D43F59F72F0@rnumsem04.nala.roche.com>

This message lists the fields of information currently generated and/or stored with 454 reads, which are what the SRF entries for 454 reads are currently going to support. I've also noted whether each field will, if possible, be stored in the data block header.
But I've not included any encoding information yet (i.e., where or how the data will be stored in the entry), because it is really not clear how to specify that an SRF read entry is a 454 read or how to set up the mapping between my fields and the eventual ZTR chunks (so that someone trying to read a file I create will know what is what and how to read it).

In terms of grouping together reads, the plan is to use plate regions as the major grouping of reads, meaning that the reads from a region will be grouped together by the conversion software into a data block header and a set of data blocks. The fields common to a region will, if possible, be placed in the data block header, so that only the read-specific fields are stored in each data block. The software will allow for the writing of a file containing information from multiple regions (just as the sfftools package allows for SFF files), and if this occurs, each region's reads will get their own data block header.

The fields that may be present for a read are the following (fields marked as "virtual" are not separately stored, but computed from some other stored field):

* Accession: string containing the read's accession [virtual if the 454 standard accessions are used and a UniqueIdPrefix/readId pair is found, but stored as the readId if the accessions do not follow the 454 standard naming convention]
    o Read-specific if stored
* UniqueIdPrefix: string containing the run and region portion of the 454 standard accessions, if standard accessions are present
    o Common to a region
* ReadId: string containing the read-specific suffix of the 454 standard accessions, or the complete read accession if the accessions are not 454 standard
    o Read-specific
* RunTime: timestamp of the run [virtual, computed from the 454 standard accession, if present]
* Region: the region of the plate [virtual, computed from the 454 standard accession, if present]
    o Question: Should this value be explicitly stored now?
* X: the pixel X position of the read [virtual, computed from the 454 standard accession, if present]
* Y: the pixel Y position of the read [virtual, computed from the 454 standard accession, if present]
* Basecaller: string containing the basecaller used to generate the sequences
    o Common to a region (but not necessarily the whole file)
* Version: string listing the software version used to generate the sequences
    o Common to a region (but not necessarily the whole file)
* RunName: string containing the R_... folder name for the run
    o Common to a region, present if the SFF manifest is used
* AnalysisName: string containing the D_... folder name for the analysis directory
    o Common to a region, present if the SFF manifest is used
* Path: string containing the path to the analysis directory when the run was processed through the 454 image/signal processing software
    o Common to a region, present if the SFF manifest is used
* FlowOrder: string containing the nucleotide order of the flows
    o Common to a region
* NumFlows: integer containing the number of flows in each flowgram
    o Common to a region
* Flowgram: the read's flow signal values
    o Read-specific
* Basecalls: the read's basecalls
    o Read-specific
* QualityScores: the read's quality scores
    o Read-specific
* FlowIndex: the base-to-flow mapping of which flow was used to call each base
    o Read-specific
* QualityClip: clip points marking where the read is clipped for quality
    o Read-specific
* Sub-Read Regions: list of sub-read identifiers and read regions for the sections of a read
    o The regions are read-specific, but use a common template containing the following possible sections:
        * Key: the initial 4 base key for the read
        * MID5: a 5' barcode (or multiplex ID (MID))
        * Insert: if the read is not a paired-end read, the region of the "real" read
        * Left: if the read is a paired-end read, the left half of the pair
        * Linker: the paired-end linker, for paired-end reads
        * Right: if the read is a paired-end read, the right half of the
          pair
        * MID3: a 3' barcode (or multiplex ID (MID))
        * Primer: the 3' sequencing primer
* BadReadFlag: readFlag set if the read is "bad"
    o This will likely come into use, but the first version will likely not use it
* ContaminantFlag: readFlag set if the read is a contaminant
    o This will be set for all control fragment reads (if they appear in an SRF file), plus all contaminants of paired-end reads (i.e., pUC19 vector reads)

In addition to these core fields, I remember part of the conversation involving storing in the file the "artificial" sequences used to identify linkers, primers and such, so there should be an extra set of fields for the following:

* Keys used for the region
* 3' sequencing primers used for the region
* Paired-end linkers used for the region
* 5' barcodes used for the region
* 3' barcodes used for the region

Is there any information that should be included that I missed?

The next email will start laying out the details of how this will be encoded into the SRF file.

Jim
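
[Editor's sketch] The "integer to X bit" conversion proposed in the 7 Aug message (values 0 to 254 stored as one 8 bit value, 255 to 509 as 255 plus 0-254, and so on) can be illustrated roughly as below, together with the delta step suggested for the flow-to-base values. This is only a minimal sketch under stated assumptions: the function names (delta, undelta, encode_overflow, decode_overflow) are hypothetical, the symbols are kept as a Python list rather than packed into bits, and none of this is taken from the ZTR or SRF code.

```python
def delta(values):
    """Replace each value with its difference from the previous one (first vs. 0)."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def undelta(deltas):
    """Inverse of delta(): running sum of the differences."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def encode_overflow(values, bits=8):
    """Emit X-bit symbols; the all-ones symbol means 'add this and keep reading'."""
    marker = (1 << bits) - 1          # 255 for 8 bits, 3 for 2 bits
    out = []
    for v in values:
        if v < 0:
            raise ValueError("only non-negative integers can be encoded")
        while v >= marker:            # each marker symbol accounts for `marker` units
            out.append(marker)
            v -= marker
        out.append(v)                 # final symbol is always in 0 .. marker-1
    return out


def decode_overflow(symbols, bits=8):
    """Inverse of encode_overflow(): sum symbols until a non-marker symbol ends the value."""
    marker = (1 << bits) - 1
    values, acc, pending = [], 0, False
    for s in symbols:
        acc += s
        if s == marker:
            pending = True            # value continues in the next symbol
        else:
            values.append(acc)
            acc, pending = 0, False
    if pending:
        raise ValueError("symbol stream ends in the middle of a value")
    return values


if __name__ == "__main__":
    # Hypothetical flow indexes for five bases (illustrative data only).
    flow_index = [1, 3, 3, 6, 10]
    symbols = encode_overflow(delta(flow_index), bits=2)
    print(symbols)                    # [1, 2, 0, 3, 0, 3, 1]
    assert undelta(decode_overflow(symbols, bits=2)) == flow_index
```

Because the deltas of a non-decreasing flow index are small and non-negative, most of them fit in a single 2 bit symbol, which is the case the proposal appears to be aiming at; a real converter would still have to pack the symbols into bytes and record the bit width in the chunk.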