[Ssrformat] 454 Read Information
Knight, James
james.knight at roche.com
Mon Aug 11 11:28:13 PDT 2008
The following message lists the fields of information currently
generated and/or stored with 454 reads; these are the fields that the
SRF entries for 454 reads are currently going to support. I've also
noted whether each field will, if possible, be stored in the data block
header. However, I've not included any encoding information yet (i.e.,
where or how the data will be stored in the entry), because it is not
yet clear how to specify that an SRF read entry is a 454 read, or how to
set up the mapping between my fields and the eventual ZTR chunks (so
that someone trying to read a file I create will know what is what and
how to read it).
In terms of grouping together reads, the plan is to use plate regions as
the major grouping of reads, meaning that the reads from a region will
be grouped together by the conversion software into a data block header
and set of data blocks. The fields common to a region will, if
possible, be placed in the data block header, so that only the
read-specific fields are stored in each data block. The software will
allow for the writing of a file containing reads from multiple regions
(just as the sfftools package allows for SFF files), and if this occurs,
each region's reads will get their own data block header.
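To make the grouping concrete, here is a minimal sketch of how reads might be collected into one header plus a set of per-read blocks for each region. This is not the conversion software: the RegionBlock container, the group_by_region function and the dict-based read records are hypothetical, and only the field names come from the list in this message. A field is moved into the header only when its value matches the header's; otherwise it stays with the read.

```python
from dataclasses import dataclass, field

# Fields this message describes as "common to a region".
# The list below is a subset chosen for illustration.
REGION_FIELDS = ("UniqueIdPrefix", "Basecaller", "Version", "FlowOrder", "NumFlows")

_MISSING = object()  # sentinel: "this field is not in the header"


@dataclass
class RegionBlock:
    """Hypothetical container: one data block header plus its data blocks."""
    header: dict                                # fields shared by the region
    reads: list = field(default_factory=list)   # read-specific fields only


def group_by_region(reads):
    """Group read dicts by UniqueIdPrefix, hoisting shared fields into a header."""
    blocks = {}
    for read in reads:
        key = read["UniqueIdPrefix"]
        if key not in blocks:
            # The first read of a region seeds the header.
            header = {k: read[k] for k in REGION_FIELDS if k in read}
            blocks[key] = RegionBlock(header=header)
        block = blocks[key]
        # Keep a field with the read unless the header already holds the same value.
        block.reads.append(
            {k: v for k, v in read.items() if block.header.get(k, _MISSING) != v}
        )
    return list(blocks.values())
```

With this layout, a file holding two regions would simply yield two RegionBlock objects, each with its own header.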
The fields that may be present for a read are the following (including
fields marked as "virtual", meaning that they are not stored separately
but are computed from some other stored field):
* Accession: string containing the read's accession [virtual
if the 454 standard accessions are used and a UniqueIdPrefix/ReadId pair
is found, but stored as the ReadId if the accessions do not follow the
454 standard naming convention]
o Read-specific if stored
* UniqueIdPrefix: string containing the run and region portion
of the 454 standard accessions, if standard accessions are present
o Common to a region
* ReadId: string containing the read-specific suffix of the 454
standard accessions, or the complete read accession if the accessions
are not 454 standard
o Read-specific
* RunTime: timestamp of the run [virtual, computed from the 454
standard accession, if present]
* Region: the region of the plate [virtual, computed from the
454 standard accession, if present]
o Question: Should this value be explicitly stored now?
* X: the pixel X position of the read [virtual, computed from
the 454 standard accession, if present]
* Y: the pixel Y position of the read [virtual, computed from
the 454 standard accession, if present]
* Basecaller: string containing the basecaller used to generate
the sequences
o Common to a region (but not necessarily the whole file)
* Version: string listing the software version used to generate
the sequences
o Common to a region (but not necessarily the whole file)
* RunName: string containing the R_... folder name for the run
o Common to a region, present if the SFF manifest is used
* AnalysisName: string containing the D_... folder name for the
analysis directory
o Common to a region, present if the SFF manifest is used
* Path: string containing the path to the analysis directory
when the run was processed through the 454 image/signal processing
software
o Common to a region, present if the SFF manifest is used
* FlowOrder: string containing the nucleotide order of the
flows
o Common to a region
* NumFlows: integer containing the number of flows in each
flowgram
o Common to a region
* Flowgram: the read's flow signal values
o Read-specific
* Basecalls: the read's basecalls
o Read-specific
* QualityScores: the read's quality scores
o Read-specific
* FlowIndex: the base-to-flow mapping of which flow was used to
call each base
o Read-specific
* QualityClip: clip points marking where the read is clipped for
quality
o Read-specific
* Sub-Read Regions: list of sub-read identifiers and read
regions for the sections of a read
o The regions are read-specific, but follow a common template containing
the following possible sections:
* Key: the initial 4 base key for the read
* MID5: a 5' barcode (or multiplex ID (MID))
* Insert: if the read is not a paired-end read, the region of
the "real" read
* Left: if the read is a paired-end read, the left half of the
pair
* Linker: the paired-end linker, for paired-end reads
* Right: if the read is a paired-end read, the right half of
the pair
* MID3: a 3' barcode (or multiplex ID (MID))
* Primer: the 3' sequencing primer
* BadReadFlag: readFlag set if the read is "bad"
o This will likely come into use, but the first version will
likely not use it
* ContaminantFlag: readFlag set if the read is a contaminant
o This will be set for all control fragment reads (if they appear
in an SRF file), plus all contaminants of paired-end reads (i.e., pUC19
vector reads)
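As a rough illustration of the "virtual" fields in the list above, the sketch below splits an accession into a UniqueIdPrefix/ReadId pair and recovers Region, X and Y from it. The accession layout is an assumption, not something this message specifies: a 14-character name made of a 6-character timestamp, 1 hash character, a 2-digit region and a 5-character plate location, with base-36 digits ordered A-Z then 0-9 and the location encoded as x * 4096 + y. All function and class names are hypothetical, and RunTime decoding is left out.

```python
from dataclasses import dataclass

# ASSUMPTION: base-36 digit order is A-Z followed by 0-9.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
ACCESSION_LEN = 14   # ASSUMPTION: 6 timestamp + 1 hash + 2 region + 5 location
PREFIX_LEN = 9       # ASSUMPTION: run and region portion (UniqueIdPrefix)


def _base36(text):
    """Decode a string using the assumed A-Z, 0-9 digit order."""
    value = 0
    for ch in text:
        value = value * 36 + ALPHABET.index(ch)
    return value


def is_standard(accession):
    """True if the name matches the assumed 454 standard accession layout."""
    return (
        len(accession) == ACCESSION_LEN
        and all(ch in ALPHABET for ch in accession)
        and accession[7:9].isdigit()
    )


def split_accession(accession):
    """Return (UniqueIdPrefix, ReadId); the prefix is None for non-standard names."""
    if not is_standard(accession):
        return None, accession          # whole accession is stored as the ReadId
    return accession[:PREFIX_LEN], accession[PREFIX_LEN:]


@dataclass
class VirtualFields:
    accession: str
    region: int
    x: int
    y: int


def virtual_fields(prefix, read_id):
    """Compute the virtual fields from a stored UniqueIdPrefix/ReadId pair."""
    region = int(prefix[7:9])
    # ASSUMPTION: the 5-character suffix encodes x * 4096 + y.
    x, y = divmod(_base36(read_id), 4096)
    return VirtualFields(accession=prefix + read_id, region=region, x=x, y=y)
```

Under these assumptions only the two stored strings are needed per read; a non-standard accession comes back with no prefix, so none of the virtual fields can be computed for it.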
In addition to these core fields, I remember part of the conversation
involving storing in the file the "artificial" sequences used to
identify linkers and primers and such, so there should be an extra set
of fields for the following:
* Keys used for the region
* 3' sequencing primers used for the region
* Paired-end linkers used for the region
* 5' barcodes used for the region
* 3' barcodes used for the region
Is there any information that should be included but that I missed? The
next email will start laying out the details of how this will be encoded
into the SRF file.
Jim