[Ssrformat] fastq format

James Bonfield jkb at sanger.ac.uk
Thu Mar 1 10:04:30 PST 2007


Just to follow up on the meeting, the fastq format is as follows:

@name1
dna-bases
+name1
confidence-values
@name2
dna-bases
+name2
confidence-values
...

The dna-bases and confidence-values are the same length. We store one
byte per confidence in exactly the same way as one byte per base.
It's basically the confidence plus octal 41 (decimal 33, ASCII "!").
Solexa used a different value (I forget what) to cope with the ability
to store negative scores and still be printable. It's recommended to
store dna and confidence values on a single line each. If not, then the
fact that "+" is a valid character in the confidence value encodings
causes all sorts of format issues.

An example of a solexa read in fastq format:

@slxa_0013_1_0001_24
ACAAAAATCACAAGCATTCTTATACACC
+slxa_0013_1_0001_24
??????????????????:??<?<-6%.
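
As a minimal sketch of the encoding described above (the helper names
decode_qual and encode_qual are made up for illustration, and the offset
assumed is the octal 41 one, not the Solexa variant), the confidence
string of this read can be unpacked into integers and packed back again:

```perl
use strict;
use warnings;

# Assumption: the offset of octal 41 (decimal 33, '!') described above.
my $OFFSET = 041;

# Decode a fastq confidence string into a list of integer confidences.
sub decode_qual {
    my ($qual) = @_;
    return map { $_ - $OFFSET } unpack("C*", $qual);
}

# Encode a list of integer confidences back into a fastq string.
sub encode_qual {
    my (@conf) = @_;
    return pack("C*", map { $_ + $OFFSET } @conf);
}

my @conf = decode_qual('??????????????????:??<?<-6%.');
print scalar(@conf), " values: @conf\n";
```

Here "?" (ASCII 63) decodes to a confidence of 30 and "%" (ASCII 37)
to 4, one value per base.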


It's easy to deal with; the perl code we use to convert solexa
*seq.txt and *prb.txt files into fastq has at its core (I'll send the
entire thing if people want it):

-----------------------------------------------------------------------------
    # Compute log-odds to fastq-scale mapping
    my @rescale;
    for (my $i = -100; $i < 100; $i++) {
        $rescale[$i+100] = chr(041+int(10*log(1+10**($i/10))/log(10)+.499));
    }
    my %hmap = ('A' => 0, 'C' => 1, 'G' => 2, 'T' => 3, '.' => 0);

# ... cut ...

    # Format of seq is <lane> <tile> <xpos> <ypos> <sequence>
    while (<$seqfh>) {
        my $q = (<$qualfh>);

        # Decode quality into 1 per called base
        my $seq = [split(/\s+/,$_)]->[4];
        my $i = 0;
        my @qa = split('\t', $q);
        my @qual = map {[split()]->[$hmap{substr($seq, $i++, 1)}]} @qa;

        # Map from log-odds to Phred scale
        @qual = map { $rescale[$_+100] } @qual;

        $_ = $name . "_" . ++$count;
        if (defined($op) && $op) {
            eval $op;
            die if $@;
        }

        $seq =~ s/[^ACGT]/N/gi;
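        # join() avoids the spaces that interpolating "@qual" would insert
        # unless $" has been set to the empty string elsewhere.
        print "\@$_\n" . substr($seq, $opts{trim}) .
             "\n+$_\n" . substr(join("", @qual), $opts{trim}) . "\n";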
    }
-----------------------------------------------------------------------------
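
To make the rescaling step easier to check by hand, here is the same
formula as the @rescale table pulled out into a standalone function.
The name logodds_to_phred is invented for this illustration; the
arithmetic is copied from the loop above.

```perl
use strict;
use warnings;

# Convert one Solexa log-odds score to a Phred-scale confidence using
# 10*log10(1 + 10**(s/10)), rounded as in the @rescale table.
sub logodds_to_phred {
    my ($s) = @_;
    return int(10 * log(1 + 10 ** ($s / 10)) / log(10) + .499);
}

# Print the mapping for a few sample log-odds scores.
for my $s (-10, -5, 0, 10, 40) {
    printf "%4d -> %2d\n", $s, logodds_to_phred($s);
}
```

The two scales agree at high confidence (40 maps to 40) and diverge
near zero, where a log-odds score of 0 maps to a Phred-scale 3.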

I could envisage a perl API into our container format being along
similar lines, either with a "get_read_by_name" query when in
random-access mode or a "next" query when in streaming mode.
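
Purely as a sketch of what the streaming "next" side might look like
(make_reader and the returned hash keys are hypothetical, nothing here
is an existing API, and it reads single-line fastq rather than any
container format), a closure over a filehandle is enough:

```perl
use strict;
use warnings;

# Hypothetical streaming reader: returns a closure yielding one read per
# call, or undef at end of input. Assumes single-line fastq (4 lines/read).
sub make_reader {
    my ($fh) = @_;
    return sub {
        my $name = <$fh>;
        return unless defined $name;
        my $seq  = <$fh>;
        my $plus = <$fh>;
        my $qual = <$fh>;
        die "truncated record\n" unless defined $qual;
        chomp($name, $seq, $plus, $qual);
        die "expected '\@' at start of record\n" unless $name =~ s/^\@//;
        die "expected '+' line\n" unless $plus =~ /^\+/;
        die "sequence/confidence length mismatch\n"
            unless length($seq) == length($qual);
        return { name => $name, seq => $seq, qual => $qual };
    };
}

# Usage: read the example record from an in-memory filehandle.
my $data = join("\n",
    '@slxa_0013_1_0001_24', 'ACAAAAATCACAAGCATTCTTATACACC',
    '+slxa_0013_1_0001_24', '??????????????????:??<?<-6%.') . "\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
my $next = make_reader($fh);
while (my $read = $next->()) {
    print "$read->{name}\t", length($read->{seq}), "\n";
}
```

A "get_read_by_name" query would need an index of some sort on top of
this, which is not sketched here.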

Pros vs fasta/fasta.qual
------------------------

It's compact.

The sequence and quality values are in the same file and adjacent to
one another for ease of parsing.

Trivial handling. Perl pack/unpack makes rapid work of handling the
packed confidence values too.


Cons
----

The name appears twice. Such a waste! I computed it's approx 19%
overhead on one of our recent solexa runs. (Fasta.qual suffers the
same problem of course.)

Multi-line versions are broken. '@' is octal 100 in ASCII meaning it
also encodes a confidence value of 31. If this appears at the start of
a new line then it is interpreted as a new read-name.
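
A quick illustration of the clash, assuming the octal 41 offset and a
made-up confidence string wrapped over two lines:

```perl
use strict;
use warnings;

# Confidence 31 plus the octal 41 offset is octal 100, i.e. '@'.
my $c = chr(041 + 31);
print "confidence 31 encodes as '$c'\n";

# A confidence string wrapped across two lines; the second line happens
# to begin with a confidence of 31 and so looks like a read-name line.
my @wrapped = ('????', '@???');
my @looks_like_header = grep { /^\@/ } @wrapped;
print scalar(@looks_like_header), " confidence line(s) look like a header\n";
```

A line-oriented parser has no way to tell that second line apart from
the start of the next record.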

Limiting ourselves to one-line only per sequence or confidence makes
it hard to write code that handles any length of input sequence,
although it's perfectly adequate for real data (just not as a general
format to import consensus sequences, references, etc).

Confidence values are mandatory.


Suggestion for fastq2?
======================

If we're being pushed to quickly produce a new format for submission of
sequence and phred-scaled confidence then son-of-fastq would be a
good start. Eg:

@name1
+DNA
+DNA
+DNA
-Confidence
-Confidence
-Confidence
@name2
+DNA
+DNA
+DNA
-Confidence
-Confidence
-Confidence

The theory is that each line has a single starting character of "@",
"+" or "-" (adjustable to whatever we want of course). Multiple lines
are still allowed and there's no clash of meaning: newlines are never a
legal confidence code, and the fact that every line ALWAYS has a
starting character makes the meaning unambiguous. For short-read data
one line is sufficient though.

Arguably the "-" confidence lines could be optional, but if they do
exist they should be precisely the same length (after concatenation)
as the DNA.
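
A rough parser for this layout, as a sketch only: parse_fastq2 and its
hash keys are names invented here, the prefix characters are the "@",
"+" and "-" suggested above, and confidence is treated as optional.

```perl
use strict;
use warnings;

# Parse the proposed prefix-per-line layout from a list of lines.
# Returns hashrefs { name, seq, qual }; qual is undef if no "-" lines.
sub parse_fastq2 {
    my (@lines) = @_;
    my (@reads, $cur);
    for my $line (@lines) {
        chomp $line;
        next if $line eq '';
        my ($tag, $rest) = (substr($line, 0, 1), substr($line, 1));
        if ($tag eq '@') {
            $cur = { name => $rest, seq => '', qual => undef };
            push @reads, $cur;
        } elsif (!defined $cur) {
            die "data line before first '\@' line\n";
        } elsif ($tag eq '+') {
            $cur->{seq} .= $rest;
        } elsif ($tag eq '-') {
            $cur->{qual} = '' unless defined $cur->{qual};
            $cur->{qual} .= $rest;
        } else {
            die "unexpected line prefix '$tag'\n";
        }
    }
    # Confidence, when present, must match the concatenated DNA length.
    for my $r (@reads) {
        next unless defined $r->{qual};
        die "length mismatch in $r->{name}\n"
            unless length($r->{seq}) == length($r->{qual});
    }
    return @reads;
}

my @reads = parse_fastq2(
    '@name1', '+ACGT', '+AC', '-??@?', '-+?',
    '@name2', '+GGTT',
);
for my $r (@reads) {
    my $qual = defined $r->{qual} ? $r->{qual} : '(none)';
    print "$r->{name} $r->{seq} $qual\n";
}
```

Note that the made-up confidence lines contain "@" and "+" without
confusing the parser, since only the first character of each line is
treated as a marker.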

If we wish to enforce only one line per dna/confidence, then it can be
simpler still. Eg:

@name1
DNA
Confidence
@name2
DNA
Confidence
...

Comments?

James

-- 
James Bonfield (jkb at sanger.ac.uk)
A Staden Package developer: https://sourceforge.net/projects/staden/

