[Ssrformat] Re: Indexes for Incremental blocks
Asim Siddiqui
asims at bcgsc.ca
Fri Nov 30 10:05:21 PST 2007
Hi,
I'm OK with 32 bits - as you mention, it is still shorter than the equivalent ASCII. I like the idea of switching to hex for display as well.
An incremental order is definitely not required and read ids can be written out of order.
For ultimate future proofing, we could make the field variable length with the size in the data header block - is this overkill? This will work for indexing as the index knows where all the data header blocks are (aside: caching data header block info will probably aid indexing speeds in any case).
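To make the variable-length idea concrete, here is a minimal sketch of decoding a read id whose width in bytes comes from the data header block. The function names, the `id_size` parameter and the big-endian byte order are all assumptions for illustration; none of them are defined by the format yet.

```python
def read_id_to_bytes(read_id: int, id_size: int) -> bytes:
    """Encode an unsigned read id into exactly id_size bytes.

    Big-endian order is an assumption, not something the format specifies.
    Raises OverflowError if the id does not fit in id_size bytes.
    """
    return read_id.to_bytes(id_size, "big")


def read_id_from_bytes(buf: bytes, offset: int, id_size: int) -> int:
    """Decode an unsigned read id of id_size bytes starting at offset.

    id_size would be taken from the (hypothetical) size field in the
    data header block that governs this record.
    """
    if offset < 0 or offset + id_size > len(buf):
        raise ValueError("read id extends past end of buffer")
    return int.from_bytes(buf[offset:offset + id_size], "big")


# A 3-byte id followed by a 4-byte id, as two blocks with different
# header-declared sizes might store them.
record = read_id_to_bytes(0xABCDEF, 3) + read_id_to_bytes(0x01020304, 4)
first = read_id_from_bytes(record, 0, 3)
second = read_id_from_bytes(record, 3, 4)
```

The index only needs the size from the enclosing data header block to step over each id, which is why caching the header info helps.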
Another small change that will mesh well with this is limiting the read id to numerical (hex) values (with the exception of the manufacturer id prefix). Anyone see any problems with this?
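As a rough illustration of what a hex-only read id with a manufacturer prefix could look like on display, here is a small sketch. The prefix "IL", the lack of a separator and the fixed-width zero-padded hex are assumptions made up for the example, not part of any agreed format.

```python
def format_read_id(prefix: str, read_id: int, id_size: int = 4) -> str:
    """Render a read id as prefix + zero-padded lowercase hex.

    id_size is the width of the binary field in bytes (assumed 4 here).
    """
    if not 0 <= read_id < 1 << (8 * id_size):
        raise ValueError("read id does not fit in id_size bytes")
    return f"{prefix}{read_id:0{2 * id_size}x}"


def parse_read_id(text: str, id_size: int = 4) -> tuple[str, int]:
    """Split a displayed read id back into (prefix, numeric id)."""
    digits = 2 * id_size
    if len(text) < digits:
        raise ValueError("text is shorter than the hex field")
    return text[:-digits], int(text[-digits:], 16)


shown = format_read_id("IL", 0xABCDEF)  # "IL" is a hypothetical prefix
prefix, value = parse_read_id(shown)
```

With a fixed hex width the prefix can be split off without a separator character; a different convention would work just as well.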
Asim
-----Original Message-----
From: James Bonfield [mailto:jkb at sanger.ac.uk]
Sent: Fri 30/11/2007 3:30 AM
To: Asim Siddiqui
Cc: ssrformat at bcgsc.ca
Subject: Re: [Ssrformat] Re: Indexes for Incremental blocks
Hello all,
I've asked around locally (Sanger) and there's some reluctance to
switching to 16-bit suffixes. As a compromise one suggestion was to go
with a 32-bit readId.
It may sound like overkill, but the reason is that it offers
flexibility while also not being a massive size increase
overall. Those extra 2 bytes per trace cost us under 1% growth in the
data compared to the current 16-bit proposal. However for those 2
bytes we gain several things.
1) 65536 isn't sufficient if we wanted to keep the concept of 1 tile =
1 "block" of data. This isn't a critical issue as we can generate
multiple data block headers from a single tile, but it may be
preferable for general house-keeping and naming of data.
2) There's still often a desire to keep track of X/Y coordinates in
order to keep track of QC and averaged metrics based on regions of
an image.
3) Note that a fixed 32-bit value is still typically 2-3 bytes smaller
than the ascii format readID we've been locally producing. So, for
us at least, it still is saving overall.
A 24-bit value would mean instead of an incrementing counter we could
just produce the value from an arbitrary pair of unique 12-bit numbers
(eg coordinates from a 4096x4096 pixel image). A 32-bit value offers
a bit more flexibility and future proofing.
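The coordinate-pair idea can be sketched as follows. This is only an illustration: the function names and the choice of putting x in the high bits are assumptions, and any fixed layout would do as long as writer and reader agree.

```python
def pack_xy(x: int, y: int, bits: int = 12) -> int:
    """Combine two unsigned coordinates into one unique value.

    With bits=12 the result fits in 24 bits (e.g. a 4096x4096 image).
    Placing x in the high bits is an arbitrary choice for this sketch.
    """
    limit = 1 << bits
    if not (0 <= x < limit and 0 <= y < limit):
        raise ValueError(f"coordinates must be in the range 0..{limit - 1}")
    return (x << bits) | y


def unpack_xy(value: int, bits: int = 12) -> tuple[int, int]:
    """Recover (x, y) from a value produced by pack_xy."""
    mask = (1 << bits) - 1
    return value >> bits, value & mask


read_id = pack_xy(1023, 2047)
coords = unpack_xy(read_id)
```

Because each (x, y) pair maps to a distinct value, the id is unique within an image without needing a counter. Using bits=16 gives the same scheme in a 32-bit value with room for larger images.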
The point being that we shouldn't define this value to be an
incremental value, but simply a unique value of a specified size.
If you're worried about the additional space used up by 2 extra bytes
per trace then consider reducing the ztrBlobSize from uint32 to
something smaller. There's no way we'll get an individual ZTR file of
16MB in size (24-bit) and it's rare even for capillary data to break
64K ZTR size let alone the new technology data. Also as previously
stated the ZTR specification itself is pretty inefficient for such
small data, although that's likely the subject of a later round of
optimisations.
James
--
James Bonfield (jkb at sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
| Plurima gyrabant gymbolitare vabo;
A Staden Package developer: | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.