Posters: Advances in Genome Biology and Technology (AGBT) 2003
Posters presented at the Advances in Genome Biology and Technology (AGBT) meeting by the Genome Sciences Centre are available.
Verification of Drosophila Sequence Assembly using restriction digest BAC fingerprints derived from multiple enzymes
Krzywinski M, Schein J, Chiu R, Bosdet I, Mathewson C, Wye N, Barber S, Brown-John M, Chand S, Cloutier A, Masson A, Mayo M, Olson T, Jones S, Hoskins R, Celniker S, Rubin G, Marra M
Berkeley Drosophila Genome Project, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Dept of Molecular and Cellular Biology and Howard Hughes Medical Institute, UC Berkeley, Berkeley, CA, USA
The annotated D. melanogaster genomic sequence is currently in its third revision (Release 3) and covers nearly all of the 120 Mb of euchromatic DNA. The sequence assembly is curated by the Berkeley Drosophila Genome Project (BDGP) and comprises data produced at Celera Genomics, Genoscope, Lawrence Berkeley National Laboratory, and the Human Genome Sequencing Center at the Baylor College of Medicine. Drosophila continues to play a major role as a model for inheritance and gene interaction, and a high-quality assembly is required to ensure the accuracy of sequence-based analysis. To this end, we have developed an automated data analysis and visualization pipeline for verification of the sequence assembly using multiple restriction enzyme digests of tiling path BAC clones. Various types of repeat regions produce incorrect, but self-consistent, sequence assemblies. These errors are very difficult to spot without an external validation method. The fingerprint verification method offers several benefits: the sequence is verified by an independent laboratory process, the fingerprints are robust in elucidating repeat elements, and the data processing pipeline is extensible and can be adapted to any sequence data.
A set of 988 tiling path clones spanning the euchromatic portion of the genome was selected. Each clone was independently fingerprinted using five different restriction enzymes. The enzymes were chosen to maximize coverage of the sequence with fragments in the size range of 1-20 kb to facilitate accurate sizing. The enzymes selected were ApaLI, BamHI, EcoRI, HindIII and XhoI. This combination provides coverage by at least two, three and four optimally-sized fragments for 99.9%, 98% and 87% of the sequence, respectively.
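The coverage figures above can be illustrated with a small in silico digest. The following is a minimal sketch, not the pipeline used for the poster: it assumes a linear sequence without vector, and the function names (`digest`, `sizeable_depth`, `fraction_covered`) are hypothetical. The recognition sites are the standard ones for the five enzymes, each cutting after the first base of its site.

```python
# Standard recognition sites; all five enzymes cut after the first base.
SITES = {
    "ApaLI": "GTGCAC",
    "BamHI": "GGATCC",
    "EcoRI": "GAATTC",
    "HindIII": "AAGCTT",
    "XhoI": "CTCGAG",
}
CUT_OFFSET = 1


def digest(seq, site):
    """Return half-open (start, end) fragment intervals for a linear digest."""
    seq = seq.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + CUT_OFFSET)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]


def sizeable_depth(seq, sites=SITES, lo=1000, hi=20000):
    """Per-base count of enzymes whose fragment at that base has lo <= size <= hi."""
    depth = [0] * len(seq)
    for site in sites.values():
        for a, b in digest(seq, site):
            if lo <= b - a <= hi:
                for i in range(a, b):
                    depth[i] += 1
    return depth


def fraction_covered(depth, k):
    """Fraction of bases covered by at least k optimally sized fragments."""
    return sum(d >= k for d in depth) / len(depth) if depth else 0.0
```

Calling `fraction_covered(sizeable_depth(clone_sequence), 2)` gives the fraction of a clone covered by at least two fragments in the 1-20 kb range, the kind of quantity reported above for the whole tiling path.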
An in silico fingerprint of each clone was derived from the sequence and compared to its experimental counterpart using a Needleman-Wunsch alignment and a 2% fragment size tolerance. Each base of the sequence is assigned a verification depth that corresponds to the number of experimentally verified in silico fragments containing that sequence location. The average verification depth is used as a measure of overall verification. We have devised various figures of merit to identify clones with unverified subsequences and to categorize the discrepancies. An interactive web-based system has been created to visualize verification coverage. To date, we have discovered 105 BACs (12% of 891 active BACs in our validation pipeline, see Figure 10) whose validation profile indicates a possible sequence assembly error in regions totalling 709 kb. In 29 of these cases (3.3%, seven are Phase 2), within regions spanning 274 kb, the inconsistency was verified and the assembly construction in these regions will be thoroughly rechecked. The other 76 BACs require additional digests to validate the sequence assembly because the regions of inconsistency are represented by bands that are too large (>30 kb) or too numerous (>3X copy number). The method described here will also be applied to verification of heterochromatic DNA sequence, which is being generated using smaller clones. We anticipate that this fingerprint-based sequence verification methodology can positively impact the final sequence assembly quality of other organisms such as human, mouse and rat.
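The comparison and the verification depth can be sketched as follows. This is an illustration under stated assumptions rather than the poster's implementation: the alignment is a Needleman-Wunsch-style global alignment of two size-sorted fragment lists in which a pair scores 1 if the sizes agree within the tolerance and unmatched fragments cost nothing, and the names `within_tolerance`, `align_fragments` and `verification_depth` are hypothetical.

```python
def within_tolerance(a, b, tol=0.02):
    """True if two fragment sizes agree within a relative tolerance."""
    return abs(a - b) <= tol * max(a, b)


def align_fragments(in_silico, observed, tol=0.02):
    """Globally align two ascending size lists; return matched (i, j) index pairs."""
    n, m = len(in_silico), len(observed)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(score[i - 1][j], score[i][j - 1])  # leave one fragment unmatched
            if within_tolerance(in_silico[i - 1], observed[j - 1], tol):
                best = max(best, score[i - 1][j - 1] + 1)  # pair the two fragments
            score[i][j] = best
    # Trace back from the corner to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if (within_tolerance(in_silico[i - 1], observed[j - 1], tol)
                and score[i][j] == score[i - 1][j - 1] + 1):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i - 1][j] >= score[i][j - 1]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def verification_depth(length, per_enzyme, tol=0.02):
    """Per-base count of verified in silico fragments.

    per_enzyme: list of (fragments, observed_sizes), one entry per enzyme,
    where fragments are half-open (start, end) intervals on the sequence.
    """
    depth = [0] * length
    for fragments, observed in per_enzyme:
        ordered = sorted(fragments, key=lambda f: f[1] - f[0])
        sizes = [b - a for a, b in ordered]
        for i, _ in align_fragments(sizes, sorted(observed), tol):
            a, b = ordered[i]
            for pos in range(a, b):
                depth[pos] += 1
    return depth
```

With one digest whose in silico fragments are 1, 3 and 5 kb and whose observed bands are 1,010, 5,050 and 7,000 bp, the 1 kb and 5 kb fragments are verified and the 3 kb fragment is not, so the bases of the 3 kb fragment keep a depth of zero. Regions whose depth stays low across all five digests are the ones a figure of merit would flag.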
A Set of Rearrayed BAC Clones spanning the human genome
Krzywinski M, Bosdet I, Smailus D, Mathewson C, Wye N, Barber S, Brown-John M, Chand S, Cloutier A, Masson A, Mayo M, Olson T, Lam W, MacAuley C, Osoegawa K, Zhao S, de Jong PJ, Schein J, Jones S, Marra M
Childrens Hospital Oakland Research Institute, Oakland, CA, USA
The Institute for Genomic Research, Rockville, MD, USA
From the human fingerprint map constructed at Washington University Genome Sequencing Center, we have selected a set of 32,433 BACs that span the human genome. The purpose of the clone set is to serve as a genome-ordered set of probes for FISH and microarray-based BAC CGH experiments. The comprehensive coverage of this clone set makes it a valuable asset in both research and clinical contexts, in the search for understanding and detection of cancer-related chromosomal and expression alterations.
The clones have been sampled from RPCI-11/13 (94%) and Caltech-D (6%) libraries, selected to optimize size, coverage of the map and consistent overlap. The clones have been rearrayed into 384-well format. The identity of clones has been validated by fingerprinting. Following the first round selection of 29,035 clones, a combination of automated and visual fingerprint inspection identified 1,978 clones that did not match the fingerprints stored in the fingerprint map. 4,531 clones were added to the set to maximally conserve map coverage of the unmatched clones. Analysis of the set's sequence coverage (UCSC, 2002/06 assembly) resulted in the selection of an additional 1,258 clones, with some chosen from outside the fingerprint map, to cover gaps larger than 10 kb. During the second round of fingerprint validation 413 clones were rejected.
The clone set covers 99% of the November 2001 version of the BAC fingerprint map. Using fingerprint-based localization, end sequence data and assembly coordinate data, 30,561 of the clones were localized within the genome and found to cover 2.788 Gb (99%+) of the assembled sequence. Approximately 35 Mb of this coverage was provided by clones not found in the fingerprint map. Approximately 82% of the assembly is covered at 1X and 2X in a 1:1 ratio. The set leaves 729 sequence coverage gaps totaling 24 Mb, with 46% of the gaps being smaller than 10 kb. The average resolution of the clone set is 46 kb.
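Coverage statistics of this kind (bases covered at 1X, 2X and so on, plus the list of gaps) can be computed from localized clone coordinates with a sweep over interval endpoints. The sketch below is an illustration only: it assumes clones are given as half-open (start, end) intervals lying within a single sequence of known length, and the function name `coverage_profile` is hypothetical.

```python
def coverage_profile(clones, genome_length):
    """Return (bases_by_depth, gaps) for half-open clone intervals.

    bases_by_depth maps a coverage depth to the number of bases at that depth;
    gaps lists the (start, end) intervals covered by no clone.
    Assumes 0 <= start < end <= genome_length for every clone.
    """
    events = []
    for start, end in clones:
        events.append((start, 1))   # a clone begins
        events.append((end, -1))    # a clone ends
    events.sort()

    bases_by_depth = {}
    gaps = []
    depth, prev = 0, 0
    for pos, delta in events:
        if pos > prev:
            bases_by_depth[depth] = bases_by_depth.get(depth, 0) + pos - prev
            if depth == 0:
                gaps.append((prev, pos))
            prev = pos
        depth += delta
    if genome_length > prev:
        # Uncovered tail after the last clone.
        bases_by_depth[0] = bases_by_depth.get(0, 0) + genome_length - prev
        gaps.append((prev, genome_length))
    return bases_by_depth, gaps
```

The ratio of `bases_by_depth[1]` to `bases_by_depth[2]` corresponds to the 1X:2X ratio quoted above, and the share of gaps below a size threshold follows directly from the returned gap list.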
This first version of the clone set is publicly distributed through the BACPAC Resources Centre (Childrens Hospital, Oakland). We anticipate that the set will evolve as new versions of the sequence assembly and physical map are released. We are planning to create an analogous resource for the mouse and rat genomes.
An Integrated Approach to Transposon-Mediated Full Length cDNA Sequencing
Butterfield Y, MacDonald K, Stott J, Yang G, Smailus D, Griffith O, Guin R, Barber S, Girn N, Lee D, Prabhu A, Tsai M, Schein J, Jones S, Marra M
As a participant in the Mammalian Gene Collection (MGC) initiative (http://mgc.nci.nih.gov/) [1], we have generated accurate and complete sequence for 6,808 human and mouse genes. We have derived both computational and laboratory techniques to efficiently sequence these small insert clones. In addition to the publicly available software Phred, Phrap [2] and Consed [3], we have developed a number of programs to expedite the sequencing of the clones, and improve communication between the biochemistry and bioinformatics activities in our laboratory.
After sizing of clones and EST sequencing, results from BLAST and BLAT are used to confirm the identity of the clones and to check for problematic clones such as chimeras. The complete sequence for smaller sized cDNAs can sometimes be deduced from EST sequences alone. For determining the full-length sequence of larger cDNAs, we make use of the Mu transposon. A number of different laboratory strategies are used to facilitate the sequencing of clones from various libraries and vector systems. In order to completely avoid repeated sequencing of vector, we have used Gateway technology (Invitrogen) when possible. Sequencing libraries are constructed containing up to 96 cDNA clones into which transposons are randomly inserted. Algorithms make use of insert sizing and DNA concentration information to divide clones into appropriate pools and to ensure that sequencing reads from these pooled libraries cover clones of non-uniform size evenly.
The EST sequences, sequences derived from transposons, and finishing reads for all the clones from a transposon pool are assembled together. Clones are automatically identified based on appropriate ESTs in the assembled contigs. We analyze sequence contigs computationally and automatically identify those that pass sequence integrity and quality checks and those that require further sequencing. We have also been able to analyze the transposon insertion profile and the effect on pooling sets of cDNA clones in various ratios.
Transcriptome Profiling Technologies at the British Columbia Cancer Agency
Khattra J, Chan S, Asano J, Pandoh P, Coughlin S, McDonald H, Girn N, Vatcher G, Schnerch A, Freeman D, Zuyderduyn S, Leung D, Teague K, Jones S, Marra M
The Genome Sciences Centre has established a Gene Expression Laboratory with the purpose of consolidating high throughput transcriptome analysis platforms and related technology development.
Global gene expression profiling is being applied to research in cancer genomics (V. Ling, C. Eaves), atlas of development in mouse (P. Hoodless, E. Simpson, C. Helgason), biology of C. elegans (D. Baillie, D. Moerman, D. Riddle, J. McGhee), a variety of mouse and human embryonic stem cells (K. Humphries, C. Eaves, J. Thomson), transient hypoxia of human tumor cells (R. Durand), and host response to pathogens (R. Brunham, C. Astell).
Our choice and design of experimental platforms address the following issues critical to successful gene expression profiling in the wide variety of projects listed above:
- Isolation and analysis of top quality RNA, including efficient isolation from minute samples.
- Quantitative global gene expression profiling of both known and novel gene transcripts using Affymetrix GeneChips and Serial Analysis of Gene Expression (SAGE).
- Accurate validation of selected transcripts via quantitative real-time RT-PCR (ABI Prism 7900HT Sequence Detection System).