The Human Leukocyte Antigen (HLA) gene locus plays a fundamental role in human immunity, and it is established that certain HLA alleles are disease determinants. Previously, we have identified prevalent HLA class I and class II alleles, including DPA1*02:02, in two small patient cohorts at the COVID-19 pandemic onset.
We have since analyzed a larger public patient cohort data (n = 126 patients) with controls, associated demographic and clinical data. By combining the predictive power of multiple in silico HLA predictors, we report on HLA-I and HLA-II alleles, along with their associated risk significance.
We observe HLA-II DPA1*02:02 at a higher frequency in the COVID-19 positive cohort (29%) when compared to the COVID-negative control group (Fisher’s exact test [FET] p = 0.0174). Having this allele, however, does not appear to put this cohort’s patients at an increased risk of hospitalization. Inspection of COVID-19 disease severity outcomes, including admission to intensive care, reveal nominally significant risk associations with A*11:01 (FET p = 0.0078) and C*04:01 (FET p = 0.0087). The association with severe disease outcome is especially evident for patients with C*04:01, where disease prognosis measured by mechanical ventilation-free days was statistically significant after multiple hypothesis correction (Bonferroni p = 0.0323). While prevalence of some of these alleles falls below statistical significance after Bonferroni correction, COVID-19 patients with HLA-I C*04:01 tend to fare worse overall. This HLA allele may hold potential clinical value.
IEEE/ACM Transactions on Computational Biology and Bioinformatics
Short-read DNA sequencing instruments can yield over 10^12 bases per run, typically composed of reads 150 bases long. Despite this high throughput, de novo assembly algorithms have difficulty reconstructing contiguous genome sequences using short reads due to both repetitive and difficult-to-sequence regions in these genomes. Some of the short read assembly challenges are mitigated by scaffolding assembled sequences using paired-end reads. However, unresolved sequences in these scaffolds appear as "gaps". Here, we introduce GapPredict an implementation of a proof of concept that uses a character-level language model to predict unresolved nucleotides in scaffold gaps. We benchmarked GapPredict against the state-of-the-art gap-filling tool Sealer, and observed that the former can fill 65.6% of the sampled gaps that were left unfilled by the latter with high similarity to the reference genome, demonstrating the practical utility of deep learning approaches to the gap-filling problem in genome assembly.
Tandem repeat (TR) expansion is the underlying cause of over 40 neurological disorders. Long-read sequencing offers an exciting avenue over conventional technologies for detecting TR expansions. Here, we present Straglr, a robust software tool for both targeted genotyping and novel expansion detection from long-read alignments. We benchmark Straglr using various simulations, targeted genotyping data of cell lines carrying expansions of known diseases, and whole genome sequencing data with chromosome-scale assembly. Our results suggest that Straglr may be useful for investigating disease-associated TR expansions using long-read sequencing.
Background: Screening for short tandem repeat (STR) expansions in next-generation sequencing data can enable diagnosis, optimal clinical management/treatment, and accurate genetic counseling of patients with repeat expansion disorders. We aimed to develop an efficient computational workflow for reliable detection of STR expansions in next-generation sequencing data and demonstrate its clinical utility.
Methods: We characterized the performance of eight STR analysis methods (lobSTR, HipSTR, RepeatSeq, ExpansionHunter, TREDPARSE, GangSTR, STRetch, and exSTRa) on next-generation sequencing datasets of samples with known disease-causing full-mutation STR expansions and genomes simulated to harbor repeat expansions at selected loci and optimized their sensitivity. We then used a machine learning decision tree classifier to identify an optimal combination of methods for full-mutation detection. In Burrows-Wheeler Aligner (BWA)-aligned genomes, the ensemble approach of using ExpansionHunter, STRetch, and exSTRa performed the best (precision = 82%, recall = 100%, F1-score = 90%). We applied this pipeline to screen 301 families of children with suspected genetic disorders.
Results: We identified 10 individuals with full-mutations in the AR, ATXN1, ATXN8, DMPK, FXN, or HTT disease STR locus in the analyzed families. Additional candidates identified in our analysis include two probands with borderline ATXN2 expansions between the established repeat size range for reduced-penetrance and full-penetrance full-mutation and seven individuals with FMR1 CGG repeats in the intermediate/premutation repeat size range. In 67 probands with a prior negative clinical PCR test for the FMR1, FXN, or DMPK disease STR locus, or the spinocerebellar ataxia disease STR panel, our pipeline did not falsely identify aberrant expansion. We performed clinical PCR tests on seven (out of 10) full-mutation samples identified by our pipeline and confirmed the expansion status in all, showing absolute concordance between our bioinformatics and molecular findings.
Conclusions: We have successfully demonstrated the application of a well-optimized bioinformatics pipeline that promotes the utility of genome-wide sequencing as a first-tier screening test to detect expansions of known disease STRs. Interrogating clinical next-generation sequencing data for pathogenic STR expansions using our ensemble pipeline can improve diagnostic yield and enhance clinical outcomes for patients with repeat expansion disorders.
As the year 2020 came to a close, several new strains have been reported of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the agent responsible for the coronavirus disease 2019 (COVID-19) pandemic that has afflicted us all this past year. However, it is difficult to comprehend the scale, in sequence space, geographical location and time, at which SARS-CoV-2 mutates and evolves in its human hosts. To get an appreciation for the rapid evolution of the coronavirus, we built interactive scalable vector graphics maps that show daily nucleotide variations in genomes from the six most populated continents compared to that of the initial, ground-zero SARS-CoV-2 isolate sequenced at the beginning of the year.
In this study, researchers assessed if there was a link between host immunity genes (e.g., Human Leukocyte Antigen (HLA) genes) and susceptibility or resistance to COVID-19 using transcriptome sequencing (RNA-Seq) on libraries made from the bronchoalveolar lavage (BAL) fluid and peripheral blood mononuclear cell (PMBC) samples of COVID-19 patients.
Presence or absence of gene fusions is one of the most important diagnostic markers in many cancer types. Consequently, fusion detection methods using various genomics data types, such as RNA sequencing (RNA-seq) are valuable tools for research and clinical applications. While information-rich RNA-seq data have proven to be instrumental in discovery of a number of hallmark fusion events, bioinformatics tools to detect fusions still have room for improvement. Here, we present Fusion-Bloom, a fusion detection method that leverages recent developments in de novo transcriptome assembly and assembly-based structural variant calling technologies (RNA-Bloom and PAVFinder, respectively). We benchmarked Fusion-Bloom against the performance of five other state-of-the-art fusion detection tools using multiple datasets. Overall, we observed Fusion-Bloom to display a good balance between detection sensitivity and specificity. We expect the tool to find applications in translational research and clinical genomics pipelines.
Haploid cell lines are a valuable research tool with broad applicability for genetic assays. As such the fully haploid human cell line, eHAP1, has been used in a wide array of studies. However, the absence of a corresponding reference genome sequence for this cell line has limited the potential for more widespread applications to experiments dependent on available sequence, like capture-clone methodologies. We generated ~15× coverage Nanopore long reads from ten GridION flowcells and utilized this data to assemble a de novo draft genome using minimap and miniasm and subsequently polished using Racon. This assembly was further polished using previously generated, low-coverage, Illumina short reads with Pilon and ntEdit. This resulted in a hybrid eHAP1 assembly with >90% complete BUSCO scores. We further assessed the eHAP1 long read data for structural variants using Sniffles and identify a variety of rearrangements, including a previously established Philadelphia translocation. Finally, we demonstrate how some of these variants overlap open chromatin regions, potentially impacting regulatory regions. By integrating both long and short reads, we generated a high-quality reference assembly for eHAP1 cells. The union of long and short reads demonstrates the utility in combining sequencing platforms to generate a high-quality reference genome de novo solely from low coverage data. We expect the resulting eHAP1 genome assembly to provide a useful resource to enable novel experimental applications in this important model cell line.
Bioinformatics (Oxford, England), 2019
In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes.
Bioinformatics (Oxford, England), 2019
The ORCA bioinformatics environment is a Docker image that contains hundreds of bioinformatics tools and their dependencies. The ORCA image and accompanying server infrastructure provide a comprehensive bioinformatics environment for education and research. The ORCA environment on a server is implemented using Docker containers, but without requiring users to interact directly with Docker, suitable for novices who may not yet have familiarity with managing containers. ORCA has been used successfully to provide a private bioinformatics environment to external collaborators at a large genome institute, for teaching an undergraduate class on bioinformatics targeted at biologists, and to provide a ready-to-go bioinformatics suite for a hackathon. Using ORCA eliminates time that would be spent debugging software installation issues, so that time may be better spent on education and research.
Microbiology resource announcements, 2019
Engelmann spruce () is a conifer found primarily on the west coast of North America. Here, we present the complete chloroplast genome sequence of genotype Se404-851. This chloroplast sequence will benefit future conifer genomic research and contribute resources to further species conservation efforts.
Microbiology resource announcements, 2019
Here, we present the complete chloroplast genome sequence of white spruce (, genotype WS77111), a coniferous tree widespread in the boreal forests of North America. This sequence contributes to genomic and phylogenetic analyses of the genus that are part of ongoing research to understand their adaptation to environmental stress.
Scientific reports, 2019
Antimicrobial peptides (AMPs) exhibit broad-spectrum antimicrobial activity, and have promise as new therapeutic agents. While the adult North American bullfrog (Rana [Lithobates] catesbeiana) is a prolific source of high-potency AMPs, the aquatic tadpole represents a relatively untapped source for new AMP discovery. The recent publication of the bullfrog genome and transcriptomic resources provides an opportune bridge between known AMPs and bioinformatics-based AMP discovery. The objective of the present study was to identify novel AMPs with therapeutic potential using a combined bioinformatics and wet lab-based approach. In the present study, we identified seven novel AMP precursor-encoding transcripts expressed in the tadpole. Comparison of their amino acid sequences with known AMPs revealed evidence of mature peptide sequence conservation with variation in the prepro sequence. Two mature peptide sequences were unique and demonstrated bacteriostatic and bactericidal activity against Mycobacteria but not Gram-negative or Gram-positive bacteria. Nine known and seven novel AMP-encoding transcripts were detected in premetamorphic tadpole back skin, olfactory epithelium, liver, and/or tail fin. Treatment of tadpoles with 10 nM 3,5,3'-triiodothyronine for 48 h did not affect transcript abundance in the back skin, and had limited impact on these transcripts in the other three tissues. Gene mapping revealed considerable diversity in size (1.6-15 kbp) and exon number (one to four) of AMP-encoding genes with clear evidence of alternative splicing leading to both prepro and mature amino acid sequence diversity. These findings verify the accuracy and utility of the bullfrog genome assembly, and set a firm foundation for bioinformatics-based AMP discovery.
The grizzly bear ( ssp. ) represents the largest population of brown bears in North America. Its genome was sequenced using a microfluidic partitioning library construction technique, and these data were supplemented with sequencing from a nanopore-based long read platform. The final assembly was 2.33 Gb with a scaffold N50 of 36.7 Mb, and the genome is of comparable size to that of its close relative the polar bear (2.30 Gb). An analysis using 4104 highly conserved mammalian genes indicated that 96.1% were found to be complete within the assembly. An automated annotation of the genome identified 19,848 protein coding genes. Our study shows that the combination of the two sequencing modalities that we used is sufficient for the construction of highly contiguous reference quality mammalian genomes. The assembled genome sequence and the supporting raw sequence reads are available from the NCBI (National Center for Biotechnology Information) under the bioproject identifier PRJNA493656, and the assembly described in this paper is version QXTK01000000.
BMC bioinformatics, 2018
Genome sequencing yields the sequence of many short snippets of DNA (reads) from a genome. Genome assembly attempts to reconstruct the original genome from which these reads were derived. This task is difficult due to gaps and errors in the sequencing data, repetitive sequence in the underlying genome, and heterozygosity. As a result, assembly errors are common. In the absence of a reference genome, these misassemblies may be identified by comparing the sequencing data to the assembly and looking for discrepancies between the two. Once identified, these misassemblies may be corrected, improving the quality of the assembled sequence. Although tools exist to identify and correct misassemblies using Illumina paired-end and mate-pair sequencing, no such tool yet exists that makes use of the long distance information of the large molecules provided by linked reads, such as those offered by the 10x Genomics Chromium platform. We have developed the tool Tigmint to address this gap.
BMC medical genomics, 2018
RNA-seq is a powerful and cost-effective technology for molecular diagnostics of cancer and other diseases, and it can reach its full potential when coupled with validated clinical-grade informatics tools. Despite recent advances in long-read sequencing, transcriptome assembly of short reads remains a useful and cost-effective methodology for unveiling transcript-level rearrangements and novel isoforms. One of the major concerns for adopting the proven de novo assembly approach for RNA-seq data in clinical settings has been the analysis turnaround time. To address this concern, we have developed a targeted approach to expedite assembly and analysis of RNA-seq data.
BMC genomics, 2018
Alternative polyadenylation (APA) results in messenger RNA molecules with different 3' untranslated regions (3' UTRs), affecting the molecules' stability, localization, and translation. APA is pervasive and implicated in cancer. Earlier reports on APA focused on 3' UTR length modifications and commonly characterized APA events as 3' UTR shortening or lengthening. However, such characterization oversimplifies the processing of 3' ends of transcripts and fails to adequately describe the various scenarios we observe.