Antibiotic resistance is a global health crisis increasing in prevalence every day. To combat this crisis, alternative antimicrobial therapeutics are urgently needed. Antimicrobial peptides (AMPs), a family of short defense proteins, are produced naturally by all organisms and hold great potential as effective alternatives to small molecule antibiotics. Here, we present rAMPage, a scalable bioinformatics discovery platform for identifying AMP sequences from RNA sequencing (RNA-seq) datasets. In our study, we demonstrate the utility and scalability of rAMPage, running it on 84 publicly available RNA-seq datasets from 75 amphibian and insect species, taxa known to have rich AMP repertoires. Across these datasets, we identified 1137 putative AMPs, 1024 of which were deemed novel by a homology search against cataloged AMPs in public databases. We selected 21 peptide sequences from this set for antimicrobial susceptibility testing against Escherichia coli and Staphylococcus aureus and observed that seven of them have high antimicrobial activity. Our study illustrates how in silico methods such as rAMPage can enable the fast and efficient discovery of novel antimicrobial peptides as an effective first step in the strenuous process of antimicrobial drug development.
Circulating tumour DNA (ctDNA) in blood plasma is an emerging tool for clinical cancer genotyping and longitudinal disease monitoring1. However, owing to past emphasis on targeted and low-resolution profiling approaches, our understanding of the distinct populations that comprise bulk ctDNA is incomplete2-12. Here we perform deep whole-genome sequencing of serial plasma and synchronous metastases in patients with aggressive prostate cancer. We comprehensively assess all classes of genomic alterations and show that ctDNA contains multiple dominant populations, the evolutionary histories of which frequently indicate whole-genome doubling and shifts in mutational processes. Although tissue and ctDNA showed concordant clonally expanded cancer driver alterations, most individual metastases contributed only a minor share of total ctDNA. By comparing serial ctDNA before and after clinical progression on potent inhibitors of the androgen receptor (AR) pathway, we reveal population restructuring converging solely on AR augmentation as the dominant genomic driver of acquired treatment resistance. Finally, we leverage nucleosome footprints in ctDNA to infer mRNA expression in synchronously biopsied metastases, including treatment-induced changes in AR transcription factor signalling activity. Our results provide insights into cancer biology and show that liquid biopsy can be used as a tool for comprehensive multi-omic discovery.
Diffuse large B cell lymphoma (DLBCL) is the most common B cell non-Hodgkin lymphoma and remains incurable in around 40% of patients. Efforts to sequence the coding genome identified several genes and pathways that are altered in this disease, including potential therapeutic targets1-5. However, the non-coding genome of DLBCL remains largely unexplored. Here we show that active super-enhancers are highly and specifically hypermutated in 92% of samples from individuals with DLBCL, display signatures of activation-induced cytidine deaminase activity, and are linked to genes that encode B cell developmental regulators and oncogenes. As evidence of oncogenic relevance, we show that the hypermutated super-enhancers linked to the BCL6, BCL2 and CXCR4 proto-oncogenes prevent the binding and transcriptional downregulation of the corresponding target gene by transcriptional repressors, including BLIMP1 (targeting BCL6) and the steroid receptor NR3C1 (targeting BCL2 and CXCR4). Genetic correction of selected mutations restored repressor DNA binding, downregulated target gene expression and led to the counter-selection of cells containing corrected alleles, indicating an oncogenic dependency on the super-enhancer mutations. This pervasive super-enhancer mutational mechanism reveals a major set of genetic lesions deregulating gene expression, which expands the involvement of known oncogenes in DLBCL pathogenesis and identifies new deregulated gene targets of therapeutic relevance.
Aim: This study examined circulating cell-free DNA (cfDNA) biomarkers associated with resistance to androgen-targeted treatment in metastatic castration-resistant prostate cancer (mCRPC). Materials & methods: We designed a panel of nine candidate cfDNA methylation markers using droplet digital PCR (Methyl-ddPCR) and assessed methylation levels in sequentially collected cfDNA samples from patients with mCRPC. Results: Increased cfDNA methylation in eight out of nine markers during androgen-targeted treatment correlated with a shorter time to clinical progression. Cox proportional hazards modeling and logistic regression analysis further confirmed that higher cfDNA methylation during treatment was significantly associated with clinical progression. Conclusion: Overall, our findings have revealed a novel methylated cfDNA marker panel that could aid in the clinical management of metastatic prostate cancer.
Introduction: Increasingly, logistic regression methods for genetic association studies of binary phenotypes must be able to accommodate data sparsity, which arises from unbalanced case-control ratios and/or rare genetic variants. Sparseness leads to maximum likelihood estimators (MLEs) of log-OR parameters that are biased away from their null value of zero and tests with inflated type 1 errors. Different penalized-likelihood methods have been developed to mitigate sparse-data bias. We study penalized logistic regression using a class of log-F priors indexed by a shrinkage parameter m to shrink the biased MLE towards zero. For a given m, log-F-penalized logistic regression may be easily implemented using data augmentation and standard software.
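The data-augmentation idea can be sketched as follows. This is a minimal illustration, not the authors' software: it assumes the common formulation in which a log-F(m, m) prior on one log-OR is encoded as a single pseudo-record with m/2 successes out of m trials, covariate 1 for the penalized variant and 0 for the intercept. The toy counts, the function names `augment_log_f` and `fit_logistic`, and the choice m = 2 are all hypothetical, and a small hand-written Newton-Raphson fit stands in for "standard software".

```python
import math

def augment_log_f(rows, m):
    """Append the pseudo-record that encodes a log-F(m, m) prior on the
    variant coefficient. Each row is (x0, x1, successes, trials), where x0 is
    the intercept column and x1 the variant column. The pseudo-record has
    x0 = 0 so that the intercept is left unpenalized (an assumption here)."""
    return rows + [(0.0, 1.0, m / 2.0, float(m))]

def fit_logistic(rows, n_iter=25):
    """Ordinary maximum-likelihood fit of a two-parameter binomial logistic
    model by Newton-Raphson; returns (intercept, variant log-OR)."""
    a = b = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x0, x1, s, n in rows:
            p = 1.0 / (1.0 + math.exp(-(a * x0 + b * x1)))
            r = s - n * p            # score contribution
            w = n * p * (1.0 - p)    # information weight
            g0 += x0 * r
            g1 += x1 * r
            h00 += w * x0 * x0
            h01 += w * x0 * x1
            h11 += w * x1 * x1
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Hypothetical sparse data: 10 cases among 50 non-carriers, and all 3
# carriers are cases, so the unpenalized MLE of the log-OR is infinite.
observed = [(1.0, 0.0, 10.0, 50.0), (1.0, 1.0, 3.0, 3.0)]
alpha, beta = fit_logistic(augment_log_f(observed, m=2))
print(f"intercept = {alpha:.3f}, penalized log-OR = {beta:.3f}")
```

Fitting the augmented data with an unmodified logistic routine yields a finite, shrunken log-OR even though the carrier group contains no controls. Larger values of m shrink the estimate more strongly towards zero.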
Method: We propose a two-step approach to the analysis of a genetic association study: first, a set of variants that show evidence of association with the trait is used to estimate m; and second, the estimated m is used for log-F-penalized logistic regression analyses of all variants using data augmentation with standard software. Our estimate of m is the maximizer of a marginal likelihood obtained by integrating the latent log-ORs out of the joint distribution of the parameters and observed data. We consider two approximate approaches to maximizing the marginal likelihood: (i) a Monte Carlo EM algorithm (MCEM) and (ii) a Laplace approximation (LA) to each integral, followed by derivative-free optimization of the approximation.
Results: We evaluated the statistical properties of our proposed two-step method and compared its performance to other shrinkage methods in a simulation study. Our simulation studies suggest that the proposed log-F-penalized approach has lower bias and mean squared error than the other methods considered. We also illustrate the approach on data from a study of genetic associations with "super senior" cases and middle-aged controls.
Discussion/conclusion: We have proposed a method for single rare variant analysis with binary phenotypes by logistic regression penalized by log-F priors. Our method has the advantage of being easily extended to correct for confounding due to population structure and genetic relatedness through a data augmentation approach.
Imprinting is a critical part of normal embryonic development in mammals, controlled by defined parent-of-origin (PofO) differentially methylated regions (DMRs) known as imprinting control regions. Direct nanopore sequencing of DNA provides a means to detect allelic methylation and to overcome the drawbacks of methylation array and short-read technologies. Here, we used publicly available nanopore sequencing data for 12 standard B-lymphocyte cell lines to acquire the genome-wide mapping of imprinted intervals in humans. Using the sequencing data, we were able to phase 95% of the human methylome and detect 94% of the previously well-characterized, imprinted DMRs. In addition, we found 42 novel imprinted DMRs (16 germline and 26 somatic), which were confirmed using whole-genome bisulfite sequencing (WGBS) data. Analysis of WGBS data in mouse (Mus musculus), rhesus monkey (Macaca mulatta), and chimpanzee (Pan troglodytes) suggested that 17 of these imprinted DMRs are conserved. Some of the novel imprinted intervals are within or close to imprinted genes without a known DMR. We also detected subtle parental methylation bias, spanning several kilobases at seven known imprinted clusters. At these blocks, hypermethylation occurs at the gene body of expressed allele(s) with mutually exclusive H3K36me3 and H3K27me3 allelic histone marks. These results expand upon our current knowledge of imprinting and the potential of nanopore sequencing to identify imprinting regions using only parent-offspring trios, as opposed to the large multi-generational pedigrees that have previously been required.
Spruces (Picea spp.) are coniferous trees widespread in boreal and mountainous forests of the northern hemisphere, with large economic significance and enormous contributions to global carbon sequestration. Spruces harbor very large genomes with high repetitiveness, hampering their comparative analysis. Here, we present and compare the genomes of four different North American spruces: the genome assemblies for Engelmann spruce (Picea engelmannii) and Sitka spruce (P. sitchensis) together with improved and more contiguous genome assemblies for white spruce (P. glauca) and for a naturally occurring introgress of these three species known as interior spruce (P. engelmannii × glauca × sitchensis). The genomes were structurally similar, and a large part of scaffolds could be anchored to a genetic map. The composition of the interior spruce genome indicated asymmetric contributions from the three ancestral genomes. Phylogenetic analysis of the nuclear and organelle genomes revealed a topology indicative of ancient reticulation. Different patterns of expansion of gene families among genomes were observed and related to presumed diversifying ecological adaptations. We identified rapidly evolving genes that harbored high rates of nonsynonymous polymorphisms relative to synonymous ones, indicative of positive selection and its hitchhiking effects. These gene sets were mostly distinct between the genomes of ecologically contrasted species, and signatures of convergent balancing selection were detected. Stress and stimulus response was identified as the most frequent function assigned to expanding gene families and rapidly evolving genes. These two aspects of genomic evolution were complementary in their contribution to divergent evolution of presumed adaptive nature.
These more contiguous spruce giga-genome sequences should strengthen our understanding of conifer genome structure and evolution, as their comparison offers clues into the genetic basis of adaptation and ecology of conifers at the genomic level. They will also provide tools to better monitor natural genetic diversity and improve the management of conifer forests.
Anti-CD19 chimeric antigen receptor (CAR)-T therapy for B cell malignancies has shown clinical success, but a major limitation is the logistical complexity and high cost of manufacturing autologous cell products. If engineered for improved safety, direct infusion of viral gene transfer vectors to initiate in vivo CAR-T transduction, expansion, and anti-tumor activity could provide an alternative, universal approach. To explore this approach, we administered approximately 20 million replication-incompetent vesicular stomatitis virus G protein (VSV-G) lentiviral particles carrying an anti-CD19CAR-2A-GFP transgene comprising either an FMC63 (human) or 1D3 (murine) anti-CD19 binding domain, or a GFP-only control transgene, to wild-type C57BL/6 mice by tail vein infusion. The dynamics of immune cell subsets isolated from peripheral blood were monitored at weekly intervals. We saw emergence of a persistent CAR-transduced CD3+ T cell population beginning at week 3-4 and reaching a maximum of 13.5% ± 0.58% (mean ± SD) and 7.8% ± 0.76% of the peripheral blood CD3+ T cell population in mice infused with 1D3-CAR or FMC63-CAR lentivector, respectively, followed in each case by a rapid decline in the B cell content of peripheral blood. Complete B cell aplasia was apparent by week 5 and was sustained until the end of the protocol (week 8). No significant CAR-positive populations were observed within other immune cell subsets or other tissues. These results indicate that direct intravenous infusion of conventional VSV-G-pseudotyped lentiviral particles carrying a CD19 CAR transgene can transduce T cells that then fully ablate endogenous B cells in wild-type mice.
Background: De novo genome assembly is essential to modern genomics studies. As it is not biased by a reference, it is also a useful method for studying genomes with high variation, such as cancer genomes. De novo short-read assemblers commonly use de Bruijn graphs, where nodes are sequences of equal length k, also known as k-mers. Edges in this graph are established between nodes that overlap by k - 1 bases, and nodes along unambiguous walks in the graph are subsequently merged. The selection of k is influenced by multiple factors, and optimizing this value results in a trade-off between graph connectivity and sequence contiguity. Ideally, multiple k sizes should be used, so lower values can provide good connectivity in lesser covered regions and higher values can increase contiguity in well-covered regions. However, current approaches that use multiple k values do not address the scalability issues inherent to the assembly of large genomes.
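As a small illustration of the graph described here, the sketch below builds a node-centric de Bruijn graph in which each node is a k-mer and an edge joins two k-mers whenever the last k - 1 bases of one equal the first k - 1 bases of the other (the standard de Bruijn overlap). The function name `build_de_bruijn` and the toy reads are hypothetical. Production assemblers use far more compact representations and also handle reverse complements and sequencing errors, which are ignored here.

```python
def build_de_bruijn(reads, k):
    """Return a dict mapping each k-mer (node) to the set of k-mers that
    follow it, i.e. those sharing a (k - 1)-base overlap with its suffix."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])

    graph = {kmer: set() for kmer in kmers}
    for kmer in kmers:
        suffix = kmer[1:]                 # last k - 1 bases
        for base in "ACGT":
            successor = suffix + base     # candidate node extending the suffix
            if successor in kmers:
                graph[kmer].add(successor)
    return graph

# Two toy reads sharing the sequence GTAC.
graph = build_de_bruijn(["ACGTAC", "GTACGG"], k=3)

# Nodes with more than one successor are branching points where an
# unambiguous walk ends.
branching = sorted(node for node, nxt in graph.items() if len(nxt) > 1)
print(branching)
```

With k = 3 the node ACG has two successors, so the walk branches there. A larger k that spans the repeated bases would remove that ambiguity at the cost of needing deeper coverage, which is the connectivity versus contiguity trade-off noted in the text.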
Results: Here we present RResolver, a scalable algorithm that takes a short-read de Bruijn graph assembly with a starting k as input and uses a k value closer to that of the read length to resolve repeats. RResolver builds a Bloom filter of sequencing reads which is used to evaluate the assembly graph path support at branching points and removes paths with insufficient support. RResolver runs efficiently, taking only 26 min on average for an ABySS human assembly with 48 threads and 60 GiB memory. Across all experiments, compared to a baseline assembly, RResolver improves scaffold contiguity (NGA50) by up to 15% and reduces misassemblies by up to 12%.
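The idea of checking path support against a Bloom filter of reads can be sketched as follows. This is an assumption-laden toy, not the RResolver implementation. The `BloomFilter` class, the SHA-256-based hashing, the filter size, the helper names, and the rule that every window of a candidate path must be present are all illustrative choices. The larger k is simply set to 6 for reads of length 8.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: membership tests may give false positives
    but never false negatives."""

    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def windows(seq, k):
    """All length-k substrings of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def is_supported(path_seq, read_filter, k):
    """A path counts as supported only if every large-k window along it was
    seen in the reads (hypothetical threshold; a real tool may be more lenient)."""
    return all(w in read_filter for w in windows(path_seq, k))

# Toy reads crossing a short repeat (ACGT) with different flanks.
reads = ["AAACGTTT", "CCACGTGG"]
large_k = 6
read_filter = BloomFilter()
for read in reads:
    for w in windows(read, large_k):
        read_filter.add(w)

# Four candidate paths through the repeat at a branching point.
candidates = {
    "in1->out1": "AAACGTTT",
    "in1->out2": "AAACGTGG",
    "in2->out1": "CCACGTTT",
    "in2->out2": "CCACGTGG",
}
supported = {name for name, seq in candidates.items()
             if is_supported(seq, read_filter, large_k)}
print(sorted(supported))
```

Only the two paths whose repeat-spanning windows occur in the reads survive, so the branching point is resolved. A Bloom filter keeps memory low because it stores only bits rather than the sequences themselves, at the price of occasional false positives.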
Conclusions: RResolver adds a missing component to scalable de Bruijn graph genome assembly. By improving the initial and fundamental graph traversal outcome, all downstream ABySS algorithms greatly benefit by working with a more accurate and less complex representation of the genome. The RResolver code is integrated into ABySS and is available at https://github.com/bcgsc/abyss/tree/master/RResolver .
Background: To support the implementation of high-throughput pipelines suitable for SARS-CoV-2 sequencing and analysis in a clinical laboratory, we developed an automated sample preparation and analysis workflow.
Methods: We used the established ARTIC protocol with ∼400 bp amplicons sequenced on Oxford Nanopore's MinION. Sequences were analyzed using Nextclade, assigning both a clade and quality score to each sample.
Results: 2,179 samples on twenty-five 96-well plates were sequenced. Plates of purified RNA were processed within 12 hours, sequencing required up to 24 hours and analysis of each pooled plate required one hour. The use of samples with known Ct values enabled normalization, acted as a QC check, and revealed a strong correlation between sample Ct values and successful analysis, with 85% of samples with Ct < 30 achieving a "Good" Nextclade score. Less abundant samples responded to enrichment, with the fraction of Ct > 30 samples achieving a "Good" classification rising by 60% after addition of a post-ARTIC PCR normalization. Serial dilutions of three variant of concern samples, diluted from Ct∼16 to Ct∼50, demonstrated successful sequencing to Ct 37. The sample set contained a median of 24 mutations per sample and a total of 1,281 unique mutations, with reduced sequence read coverage noted in some regions of some samples. A total of ten separate strains were observed in the sample set, including three variants of concern prevalent in British Columbia in the spring of 2021.
Conclusions: We demonstrated a robust automated sequencing pipeline that takes advantage of input Ct values to improve reliability.