Data Loading and Tidying

RNA-seq Data

Input file: /projects/bgrande/experiments/2016-08-10_dlbcl_coo_classification/data/expr.tsv

How it was generated: The RNA-seq data was aligned using TopHat2 and the reads were counted for each gene using htseq featureCount. Only protein-coding CDS regions were used as features. The expression counts from individual samples were consolidated into this file by running the following command on Thanos.

# On thanos, working directory: /home/rmorin/work/expression/htseq_rerun/
files=(DLC*.htseq)
(echo gene ${files%.*} | tr " " "\t"; paste ${files} | cut -f1,$(echo {2..$(expr 2 \* ${#files[@]})..2} | tr " " ",")) > expr.tsv

The RNA-seq expression data is loaded as a data frame and converted into a matrix. No further processing is done.

expr <- 
  readr::read_tsv(expr_path, progress = FALSE) %>% 
  dplyr::filter(!grepl("^__", gene)) %>% 
  as.data.frame() %>% 
  tibble::column_to_rownames("gene") %>% 
  as.matrix()

patients <- list()
patients$rnaseq <- colnames(expr)

gene_ids <- list()
gene_ids$rnaseq <- rownames(expr)

testthat::expect_false(any(is.na(expr)))
testthat::expect_false(any(duplicated(patients$rnaseq)))
testthat::expect_false(any(duplicated(gene_ids$rnaseq)))

Lymph2Cx COO Status

Input file: /projects/bgrande/experiments/2016-08-10_dlbcl_coo_classification/data/coo.tsv

How it was generated: This file contains the COO classification from the NanoString Lymph2Cx assay for the entire DLC cohort. It was copied from Thanos: /home/rmorin/work/expression/coo_fix.txt.

For the NanoString COO classification, I’m annotating the COO codes (1, 2 or 3) with meaningful names. I’m also creating a fourth class called “NA” for any samples that failed the assay (i.e. missing data).

coo <-
  readr::read_tsv(coo_path, col_names = c("patient", "code"), skip = 1) %>%
  dplyr::left_join(coo_codes, by = "code") %>%
  dplyr::select(-code) %>%
  tidyr::replace_na(list(coo_ns = "NA")) %>%
  dplyr::mutate(rowname = patient) %>%
  as.data.frame() %>%
  tibble::column_to_rownames("rowname")

patients$coo <- coo$patient

testthat::expect_false(any(is.na(coo)))

Normal Content

Input file: /projects/bgrande/experiments/2016-08-10_dlbcl_coo_classification/data/normal_content.tsv

How it was generated: This file contains the normal tissue content estimates by OncoSNP.

# On gphost01, working directory: /genesis/extscratch/shahlab/ssaberi/ONCOSNP/stromal-intra
awk 'BEGIN {FS=OFS="\t"; print "patient", "nc"} $8 == 2 {print FILENAME,$2}' *.qc | sed 's/\.qc//' > /projects/bgrande/experiments/2016-08-10_dlbcl_coo_classification/data/normal_content.tsv

nc <- readr::read_tsv(nc_path)

There are 9 patients in the Lymph2Cx dataset that are not represented in this dataset. I’m removing any patients that are not in the Lymph2Cx dataset.

nc <- dplyr::filter(nc, patient %in% patients$coo)

Mutation Data

SNVs and indels

Input file: /projects/bgrande/experiments/2016-08-10_dlbcl_coo_classification/data/snvs_and_indels.maf

How it was generated: This file contains mutation data from our lab’s lymphoma gene hybridization capture and sequencing experiment in MAF format. It was copied from the following path on the Brainiac NAS:

morinlab/projects/2016_dlbcl_dlc_lymphoma_gene_pool/analysis/strelka_pipeline/10-post_process/clean.passed.all.filt.maf

muts <- 
  readr::read_tsv(snvs_and_indels_path, progress = FALSE) %>%
  dplyr::filter(!is.na(HGVSp_Short),
                HGVSp_Short != "p.=") %>%
  dplyr::mutate(patient = sub("_", "", Tumor_Sample_Barcode),
                change = sub("^p[.]", "", HGVSp_Short),
                type = get_type(Consequence),
                codon = get_codon(type, change)) %>%
  dplyr::select(patient, gene = Hugo_Symbol, change, type, codon)

patients$muts <- unique(muts$patient)

testthat::expect_false(any(is.na(muts)))
testthat::expect_length(setdiff(patients$muts, patients$coo), 0)

All Mutations

Input file: /projects/bgrande/experiments/2016-08-10_dlbcl_coo_classification/data/mutations.tsv

How it was generated: This file contains mutation calls from a number of sources. Note that certain mutation types are restricted to those that affect genes of interest.

SNVs and indels: Strelka calls using targeted capture hybridization sequencing data
Structural variations: Manta calls using targeted sequencing data
Copy number variations: OncoSNP calls using SNP6.0 data

It was copied from the following path on the Brainiac NAS:

morinlab/projects/2016_dlbcl_dlc_lymphoma_gene_pool/analysis/strelka_pipeline/13-post_process/all_muts.passed.oncoprinter.tsv

oncoprint_cols <- c("patient", "gene", "change", "type")

muts_raw <- suppressWarnings(
  readr::read_tsv(muts_path, skip = 1, col_names = oncoprint_cols, progress = FALSE)) 

muts <- 
  muts_raw %>%
  dplyr::mutate(patient = sub("_", "", patient)) %>% 
  dplyr::filter(!is.na(gene),
                patient %in% patients$rnaseq) %>%
  dplyr::mutate(type = tolower(type),
                codon = get_codon(type, change)) %>%
  dplyr::filter(!(change == "GAIN"))

patients$muts <- unique(muts$patient)

testthat::expect_false(any(is.na(muts)))

N.B. There are 3 patients represented in the RNA-seq data but not in the mutation data.

Gene lists

Wright et al. genes

Input files: /projects/bgrande/experiments/2016-08-10_dlbcl_coo_classification/reference/wright_genes.txt

How it was generated: This file contains a list of Ensembl gene IDs reported by Wright et al. to define COO status. It was compiled by Ryan.

gene_ids$wright <- 
  readr::read_lines(wright_gene_ids_path) %>% 
  intersect(gene_ids$rnaseq)

# Ensure no NAs
testthat::expect_false(any(is.na(gene_ids$wright)))

N.B. There are 156 genes in the Wright gene list.

Lymph2Cx genes

Input file: /projects/bgrande/experiments/2016-08-10_dlbcl_coo_classification/reference/lymph2cx_genes.txt

How it was generated: This files contains the Ensembl gene IDs and symbols for the 20 genes included in the Lymph2Cx NanoString assay. It was manually generated by obtaining the list of gene symbols from Scott et al. (2014) and obtaining the Ensembl gene IDs from the GRCh37 Ensembl BioMart (release 75).

gene_ids$lymph2cx <- 
  readr::read_tsv(lymph2cx_gene_ids_path) %$%
  intersect(ens_id, gene_ids$rnaseq)

testthat::expect_false(any(is.na(gene_ids$lymph2cx)))

N.B. There are 20 genes in the Lymph2Cx gene list.

NF-κB genes

Input file: /projects/bgrande/experiments/2016-08-10_dlbcl_coo_classification/reference/nfkb_genes.txt

How it was generated: Ask Ryan.

gene_ids$nfkb <- 
  readr::read_lines(nfkb_gene_ids_path) %>%
  intersect(gene_ids$rnaseq)

testthat::expect_false(any(is.na(gene_ids$nfkb)))

N.B. There are 115 genes in the NF-κB gene list.

Gene Symbols

To annotate the gene IDs in the RNA-seq dataset, I’m using the BioMart API to obtain HGNC gene symbols from Ensembl 75 (Feb. 2014), which is the last version updated for GRCh37. If no gene symbol was found, I fall back on the gene ID.

ensembl_75 <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL", 
                               host = "feb2014.archive.ensembl.org", 
                               dataset = "hsapiens_gene_ensembl")

attrs <- c("ensembl_gene_id", "external_gene_id")

genes_df <- 
  biomaRt::getBM(mart = ensembl_75, attributes = attrs) %>%
  dplyr::rename(gene_id = ensembl_gene_id, gene = external_gene_id) %>%
  dplyr::slice(match(gene_ids$rnaseq, gene_id)) %$% {
    dups <- unique(gene[duplicated(gene)])
    dplyr::mutate(., gene = ifelse(gene %in% dups, paste(gene, gene_id, sep = "_"), gene))
  } %>%
  dplyr::mutate(wright = gene_id %in% gene_ids$wright,
                lymph2cx = gene_id %in% gene_ids$lymph2cx)

genes <- genes_df$gene

testthat::expect_false(any(is.na(genes)))
testthat::expect_false(any(duplicated(genes)))
testthat::expect_length(setdiff(gene_ids$rnaseq, genes_df$gene_id), 0)
# testthat::expect_length(setdiff(gene_ids$wright, genes_df$gene_id), 0)
testthat::expect_length(setdiff(gene_ids$lymph2cx, genes_df$gene_id), 0)

There are currently 0 Wright genes that are not represented in the RNA-seq dataset. I believe this is because the Wright genes were taken from a later version of Ensembl. I’ll either determine their older equivalent IDs or I’ll update the RNA-seq data to use more recent gene models.