Data Analysis | Genome Sciences Centre

What kind of bioinformatics analysis do you provide?

Please see the Bioinformatic Services page for a full list of the standard analyses we provide.

If you have a custom analysis in mind that is not listed, please contact us directly.

How much pass filter data am I guaranteed?

We do not have minimum data guarantees, as the data yield depends too much on the sample supplied. We use our internal QC standards to ensure that the best possible data is generated for each sample.

What kind of bioinformatics QC do you do on my samples?

We employ a wide range of quality control metrics in our bioinformatics QC pipeline:
Assessment of technical contaminants such as adapters and sequencing reagents as well as biological contaminants such as bacteria or host species in xenograft samples.
We also look at library-type-specific metrics such as insert size and duplicate rates for whole genome libraries, ribosomal and mitochondrial content for RNAseq libraries, and capture efficiency for exome libraries.
If multiple samples are submitted from the same patient, we check for possible sample swaps by comparing the samples at positions of common single nucleotide polymorphisms.
Selected QC warnings are given if your libraries fall below our standard thresholds. A full list can be found here
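
To illustrate the idea behind the sample swap check, here is a minimal Python sketch. It is not our production pipeline: the function name, the position keys and the "A/G" genotype encoding are all assumptions made for the example. It simply reports the fraction of shared SNP positions at which two samples carry the same genotype. Samples from the same patient would be expected to agree at most positions.

```python
def genotype_concordance(calls_a, calls_b):
    """Fraction of shared SNP positions where two samples have the same genotype.

    calls_a and calls_b map a position key (for example "chr1:1000") to an
    unphased genotype string such as "A/G". This encoding is hypothetical.
    Positions missing from either sample are ignored.
    """
    shared = set(calls_a) & set(calls_b)
    if not shared:
        return None  # no overlapping positions, so no evidence either way
    matches = sum(
        1
        for pos in shared
        # Sort alleles so that "A/G" and "G/A" count as the same genotype.
        if sorted(calls_a[pos].split("/")) == sorted(calls_b[pos].split("/"))
    )
    return matches / len(shared)


# Made-up genotypes for two libraries said to come from the same patient.
sample_a = {"chr1:1000": "A/G", "chr2:2000": "C/C", "chr3:3000": "T/G"}
sample_b = {"chr1:1000": "G/A", "chr2:2000": "C/C", "chr3:3000": "T/T"}
concordance = genotype_concordance(sample_a, sample_b)
```

A real check would use many more positions and a threshold suited to the library type; none is suggested here.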

In what format do I receive my sequencing data?

Your data is provided in both fastq and BAM formats by default.

Alignment is included in the sequencing price for human samples.

For more information regarding the BAM file format, please see https://samtools.github.io/hts-specs/SAMv1.pdf

What software is used for alignment?

Alignment is performed using the Burrows-Wheeler Aligner (BWA) program. Novoalign is used for bisulphite sequence data. Additional alignments, using client-specified parameters or other aligners, may be available upon request at an additional cost. Please contact us for more information.

What reference genome is used for the alignment?

Our current default human genome reference version is hg38, although we also support hg19.

Please contact us directly for our default genome reference version for any other species.

What if I want my data aligned to a different reference? What if there is no reference for my data?

You can specify any valid reference version for us to use in your alignment when you submit your samples. If we do not have the reference installed in-house, installation will be charged on a cost recovery basis. If there is no existing public reference for your data, you can provide a custom fasta file, as long as the fasta file can be indexed and is compatible with our aligner. If no reference is provided or the custom reference is not correctly formatted, your BAM file will simply contain all reads as unaligned reads.
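
If you would like to sanity-check a custom fasta file before submitting it, the Python sketch below shows the kind of formatting problems that commonly prevent indexing. It is an illustration under stated assumptions, not the check we run: the function name is hypothetical, and the rules it applies (unique sequence names, and equal line lengths within a record except for the last line) are those expected by faidx-style indexers such as samtools faidx.

```python
def check_fasta_indexable(lines):
    """Return a list of formatting problems found in FASTA text lines."""
    problems = []
    seen = set()
    name = None         # name of the record currently being read
    width = None        # length of the first sequence line in the record
    short_seen = False  # True once a line shorter than `width` has appeared
    for n, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if not name:
                problems.append(f"line {n}: empty sequence name")
            elif name in seen:
                problems.append(f"line {n}: duplicate sequence name {name!r}")
            seen.add(name)
            width, short_seen = None, False
            continue
        if name is None:
            if line:
                problems.append(f"line {n}: sequence data before first header")
            continue
        if short_seen:
            # Only the last line of a record may be shorter than the others.
            problems.append(f"line {n}: sequence line follows a shorter line")
        elif width is None:
            width = len(line)
        elif len(line) > width:
            problems.append(f"line {n}: line longer than first line of record")
        elif len(line) < width:
            short_seen = True
    return problems


good = [">chr1 example", "ACGT", "ACGT", "AC", ">chr2", "GG"]
bad = [">chr1", "ACGT", "AC", "ACGT", ">chr1", "GG"]
good_problems = check_fasta_indexable(good)
bad_problems = check_fasta_indexable(bad)
```

To check a real file, pass the open file handle, for example check_fasta_indexable(open("my_reference.fa")), where the file name is a placeholder. An empty result does not guarantee compatibility with a particular aligner.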

What reads are included in the BAM file?

All of the raw data is included in the final BAM file, with reads failing the vendor quality checks flagged to allow the user to remove them if desired.

Data sequenced on the HiSeq X will not contain quality-failed reads, as the instrument does not output them. Unaligned reads are also included in the BAM file.
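
In the SAM/BAM specification, reads failing vendor quality checks carry the 0x200 bit in the FLAG field, and unaligned reads carry the 0x4 bit. The Python sketch below shows how those bits can be tested on SAM-format text lines; the function names and the example records are made up for illustration.

```python
QC_FAIL = 0x200   # SAM FLAG bit: not passing filters (vendor quality checks)
UNMAPPED = 0x4    # SAM FLAG bit: segment unmapped


def classify_flag(flag):
    """Report whether a SAM FLAG value marks a QC-failed or unmapped read."""
    return {"qc_fail": bool(flag & QC_FAIL), "unmapped": bool(flag & UNMAPPED)}


def keep_pass_filter(sam_lines):
    """Yield header lines and alignment lines without the QC-fail bit."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines pass through unchanged
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if not flag & QC_FAIL:
            yield line


# Made-up records: r1 is aligned, r2 is QC-failed and unmapped (512 + 4),
# r3 passed QC but is unmapped.
records = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "r2\t516\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
kept = list(keep_pass_filter(records))
```

With samtools, the same filter is usually expressed as samtools view -F 0x200, which excludes reads carrying that bit.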

Are pooled libraries automatically split by index?

Yes, data from pooled libraries will be supplied to you after splitting by index. Indices are sequenced on a separate read so your data will not contain any indices.

Do you trim the adapters from my sequence data?

We do not trim adapter sequences from our fastq or BAM files. Generally, aligners are able to handle adapter sequence at the end of reads by soft-clipping. The exceptions are bisulphite sequencing reads, which are hard-clipped in the alignment stage, and miRNA sequencing data, for which we do trim adapters due to the short length of the reads and the need for higher sequence specificity in our miRNA profiling pipeline (http://www.bcgsc.ca/platform/bioinfo/software/mirna-profiling). BAM files for both of these library types will not contain adapter sequence.
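
Soft-clipped and hard-clipped bases are recorded in the CIGAR string of each alignment as the S and H operations respectively. As a minimal sketch (the function name is hypothetical), the Python below counts clipped bases in a CIGAR string. Note that clipping can have causes other than adapter sequence, so the counts are only an upper bound on adapter content.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clipped_bases(cigar):
    """Count soft-clipped (S) and hard-clipped (H) bases in a CIGAR string."""
    if cigar == "*":  # CIGAR unavailable, e.g. for an unaligned read
        return {"soft": 0, "hard": 0}
    ops = CIGAR_OP.findall(cigar)
    soft = sum(int(length) for length, op in ops if op == "S")
    hard = sum(int(length) for length, op in ops if op == "H")
    return {"soft": soft, "hard": hard}


soft_example = clipped_bases("5S90M5S")   # clipped bases remain in the read sequence
hard_example = clipped_bases("90M10H")    # clipped bases are removed from the record
```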

How do I access my data?

All collaborators will receive an email informing them that their data is available for download from our SFTP site. The email is a receipt, identifying which data has recently been made available in addition to the previously uploaded data sets from the same project. This allows the collaborator to track sequence data as it is generated.

Data will be automatically deleted from the download site after two weeks. If you are unable to download your data within two weeks, please contact us to reupload your data. If at that time your data is still available for upload, there may be an additional cost for the re-posting.

By default, the notification email will be sent to the principal investigator (PI) listed on the sample submission and to the submitter of the samples. Additional email recipients can be specified during submission. Once the notification email has been sent, a separate email with login and password details for the SFTP site will be sent to the PI. To protect the privacy of your data, subsequent amendments to the recipient list and creation of additional SFTP accounts will require approval from the PI.

If you do not have an SFTP client on your computer, you will need to download and install one before you are able to access your data. Please visit our webpage for a list of some recommended clients that can be downloaded for free, along with links to installation instructions: http://www.bcgsc.ca/services/solseq/sequencing-data-access-using-sftp

How do I match my sample names with the sample names on my data files?

With every data upload, we provide a gsc_library.summary file, which can be found in the SFTP folder containing your data. This file provides a mapping between our internal library names and the sample names that you provided on your sample submission form. If you have any problems with your data, please contact data_support@bcgsc.ca.

Do you have suggested tools for viewing my BAM files?

Some popular toolsets for working with and viewing sequence data are:

IGV (https://www.broadinstitute.org/igv/)
Picard (http://broadinstitute.github.io/picard/)
SAMtools (http://samtools.sourceforge.net/)

For any questions related to the analysis of your data or about particular analysis software, the SEQanswers (http://seqanswers.com), Biostars (https://www.biostars.org/) and Canadian Bioinformatics Helpdesk (https://bioinformatics.computecanada.ca/) forums are useful resources.

What is your data retention policy?

Sequence data is stored for a minimum of 45 days, and may be deleted after that time without notice.

I need to make my data publicly available for a publication. Can the GSC host my sequencing data or submit my data for me?

As much as the GSC would like to help our collaborators get their data published, we do not have the ability to host data publicly here. However, we have experience submitting to most public repositories, including SRA, dbGaP, CGHub and EGA, and are happy to answer any questions you may have.

We are also able to provide submission support on a cost recovery basis.