Junction Alignments to Genome for RNA-seq Reads

 

Documentation

JAGuaR is an alignment protocol for paired end RNA-seq reads based on an extended reference.  It uses BWA to align reads to the genome and reference transcript models (including annotated exon-exon junctions) specifically allowing the possibility for a single read to span multiple exons. Reads aligned to the special reference repository are then “repositioned” on to genomic coordinates, transforming reads spanning multiple exons into large-gapped alignments.

Availability: JAGuaR is implemented in Python, and is freely available for download at

http://www.bcgsc.ca/platform/bioinfo/software/jaguar

JAGuaR creates a genome + junctions reference to be used for different read length alignments.  JAGuaR has been tested with alignments using BWA (http://bio-bwa.sourceforge.net/).  After reads are aligned to this reference with BWA, convertJunctions converts the junction aligned reads to genomic coordinates.

 

A)  Pipeline

 

1.       Create tab delimited transcript annotation file in format which is used to create the junction reference where exon start and end is 1-based inclusive:

transcript_id chr exon_start exon_end exon_rank strand

 

This format can be easily generated using BioMart by choosing ‘structures’ and then clicking the appropriate categories in the  GENE and EXON section in the order you want them to appear from  left to right.  http://www.biomart.org/biomart/martview/

 

See Section C for an example of a formatted annotation file.

 

 

2.       Run createJunctions

 

Creates the exon-exon junction sequences for each chromosome and an index with the stub chrX_JN.  Concatenates the files created in step2, with the reference genome fasta (i.e. hg19.fa) file to create the genome + junction reference sequence (i.e ref.fa) for alignment.

 

This has to be run once for each read size such as 50bp, 75bp, 100bp to create ref.fa.

 

   python createJunctions.py -c junction.cfg

The reference file containing exon-exon junction sequence needs to be indexed before alignment.

   bwa index -a bwtsw ref.fa

 

 See Section B for an example configuration file.

  

 Note:

The chromosome naming used in the transcript annotation file should match the naming used in the genome fasta.  If the annotation includes chromosomes not included in the genomic fasta, a message of the form "INFO Chromosome <$chrName> is not included in the genomic fasta" will appear in the createJunctions.log file.  In addition the version of BWA used to index the reference should be the same version used for the initial alignment.

 

 

3.       Alignment

Alignments of reads to the JAGuaR reference has only been tested with BWA in the following pipeline:

 

bwa index -a bwtsw ref.fa       [only needs to be run once/genome]

 

bwa aln ref.fa read1.fastq > read1.sai

bwa aln ref.fa read2.fastq > read2.sai

 

bwa sampe -s ref.fa read1.sai read2.sai read1.fastq read2.fastq > aln.sam

 

JAGuaR has been tested with the BWA-MEM algorithm and while some development is required for JAGuaR to make full use of the BWA-MEM algorithm, running in its current version gives similar results for BWA but overall runtime is much faster.

 

bwa-0.7.4/bwa mem -t 8 ref.fa read1.fastq read2.fastq > aln.sam

 

For runs with BWA-MEM only, a post alignment script provided in the JAGuaR package needs to be run before convertJunctions.  This chooses the best read when a read is multi-mapped or split.

 

python samFilterSelectReads.py aln.sam > aln.filtered.sam

 

 

4.       Run convertJunctions  to create a repositioned sam file that is sorted by read name:

python convertJunctions.py --samtools samtools –f aln.sam –o myaln -m 34622 -i ref100/ -s

To create a repositioned bam file sorted by coordinate and indexed:

 

python convertJunctions.py --samtools samtools –f aln.sam –o myaln -m 34622 -i ref100/ -s -b -g GRCh37-lite.fa

5.       Use samtools to create bam file (running convertJunctions with -b  and -g does this step automatically):

 

samtools faidx hg19.fa [index reference genome if not already done so]

 

samtools view –Sb -T hg19.fa myaln.repos.sam > myaln.repos.bam

 

samtools sort myaln.repos.bam myaln.repos.sorted

 

samtools index myaln.sorted.repos.bam [not needed for SNP calling but for IGV]

 

B)  CreateJunctions  configuration file example

Junction.cfg

[GENOME_REF]

species=homo_sapiens

genome=hg19a

genome_fasta=GRCh37-lite.fa

transcript_file=ens61Genes.txt

 

[RUN_TIME]

out_dir=./results

seq_length=100

 

[HEADER_ANNOTATION]

SQ_header_AS=NCBI-Build-37

SQ_header_UR=GRCh37-lite.fa

SQ_header_SP=Homo sapiens


This configuration file is required by createJunctions to setup the reference that the RNA-seq reads are aligned to.  Under GENOME_REF, information on the species and genome name is required along with the name and path of the genome reference  to be used and the transcriptome annotation file.  Under RUN_TIME, list the output directory that will hold the reference and indicate the size of the reads that will be used.  Header information is listed under HEADER_ANNOTATION.

C)  Formatted transcriptome annotation example

A tab delimited transcript text file (referred to above as transcript_file) is required to create the junction reference. Its format currently must be one line per exon for all transcripts in the form of Transcript ID, Chromosome Name (with or without "chr"), Exon Start Coordinate, Exon End Coordinate, Exon Rank in Transcript, Strand) as per below.   This can easily be created at http://www.ensembl.org/biomart/ by selecting "Ensembl Genes", the genome dataset of interest, "Attributes" and then "Structures" .  Then choose in this order: Ensembl Transcript ID, Chromosome Name, Exon Chr Start (bp), Exon Chr End (bp),  Exon Rank in Transcript, Strand.  Export all results to a TSV (tab separated values) file.  When read, the first header line will be ignored.

Ensembl Transcript ID   Chromosome Name Exon Chr Start (bp)     Exon Chr End (bp)       Exon Rank in Transcript    Strand
ENST00000416909 13      96185952        96186164        1       -1

ENST00000416909 13      96180746        96180841        2       -1

ENST00000416909 13      96149313        96149379        3       -1

ENST00000416909 13      96131698        96132216        4       -1

ENST00000439928 13      24553949        24554063        1       1

ENST00000439928 13      24557866        24558072        2       1

ENST00000439928 13      24591806        24591840        3       1

ENST00000439928 13      24607652        24607908        4       1

ENST00000439928 13      24608060        24609166        5       1


D)  ConvertJunctions input parameters

Input: SAM or BAM file of paired end tag (PET) reads aligned to Jaguar reference/junctions with BWA (sampe -s) and sorted by name

Output: SAM file with genomic coordinates with appropriate reads split by exon

 

Usage: convertJunctions.py [options]

 

Options:

  -h, --help           show this help message and exit

  -f file              PET sorted .sam or .bam (f)ile. REQUIRED

  -o o                 (O)utput stub for repositioned file incl. directory,

                       otherwise output in cwd; i.e., [Output tag].repos.sam.

                       REQUIRED

  -i index             (I)ndex folder. REQUIRED

  -m max_separation    (M)ax separation. (M)ax separation

                       recommendations: 34622 is used for hg19a, 23545 is used

                       for hg18 REQUIRED

  -g genome            (G)enome reference fasta file with path. Required if

                       using option -b

  --samtools=SAM_EXEC  SAMTOOLS executable

  -b                   create (B)am after sam conversion

  -c check             chromosome index check, default=1

  -d                   print debugging messages to screen

  -s                   create cigar.txt (CIGAR errors) and .checks (FLAG

                       errors) files

  -x chromosome        filter by comma delimited list of chromosomes, default

                       all chromosomes listed in genome length file

  -l l                 Number (L)ines/reads to reposition from beginning of -f

                       file

  -r r                 pull out read(s) and create reads.sam

  -n                   NO to repositioning, assumes a repositioned file

                       already exists

  -e                   Create bed file of exon-exon junction covered read

                       Coordinates

  --do_not_check_mito_naming

                        if not specified, mitochondrial chromosome M (chrM)

                        will be renamed MT (chrMT)

  --do_not_drop_chr_prefix

                        If not specified, the 'c-h-r' prefix of any chromosome

                        name will be dropped (chr8 becomes 8)

 

E)  Example Output after Repositioning

JAGuaR Summary

-----------------------------------------------------------------------------

Start:  Wed Mar  6 12:57:48 2014

Finish: Wed Mar  6 18:55:50 2014

File analyzed: testrun.bam

Output file with repositioned reads: testrun.repos.sam

Reference Index Folder: /ref/

Repositioning time: 357.450638203 minutes

Max Separation: 34622

 

Total number of reads: 246197358

Total number of reads aligned to junctions: 38736275

 

Junction reads already unmapped in first alignment: 2072703

Junction reads set unmapped after repositioning: 3304

Junction reads mapped and repositioned with large gaps: 36432882

Junction reads repositioned without large gaps: 227940

Number of times one read covers 1, 2, 3 or more junctions:

    1 33254992

    2 3124861

    3 52116

    4 913

Total number of times an exon-exon junction is covered: 39664714

Total number of exon-exon junctions covered by at least one read: 209518

--

 

 

 

F)  Definition of TLEN implemented in JAGuaR

 

This program conforms to the Picard spec for TLEN (insert size).

TLEN = distance between 5’ ends

TLEN = my mate's start - my start

If a read is +, its start is its POS. If a read is - (ie. it has the 16 bit), its start is POS + the genomic distance the read covers.

For example, if both reads in a pair are -, the TLEN will be calculated as the distance between the rightmost base of read1 and the rightmost base of read2.

If the reads are a normal forward-reverse proper pair, the TLEN will be calculated as the distance between the leftmost base of read1 and the rightmost base of read2

 

 

 

 

 

 

G)  Example Set

 

A small sample set is available for download and testing:

http://www.bcgsc.ca/downloads/JAGuaR/example

 

PET fastqs:   seq1.fastq.gz, seq2.fastq.gz

Configuration file:  junction.cfg

Transcript annotation file:  GRCh37p13.txt

Genome reference file:  GRCh37-lite.fa.gz

Final bam after JAGuaR run:  bwa_mem_aln.bam

 

1) Download and unzip fastqs and genome fasta file. 

2) Download junction.cfg configuration file and example transcript annotation file.

2) Make sure junction.cfg points to correct path of genome reference and transcript file.

3) Run createJunctions and index the resulting ref.fa reference as per step 2 in "Pipeline".

4) Align PET fastqs with BWA (or BWA-MEM) to ref.ca (generated above) as per step 3 in "Pipeline".

5) Run convertJunctions as per step 4 in "Pipeline".

6) Resulting file should be similar to bwa_mem_aln.bam (also available for download).

 

 

Comments, questions and concerns can be sent to:

ybutterf@bcgsc.ca

 

May, 2014