cnaseq
cnaseq is a matlab program for inferring copy number segments from short sequence read alignments. The input will contain alignment coordinates for each read, and the output will have segmented the genome into regions of constant copy number state.
Project Description
Setup
This code has been tested with MATLAB R2007a, and R2008b. It does require the use of a MATLAB Statistics Toolbox.
This code should run after unpacking the files. When running, the code will automatically generate the required path, and attempt to compile fwd_back.c. If you cannot get the compile step to work, simply comment out line 46-50 in cnaseq.m.
How to run
cnaseq will compare data from two genomes. The usual scenario is to compare sequence reads from both a tumour sample, and matched normal. However using a software like MAQ (http://maq.sourceforge.net/), you can generate simulated reads from a downloaded reference.
cnaseq will run with the following command:
cnaseq(datadirB,datadirS,winSize,outfileRoot)
Where datadirB contains the read alignment locations from the background, or "reference" sample, datadirS contains the read alignment locations from the sample, ie. "Tumour".
winSize is the number of consective reference read alignments used to define a bin. Typically, a winSize of 200 is sufficient.
outfileRoot is the string that will be the prefix of the output files. The output files will be in datadirS.
Input directories:
Each directory is expected to have a single file per chromosome named: 1.txt 2.txt . . . Y.txt
Where each line in each file contains the alignment position (in bases). It can reduce the memory footprint of the matlab code to pre-sort the alignment locations in each file. An included shell script (split_chrs.sh) might help.
Output
3 main output files will be created in the sample directory:
- _rawcn.txt - the raw copy number and coordinates for each bin
- _segs.txt - the segment coordinates and the copy number state. This is the main output file.
- _cna.txt - Lists the raw copy number along with genomic coordinates of bins and the final copy number state. This file is useful for making plots with plotCopyNumberFromHmmOutput.
The five states are:
- loss,
- neutral
- low-level amplification
- med-level amplification
- high-level amplification
