Novel miRNA detection
Identification of novel miRNA genes
Current release
Novel miRNA detection 1.0
Released May 25, 2009
First release of code to identify novel miRNAs.
Project Description
Novel miRNA pipeline as used in Morin et al., "Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells", Genome Research, April 2008.
Author: Ryan D. Morin
The packages listed below are required by the pipeline; please install and test all of them before running the code. Then read the pipeline summary and run the example.
Package requirements:
Package | URL
Perl 5.8 (tested on ActiveState Perl) | http://www.perl.org/
DBD-mysql * | http://search.cpan.org/dist/DBD-mysql/lib/Bundle/DBD/mysql.pm
BioPerl | http://www.bioperl.org/wiki/Main_Page
Ensembl API | http://www.ensembl.org/info/docs/api/api_installation.html
Vienna package (RNALfold) | http://www.tbi.univie.ac.at/RNA/
R | http://www.r-project.org/
R package ‘e1071’ | http://www.potschi.de/svmtut/svmtut.html
* Successful installation of DBD-mysql requires an installed version of MySQL.
Pipeline Summary:
There are 3 scripts to run:
find_N_fold_seeds.pl
summarize_structures.pl
new_svm_for_microRNAs.R
Before running these, locations for various tools have to be configured:
profile.sh ensures that you use the correct perl. Add the path to your perl to PATH so that you are using a perl with the DBI module and the Ensembl adaptor.
Please remember to source the file before running the scripts:
source profile.sh
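After sourcing profile.sh, it can help to confirm that the perl on your PATH really sees the required modules. The sketch below is one way to do that. The function name check_perl_module and the PERL override variable are illustrative and are not part of the pipeline.

```sh
#!/bin/sh
# Illustrative helper (not shipped with the pipeline): report whether a perl
# interpreter can load a given module. Set PERL to test a specific perl;
# by default the first perl found on PATH is used.
check_perl_module() {
    if "${PERL:-perl}" -M"$1" -e 1 2>/dev/null; then
        echo "ok: $1"
    else
        echo "MISSING: $1"
        return 1
    fi
}
```

For example, `check_perl_module DBI` and `check_perl_module Bio::EnsEMBL::Registry` should both print "ok" once the DBI module and the Ensembl API are installed and visible to perl.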
detection_params.cfg specifies the locations of the RNALfold and RNAfold binaries, and the Ensembl connection parameters if you have a local Ensembl mirror.
If the Ensembl connection parameters are not set, the script will default to ensembldb.ensembl.org using the anonymous account.
Scripts:
The first step in detection is to run find_N_fold_seeds.pl.
It takes 2 inputs:
1) a GFF file with the locations of suspected miRNAs (an example is provided).
2) the name of an Ensembl adaptor database (e.g. Human or Mouse). The Ensembl tutorial at http://www.ensembl.org/info/docs/api/core/core_tutorial.html provides an example script that lists all valid names that can be supplied.
The script writes its results to STDOUT, so redirect them to a file:
find_N_fold_seeds.pl file.gff Human > human.out
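Before spending time on folding, it may be worth checking that the input file is well-formed GFF. The sketch below assumes the standard layout of 9 tab-separated columns with start and end in columns 4 and 5. Which columns find_N_fold_seeds.pl actually reads is not documented here, and check_gff is a hypothetical helper, not part of the pipeline.

```sh
#!/bin/sh
# Hypothetical pre-flight check: every data line must have 9 tab-separated
# columns, with integer start (column 4) not greater than end (column 5).
# Prints one message per bad line and returns non-zero if any are found.
check_gff() {
    awk -F '\t' '
        /^#/ || NF == 0 { next }    # skip comment and blank lines
        NF != 9 {
            printf "line %d: expected 9 columns, found %d\n", NR, NF
            bad = 1
            next
        }
        $4 !~ /^[0-9]+$/ || $5 !~ /^[0-9]+$/ || $4 + 0 > $5 + 0 {
            printf "line %d: bad start/end (%s, %s)\n", NR, $4, $5
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

A typical use is `check_gff file.gff && find_N_fold_seeds.pl file.gff Human > human.out`, so that the first step only runs on a file that passes the check.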
Next, run the summarize_structures.pl script on human.out. It takes 2 inputs: the file and a "class" string.
For novel miRNA detection, the class will always be "unknown".
To generate a file for training the SVM, create a .gff of known miRNA locations, run it through find_N_fold_seeds.pl, then call this script with the output and the class "miRNA". Do the same for a .gff of known non-miRNA locations, using the class "not_miRNA".
The output of this script must be redirected to a file named to_classify.dat:
summarize_structures.pl human.out unknown > to_classify.dat
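The two-step labelling described above (fold, then summarize with a class string) is the same for candidates, known miRNAs, and known non-miRNAs, so it can be wrapped in one function. This is only a sketch: label_structures and the FIND_SEEDS and SUMMARIZE variables are illustrative names, and the input and output file names in the usage lines are placeholders.

```sh
#!/bin/sh
# Script locations; override these if the scripts are not on PATH.
FIND_SEEDS=${FIND_SEEDS:-find_N_fold_seeds.pl}
SUMMARIZE=${SUMMARIZE:-summarize_structures.pl}

# Usage: label_structures <gff> <species> <class> > output.dat
# Runs the folding step into a temporary file, then summarizes it with the
# given class string ("unknown", "miRNA" or "not_miRNA") on STDOUT.
label_structures() {
    tmp=$(mktemp) || return 1
    "$FIND_SEEDS" "$1" "$2" > "$tmp" &&
        "$SUMMARIZE" "$tmp" "$3"
    status=$?
    rm -f "$tmp"
    return $status
}
```

Training rows could then be produced with `label_structures known_mirnas.gff Human miRNA > known.dat` and `label_structures non_mirnas.gff Human not_miRNA > negatives.dat`, and candidates with `label_structures file.gff Human unknown > to_classify.dat`.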
Finally, run the R script, which reads to_classify.dat from the working directory:
R CMD BATCH new_svm_for_microRNAs.R
The R script uses the trained SVM stored in new_svm.Rdata.
The output of the R script is written to predicted_classes.out and contains a flag, "microRNA" or "not_microRNA", for each structure in your .nice file. These flags correspond to the original list of candidates in the BED/GFF file.
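A quick tally of the two flags gives a first impression of the predictions. The sketch below assumes that each line of predicted_classes.out carries its flag as a separate whitespace-delimited field, possibly wrapped in double quotes as R often writes strings. The exact file layout is not documented here, and count_predictions is a hypothetical helper.

```sh
#!/bin/sh
# Hypothetical helper: count lines flagged "microRNA" and "not_microRNA".
# Fields are compared exactly, so "not_microRNA" is never counted as
# "microRNA". Reads predicted_classes.out unless another file is given.
count_predictions() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                f = $i
                gsub(/"/, "", f)    # drop quotes that R may add
                if (f == "microRNA")     { pos++; break }
                if (f == "not_microRNA") { neg++; break }
            }
        }
        END { printf "microRNA\t%d\nnot_microRNA\t%d\n", pos, neg }
    ' "${1:-predicted_classes.out}"
}
```

Running `count_predictions` in the working directory after the R step prints one tab-separated count per flag.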
Since these steps require no additional parameters, the steps can be wrapped up
in a shell script:
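#!/bin/sh
# $1: GFF file of candidate locations; $2: Ensembl species name (e.g. Human)
find_N_fold_gff=$1
find_N_fold_species=$2
find_N_fold_seeds.pl "$find_N_fold_gff" "$find_N_fold_species" > temp.txt
summarize_structures.pl temp.txt unknown > to_classify.dat
/path/to/R CMD BATCH new_svm_for_microRNAs.R
# Remove the intermediate files; the results are in predicted_classes.out
rm -f temp.txt
rm -f to_classify.dat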
Example use of code:
1) find_N_fold_seeds.pl mirbase_13.gff Human > find_N_fold_ensembl.out
2) summarize_structures.pl find_N_fold_ensembl.out unknown > to_classify.dat
3) ~/R/R CMD BATCH new_svm_for_microRNAs.R
For help with setup/installation, please contact Yaron Butterfield or Andy Chu.
