Novel miRNA detection

Identification of novel miRNA genes

Current release
Novel miRNA detection 1.0

Released May 25, 2009

First release of code to identify novel miRNAs.


Project Description

Novel miRNA pipeline as used in Morin et al., “Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells,” Genome Research, April 2008.

Author:  Ryan D. Morin

 

The pipeline requires the packages listed below. Install and test each of them before running the code, then read the pipeline summary and run the example.

Package requirements:

Perl 5.8 (tested on ActiveState Perl)

http://www.perl.org/
http://www.activestate.com/activeperl/

DBD-mysql *

http://search.cpan.org/dist/DBD-mysql/lib/Bundle/DBD/mysql.pm

BioPerl

http://www.bioperl.org/wiki/Main_Page

Ensembl API

http://www.ensembl.org/info/docs/api/api_installation.html

Vienna package (RNALfold)

http://www.tbi.univie.ac.at/RNA/

R

http://www.r-project.org/

R package ‘e1071’

http://www.potschi.de/svmtut/svmtut.html

* DBD-mysql can only be installed successfully if MySQL is already installed.

 

Pipeline Summary:

There are 3 scripts to run:

find_N_fold_seeds.pl
summarize_structures.pl
new_svm_for_microRNAs.R

Before running these, configure the locations of the tools they depend on:
profile.sh ensures that the correct perl is used. Add the directory containing your perl to PATH so that the scripts run with a perl that has the DBI module and the Ensembl API installed, then source the file:  source profile.sh
detection_params.cfg specifies the locations of the RNALfold and RNAfold binaries, and the Ensembl connection parameters if you have a local Ensembl mirror.  If the Ensembl connection parameters are not set, the script defaults to ensembldb.ensembl.org using the anonymous account.
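
The configuration above can be sanity-checked before the first run. The following is a minimal sketch, not part of the distributed code: the function names are hypothetical, and it assumes that perl, R, RNALfold and RNAfold are all reachable through PATH (the pipeline itself locates the two Vienna binaries through detection_params.cfg, so a "MISSING" report for them is not necessarily an error). The Perl module names DBI, DBD::mysql and Bio::EnsEMBL::Registry are the standard names for the packages listed in the requirements.

```shell
#!/bin/sh
# Preflight sketch: report whether the required tools and Perl modules are found.

check_tool() {
    # Succeeds if "$1" resolves to a command on PATH.
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "MISSING: $1"
        return 1
    fi
}

check_perl_module() {
    # Succeeds if the perl found on PATH can load module "$1".
    if perl -M"$1" -e 1 >/dev/null 2>&1; then
        echo "ok: perl module $1"
    else
        echo "MISSING: perl module $1"
        return 1
    fi
}

preflight() {
    # Runs every check and returns non-zero if any of them failed.
    status=0
    for tool in perl R RNALfold RNAfold; do
        check_tool "$tool" || status=1
    done
    for module in DBI DBD::mysql Bio::EnsEMBL::Registry; do
        check_perl_module "$module" || status=1
    done
    return $status
}
```

Calling preflight in the same shell after "source profile.sh" shows which perl and which modules the pipeline scripts would pick up.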

Scripts:
The first detection step is to run find_N_fold_seeds.pl. It takes 2 inputs:
1) a gff file with the locations of suspected miRNAs (an example is provided).
2) the name of an Ensembl database adaptor (e.g. Human or Mouse). The Ensembl tutorial at http://www.ensembl.org/info/docs/api/core/core_tutorial.html provides an example script that lists all valid names that can be supplied.
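
Malformed candidate lines are easier to catch before the Ensembl lookups start. The sketch below is a hypothetical helper, not part of the pipeline. It assumes the input follows the general GFF layout of nine tab-separated columns with the start in column 4 and the end in column 5; whether find_N_fold_seeds.pl actually requires all nine columns is not stated here, so compare against the provided example file first.

```shell
#!/bin/sh
# Sanity-check sketch for a candidate GFF file (hypothetical helper).

check_gff() {
    # $1 = GFF file. Prints one message per problem line and
    # returns non-zero if any problem was found.
    awk -F '\t' '
        /^#/ || /^[ \t]*$/ { next }    # skip comment and blank lines
        NF != 9 {
            printf "line %d: expected 9 columns, found %d\n", NR, NF
            bad = 1; next
        }
        $4 !~ /^[0-9]+$/ || $5 !~ /^[0-9]+$/ {
            printf "line %d: start or end is not an integer\n", NR
            bad = 1; next
        }
        $4 + 0 > $5 + 0 {
            printf "line %d: start is greater than end\n", NR
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

For example, "check_gff file.gff && find_N_fold_seeds.pl file.gff Human > human.out" runs the first step only when every line passes.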

The script writes its results to STDOUT, so redirect them to a file:
find_N_fold_seeds.pl file.gff Human > human.out

Next, run summarize_structures.pl on human.out.  It takes two arguments: the file and a "class" string.  For novel miRNA detection, the class is always "unknown".

To generate a file for training the SVM, create a .gff of known miRNA locations, run it through find_N_fold_seeds.pl, then call summarize_structures.pl with that output and the class "miRNA".  Do the same for a .gff of known non-miRNA locations, using the class "not_miRNA".
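
The two training runs described above follow the same pattern and differ only in the input file and class label. A minimal sketch of that pattern follows; the function name, the FIND_SEEDS and SUMMARIZE variables and the file names in the usage note are hypothetical. How the two resulting files are combined and fed into SVM training is not covered here.

```shell
#!/bin/sh
# Sketch: run both pipeline scripts for one class label.
# FIND_SEEDS and SUMMARIZE are hypothetical overrides so that the script
# locations can be changed in one place.
FIND_SEEDS=${FIND_SEEDS:-find_N_fold_seeds.pl}
SUMMARIZE=${SUMMARIZE:-summarize_structures.pl}

summarize_class() {
    # $1 = GFF file, $2 = Ensembl name (e.g. Human),
    # $3 = class label, $4 = output file
    seeds=$(mktemp) || return 1
    if "$FIND_SEEDS" "$1" "$2" > "$seeds" &&
       "$SUMMARIZE" "$seeds" "$3" > "$4"; then
        rc=0
    else
        rc=1
    fi
    rm -f "$seeds"    # remove the intermediate seed file in either case
    return $rc
}
```

With hypothetical input files, the two runs would be "summarize_class known_mirnas.gff Human miRNA mirna.dat" and "summarize_class known_non_mirnas.gff Human not_miRNA not_mirna.dat".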

Redirect the output of this script to a file; for classification, the file must be named “to_classify.dat”:
summarize_structures.pl human.out unknown > to_classify.dat

Finally, run the R script, which reads to_classify.dat from the working directory:
R CMD BATCH new_svm_for_microRNAs.R
The R script uses the trained SVM stored in new_svm.Rdata.
The R script writes its output to "predicted_classes.out", which contains a flag, "microRNA" or "not_microRNA", for each structure in your .nice file; these flags correspond to the original list of candidates in the bed/GFF file.
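
A quick tally of the two flags gives a first impression of a run. The sketch below is a hypothetical helper that assumes each prediction sits on its own line of predicted_classes.out; the exact file layout is not documented here, so inspect the file before relying on the counts. It uses grep -w, which GNU and BSD grep both support.

```shell
#!/bin/sh
# Sketch: count predicted flags (hypothetical helper).

count_predictions() {
    # $1 = predictions file (defaults to predicted_classes.out)
    file=${1:-predicted_classes.out}
    # -w matches whole words only. "_" counts as a word character, so
    # "microRNA" does not match inside "not_microRNA".
    # grep -c exits non-zero when the count is 0, hence "|| true".
    mirna=$(grep -cw 'microRNA' "$file") || true
    not_mirna=$(grep -cw 'not_microRNA' "$file") || true
    echo "microRNA: $mirna"
    echo "not_microRNA: $not_mirna"
}
```

Run it as "count_predictions" in the directory where the R script was run, or pass the path to the output file as the first argument.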

Since these steps require no additional parameters, they can be wrapped in a shell script:

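#!/bin/sh
# Arguments: 1) gff file of candidate locations  2) Ensembl name (e.g. Human)

find_N_fold_gff=$1
find_N_fold_species=$2
find_N_fold_seeds.pl "$find_N_fold_gff" "$find_N_fold_species" > temp.txt
summarize_structures.pl temp.txt unknown > to_classify.dat
/path/to/R CMD BATCH new_svm_for_microRNAs.R
# Remove the intermediate files; predicted_classes.out is kept.
rm -f temp.txt
rm -f to_classify.dat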

 

Example use of code:

1)    find_N_fold_seeds.pl mirbase_13.gff Human > find_N_fold_ensembl.out

2)    summarize_structures.pl find_N_fold_ensembl.out unknown > to_classify.dat

3)    ~/R/R CMD BATCH new_svm_for_microRNAs.R

 

 For help with setup/installation, please contact Yaron Butterfield or Andy Chu.