file_io module

module which holds all functions relating to loading reference files

mavis.annotate.file_io.convert_tab_to_json(filepath, warn=<function devnull>)[source]

given a file in the standard input format (see below), reads and returns a list of genes (and sub-objects)

column name             example                    description
ensembl_transcript_id   ENST000001
ensembl_gene_id         ENSG000001
strand                  -1                         positive or negative 1
cdna_coding_start       44                         where translation begins relative to the start of the cdna
cdna_coding_end         150                        where translation terminates
genomic_exon_ranges     100-201;334-412;779-830    semi-colon delimited exon start/ends
AA_domain_ranges        DBD:220-251,260-271        semi-colon delimited list of domains
hugo_names              KRAS                       hugo gene name
Parameters: filepath (str) – path to the input tab-delimited file
Returns: a dictionary keyed by chromosome name with values of lists of genes on the chromosome
Return type: dict of list of Gene by str

Example

>>> ref = load_reference_genes('filename')
>>> ref['1']
[Gene(), Gene(), ....]
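
The range columns in the table above pack several coordinates into one field. A minimal sketch of how such fields could be split, assuming the layout shown in the example values (the helper names parse_exon_ranges and parse_domain_ranges are hypothetical and are not part of mavis; the comma between regions of one domain is inferred from the example only):

```python
def parse_exon_ranges(field):
    """Split '100-201;334-412' into [(100, 201), (334, 412)]."""
    ranges = []
    for chunk in field.split(';'):
        start, end = chunk.split('-')
        ranges.append((int(start), int(end)))
    return ranges


def parse_domain_ranges(field):
    """Split 'DBD:220-251,260-271' into [('DBD', [(220, 251), (260, 271)])].

    Assumption: domains are separated by ';' and the regions of a single
    domain are separated by ','.
    """
    domains = []
    for chunk in field.split(';'):
        name, regions = chunk.split(':')
        parsed = []
        for region in regions.split(','):
            start, end = region.split('-')
            parsed.append((int(start), int(end)))
        domains.append((name, parsed))
    return domains


exons = parse_exon_ranges('100-201;334-412;779-830')
domains = parse_domain_ranges('DBD:220-251,260-271')
```

A list of (name, regions) pairs is used rather than a dict so that a domain name appearing more than once is not silently overwritten.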

Warning

does not load translations unless they start with ‘M’, end with ‘*’, and have a length that is a multiple of 3
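
The warning does not say which sequence the length condition applies to. One possible reading, sketched below, is that the coding nucleotide sequence must have a length divisible by 3 while its amino-acid translation must start with 'M' and end with '*'. The function name and its two arguments are hypothetical; this is not the check mavis itself performs.

```python
def looks_like_complete_translation(coding_seq, protein_seq):
    """Return True when the three conditions from the warning hold.

    Assumption: coding_seq is the nucleotide sequence between the coding
    start and end, and protein_seq is its translation with '*' as the
    stop symbol.
    """
    return (
        len(coding_seq) % 3 == 0
        and protein_seq.startswith('M')
        and protein_seq.endswith('*')
    )


complete = looks_like_complete_translation('ATGAAATAG', 'MK*')
truncated = looks_like_complete_translation('ATGAAATA', 'MK')
```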

mavis.annotate.file_io.load_annotations(filepath, warn=<function devnull>, REFERENCE_GENOME=None, filetype=None, best_transcripts_only=False)[source]

loads gene models from an input file. Expects a tabbed or json file.

Parameters:
  • filepath (str) – path to the input file
  • verbose (bool) – output extra information to stdout
  • REFERENCE_GENOME (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name
  • filetype (str) – json or tab/tsv. only required if the file type can’t be inferred from the path extension
Returns:

lists of genes keyed by chromosome name

Return type:

dict of list of Gene by str
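
The filetype parameter is only needed when the extension is not informative. A minimal sketch of that kind of fallback, assuming the extensions .json, .tab and .tsv are the recognized ones (infer_filetype is a hypothetical helper, not a mavis function, and the real lookup may differ):

```python
import os


def infer_filetype(filepath, filetype=None):
    """Return 'json' or 'tab', preferring an explicit filetype if given."""
    if filetype is not None:
        kind = filetype.lower()
    else:
        # take the text after the last '.', e.g. 'annotations.JSON' -> 'json'
        kind = os.path.splitext(filepath)[1].lower().lstrip('.')
    if kind == 'json':
        return 'json'
    if kind in ('tab', 'tsv'):
        return 'tab'
    raise ValueError('cannot determine file type for {!r}'.format(filepath))


kind = infer_filetype('reference/annotations.json')
```

Raising an error for an unknown extension, rather than guessing, is a choice made for this sketch.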

mavis.annotate.file_io.load_masking_regions(filepath)[source]

reads a file of regions. The expected input format for the file is tab-delimited, and the header should contain the following columns

  • chr: the chromosome
  • start: start of the region, 1-based inclusive
  • end: end of the region, 1-based inclusive
  • name: the name/label of the region

For example:

#chr    start   end     name
chr20   25600000        27500000        centromere
Parameters: filepath (str) – path to the input tab-delimited file
Returns: a dictionary keyed by chromosome name with values of lists of regions on the chromosome
Return type: dict of list of BioInterval by str

Example

>>> m = load_masking_regions('filename')
>>> m['1']
[BioInterval(), BioInterval(), ...]
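
To make the file layout concrete, the following sketch reads the example above with the standard csv module and groups the rows by chromosome. It returns plain tuples rather than BioInterval objects, and read_masking_regions is a hypothetical stand-in, not the mavis implementation.

```python
import csv
import io

SAMPLE = "#chr\tstart\tend\tname\nchr20\t25600000\t27500000\tcentromere\n"


def read_masking_regions(handle):
    """Group (start, end, name) tuples by chromosome name."""
    reader = csv.reader(handle, delimiter='\t')
    # the header line starts with '#', so strip it from the column names
    header = [column.lstrip('#') for column in next(reader)]
    regions = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(header, row))
        entry = (int(record['start']), int(record['end']), record['name'])
        regions.setdefault(record['chr'], []).append(entry)
    return regions


regions = read_masking_regions(io.StringIO(SAMPLE))
```

Looking columns up by header name rather than position means the column order in the file does not matter in this sketch.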
mavis.annotate.file_io.load_reference_genes(*pos, **kwargs)[source]
mavis.annotate.file_io.load_reference_genome(filename, low_mem=False)[source]
Parameters: filename (str) – the path to the file containing the input fasta genome
Returns: a dictionary representing the sequences in the fasta file
Return type: dict of Bio.SeqRecord by str
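
The shape of the returned mapping can be illustrated without Biopython. The sketch below is a simplified stand-in that maps each sequence name to a plain string instead of a Bio.SeqRecord; read_fasta is a hypothetical helper and ignores the low_mem option, whose behavior is not described here.

```python
import io


def read_fasta(handle):
    """Return a dict of sequence string keyed by the name after '>'."""
    sequences = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if name is not None:
                sequences[name] = ''.join(chunks)
            # the name is the first whitespace-delimited token of the header
            parts = line[1:].split()
            name = parts[0] if parts else ''
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        sequences[name] = ''.join(chunks)
    return sequences


genome = read_fasta(io.StringIO(">chr1 example\nACGT\nTT\n>chr2\nGG\n"))
```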
mavis.annotate.file_io.load_templates(filename)[source]

primarily useful if template drawings are required; not necessary otherwise. Assumes the input file is 0-indexed with [start,end) style coordinates. Columns are expected in the following order, tab-delimited. A header should not be given

  1. name
  2. start
  3. end
  4. band_name
  5. giemsa_stain

for example

chr1    0       2300000 p36.33  gneg
chr1    2300000 5400000 p36.32  gpos25
Parameters: filename (str) – the path to the file with the cytoband template information
Returns: list of the templates loaded
Return type: list of Template
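
Because the input uses 0-indexed [start,end) coordinates while other files in this module use 1-based inclusive ones, a sketch of the conversion may help. Adding 1 to the start and leaving the end unchanged turns a half-open 0-based interval into a 1-based inclusive one. The Band tuple and read_bands are hypothetical; whether and where mavis performs this conversion is an assumption of the sketch.

```python
import io
from collections import namedtuple

Band = namedtuple('Band', ['template', 'start', 'end', 'name', 'stain'])

SAMPLE = "chr1\t0\t2300000\tp36.33\tgneg\nchr1\t2300000\t5400000\tp36.32\tgpos25\n"


def read_bands(handle):
    """Group cytobands by template name, as 1-based inclusive intervals."""
    bands = {}
    for line in handle:
        line = line.rstrip('\r\n')
        if not line:
            continue
        template, start, end, name, stain = line.split('\t')
        # 0-based [start, end) -> 1-based inclusive [start + 1, end]
        band = Band(template, int(start) + 1, int(end), name, stain)
        bands.setdefault(template, []).append(band)
    return bands


bands = read_bands(io.StringIO(SAMPLE))
```

After the conversion, adjacent bands no longer share a coordinate: the first band ends at 2300000 and the second starts at 2300001.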
mavis.annotate.file_io.parse_annotations_json(data, REFERENCE_GENOME=None, best_transcripts_only=False, warn=<function devnull>)[source]

parses a json of annotation information into annotation objects