file_io module

Module that holds all functions related to loading reference files.

mavis.annotate.file_io.convert_tab_to_json(filepath, warn=<function devnull>)

Given a file in the standard input format (see below), reads it and returns the genes (and their sub-objects) it describes.

column name	example	description
ensembl_transcript_id	ENST000001
ensembl_gene_id	ENSG000001
strand	-1	positive or negative 1
cdna_coding_start	44	where translation begins relative to the start of the cdna
cdna_coding_end	150	where translation terminates
genomic_exon_ranges	100-201;334-412;779-830	semi-colon delimited exon start/ends
AA_domain_ranges	DBD:220-251,260-271	semi-colon delimited list of domains
hugo_names	KRAS	hugo gene name

Parameters:	filepath (str) – path to the input tab-delimited file
Returns:	a dictionary keyed by chromosome name with values of list of genes on the chromosome
Return type:	`dict` of `list` of `Gene` by `str`

Example

>>> ref = load_reference_genes('filename')
>>> ref['1']
[Gene(), Gene(), ....]

Warning

Does not load translations unless they start with ‘M’, end with ‘*’, and have a length that is a multiple of 3.
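
The range columns in the table above use a compact text encoding. The following is a minimal sketch of how such fields could be split into integer ranges; the helper names `parse_exon_ranges` and `parse_domain_ranges` are hypothetical and are not part of `mavis.annotate.file_io`, whose actual parsing may differ.

```python
def parse_exon_ranges(field):
    """Parse '100-201;334-412' into [(100, 201), (334, 412)]."""
    ranges = []
    for chunk in field.split(';'):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing semi-colon
        start, end = chunk.split('-')
        ranges.append((int(start), int(end)))
    return ranges


def parse_domain_ranges(field):
    """Parse 'DBD:220-251,260-271' into [('DBD', [(220, 251), (260, 271)])]."""
    domains = []
    for chunk in field.split(';'):
        chunk = chunk.strip()
        if not chunk:
            continue
        # the domain name precedes the first colon; regions are comma-separated
        name, regions = chunk.split(':', 1)
        domains.append((name, parse_exon_ranges(regions.replace(',', ';'))))
    return domains


exons = parse_exon_ranges('100-201;334-412;779-830')
domains = parse_domain_ranges('DBD:220-251,260-271')
```

A list of `(name, ranges)` pairs is used for the domains rather than a dictionary so that repeated domain names are kept separate.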

mavis.annotate.file_io.load_annotations(filepath, warn=<function devnull>, reference_genome=None, filetype=None, best_transcripts_only=False)

Loads gene models from an input file. Expects a tab-delimited or JSON file.

Parameters:	
filepath (str) – path to the input file
verbose (bool) – output extra information to stdout
reference_genome (`dict` of `Bio.SeqRecord` by `str`) – dict of reference sequence by template/chr name
filetype (str) – json or tab/tsv. Only required if the file type can’t be inferred from the path extension
Returns:	lists of genes keyed by chromosome name
Return type:	`dict` of `list` of `Gene` by `str`
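
The `filetype` argument is only needed when the type cannot be determined from the file name. As an illustration of that idea, here is a minimal sketch of extension-based detection; `guess_filetype` is a hypothetical helper, and the exact set of extensions MAVIS recognises is an assumption.

```python
import os


def guess_filetype(filepath):
    """Return 'json' or 'tab' based on the file extension (case-insensitive)."""
    ext = os.path.splitext(filepath)[1].lower().lstrip('.')
    if ext == 'json':
        return 'json'
    if ext in ('tab', 'tsv'):  # assumed tab-delimited extensions
        return 'tab'
    raise ValueError('cannot infer file type from extension: {!r}'.format(filepath))
```

When a path has any other extension, a caller would pass `filetype` explicitly instead of relying on detection.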

mavis.annotate.file_io.load_masking_regions(filepath)

Reads a file of regions. The expected input format is tab-delimited, and the header should contain the following columns:

chr: the chromosome
start: start of the region, 1-based inclusive
end: end of the region, 1-based inclusive
name: the name/label of the region

For example:

#chr    start   end     name
chr20   25600000        27500000        centromere

Parameters:	filepath (str) – path to the input tab-delimited file
Returns:	a dictionary keyed by chromosome name with values of lists of regions on the chromosome
Return type:	`dict` of `list` of `BioInterval` by `str`

Example

>>> m = load_masking_regions('filename')
>>> m['1']
[BioInterval(), BioInterval(), ...]
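
To make the expected file layout concrete, the following sketch reads the same four columns with the standard library and groups the regions by chromosome. It is not the MAVIS implementation: `read_masking_regions` is a hypothetical name, and plain `(start, end, name)` tuples stand in for `BioInterval` objects.

```python
import csv
import io
from collections import defaultdict


def read_masking_regions(handle):
    """Group (start, end, name) tuples by chromosome from a tab-delimited handle."""
    lines = (line for line in handle if line.strip())  # skip blank lines
    reader = csv.reader(lines, delimiter='\t')
    # the header line may be prefixed with '#', as in the example above
    header = [col.lstrip('#').strip() for col in next(reader)]
    regions = defaultdict(list)
    for row in reader:
        record = dict(zip(header, (value.strip() for value in row)))
        regions[record['chr']].append(
            (int(record['start']), int(record['end']), record['name'])
        )
    return dict(regions)


example = io.StringIO('#chr\tstart\tend\tname\nchr20\t25600000\t27500000\tcentromere\n')
masks = read_masking_regions(example)
```

Looking up columns by header name rather than by position means the columns may appear in any order.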

mavis.annotate.file_io.load_reference_genes(*pos, **kwargs)

Deprecated. Use load_annotations() instead.

mavis.annotate.file_io.load_reference_genome(filename, low_mem=False)

Parameters:	filename (str) – the path to the file containing the input fasta genome
Returns:	a dictionary representing the sequences in the fasta file
Return type:	`dict` of `Bio.SeqRecord` by `str`

mavis.annotate.file_io.load_templates(filename)

Primarily useful if template drawings are required; not necessary otherwise. Assumes the input file is 0-indexed with [start,end) style. Columns are expected in the following order, tab-delimited. A header should not be given.

name
start
end
band_name
giemsa_stain

For example:

chr1    0       2300000 p36.33  gneg
chr1    2300000 5400000 p36.32  gpos25

Parameters:	filename (str) – the path to the file with the cytoband template information
Returns:	list of the templates loaded
Return type:	`list` of `Template`
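
The cytoband layout described above can be read with a few lines of standard-library Python. This is only a sketch: `read_cytobands` is a hypothetical name, dictionaries stand in for `Template` objects, and the conversion from 0-indexed [start,end) to 1-based inclusive coordinates is shown for illustration; whether `load_templates` performs that conversion is not stated here.

```python
import io

# column order expected in the header-less input file
COLUMNS = ('name', 'start', 'end', 'band_name', 'giemsa_stain')


def read_cytobands(handle):
    """Read header-less, tab-delimited cytoband rows into a list of dicts."""
    bands = []
    for line in handle:
        line = line.rstrip('\n')
        if not line.strip():
            continue  # skip blank lines
        row = dict(zip(COLUMNS, line.split('\t')))
        start, end = int(row['start']), int(row['end'])
        # 0-based half-open [start, end) becomes 1-based inclusive [start + 1, end]
        row['start'], row['end'] = start + 1, end
        bands.append(row)
    return bands


example = io.StringIO(
    'chr1\t0\t2300000\tp36.33\tgneg\n'
    'chr1\t2300000\t5400000\tp36.32\tgpos25\n'
)
bands = read_cytobands(example)
```

After conversion the two example bands cover positions 1 to 2300000 and 2300001 to 5400000, so adjacent bands no longer share a boundary coordinate.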

mavis.annotate.file_io.parse_annotations_json(data, reference_genome=None, best_transcripts_only=False, warn=<function devnull>)

Parses JSON annotation data into annotation objects.