protein module

class mavis.annotate.protein.Domain(name, regions, translation=None, data=None)[source]

Bases: object

Parameters:
  • name (str) – the name of the domain, e.g. PF00876
  • regions (list of DomainRegion) – the amino acid ranges that are part of the domain
  • translation (Translation) – the ‘parent’ translation this domain belongs to

Raises:AttributeError – if the end of any region is less than the start


>>> Domain('DNA binding domain', [(1, 4), (10, 24)], transcript)
align_seq(input_sequence, reference_genome=None, min_region_match=0.5)[source]

aligns each region to the input sequence, starting with the last one, then takes the subset of the sequence that remains to align the second-last region, and so on. returns a list of intervals for the alignment. raises an error if multiple alignments are found

Parameters:
  • input_sequence (str) – the sequence to be aligned to
  • reference_genome (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name
  • min_region_match (float) – fraction between 0 and 1. Each region must reach a score of at least len(seq) * min_region_match

Returns:tuple contains

  • int: the number of matches
  • int: the total number of amino acids to be aligned
  • list of DomainRegion: the list of domain regions on the new input sequence

Return type:tuple of int, int, and list of DomainRegion


Raises:
  • AttributeError – if sequence information is not available
  • UserWarning – if a valid alignment could not be found or no best alignment was found
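
The right-to-left placement described for align_seq can be sketched as follows. This is a simplified illustration, not the mavis implementation: it assumes ungapped, identity-based scoring, works on plain strings rather than DomainRegion objects, and the name align_regions is hypothetical.

```python
def align_regions(region_seqs, input_sequence, min_region_match=0.5):
    """Place each region on input_sequence, last region first.

    Returns (matches, total, placements); placements are 1-based inclusive
    (start, end) pairs listed in the original region order.
    Assumption: ungapped comparison, one point per identical residue.
    """
    remaining = len(input_sequence)  # only input_sequence[:remaining] is still free
    placements = []
    matches = 0
    total = 0
    for seq in reversed(region_seqs):
        best_score, best_starts = 0, []
        for start in range(0, remaining - len(seq) + 1):
            window = input_sequence[start:start + len(seq)]
            score = sum(a == b for a, b in zip(seq, window))
            if score > best_score:
                best_score, best_starts = score, [start]
            elif score == best_score:
                best_starts.append(start)
        if not best_starts or best_score < len(seq) * min_region_match:
            raise UserWarning('no valid alignment found')
        if len(best_starts) > 1:
            raise UserWarning('multiple equally good alignments found')
        start = best_starts[0]
        placements.append((start + 1, start + len(seq)))
        matches += best_score
        total += len(seq)
        remaining = start  # earlier regions must fit to the left of this one
    return matches, total, placements[::-1]


result = align_regions(['MKTA', 'RQISF'], 'GGMKTAYIAKQRQLSFVK')
print(result)
```

Restricting each earlier region to the sequence left of the previous placement keeps the regions in order and non-overlapping, which is the point of starting from the last region.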
get_seqs(reference_genome=None, ignore_cache=False)[source]

returns the amino acid sequences for each of the domain regions associated with this domain in the order of the regions (sorted by start)

Parameters:reference_genome (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name
Returns:list of amino acid sequences for each DomainRegion
Return type:list of str
Raises:AttributeError – if there is not enough sequence information given to determine this
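
A minimal sketch of the slicing that get_seqs describes, assuming regions are 1-based inclusive amino acid coordinates on the translated sequence. The helper region_seqs and the plain (start, end) tuples are illustrative stand-ins, not part of mavis.

```python
def region_seqs(aa_seq, regions):
    """Return the amino acid substring for each (start, end) region.

    Assumes 1-based inclusive coordinates; results are ordered by region start.
    """
    seqs = []
    for start, end in sorted(regions):
        if end < start:
            raise AttributeError('region end is less than its start')
        if start < 1 or end > len(aa_seq):
            raise AttributeError('region falls outside the available sequence')
        seqs.append(aa_seq[start - 1:end])
    return seqs


print(region_seqs('MKTAYIAKQRQISFVKSHFSRQ', [(10, 14), (1, 4)]))
# → ['MKTA', 'RQISF']
```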

tuple: a tuple representing the items expected to be unique, used for hashing and comparing


compares the sequence in each DomainRegion to the sequence collected for that domain region from the translation object

Parameters:reference_genome (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name
Returns:tuple contains
  • int: the number of matching amino acids
  • int: the total number of amino acids
Return type:tuple of int and int
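
The comparison above amounts to counting identical residues per region. A small sketch, assuming both sides are already available as equal-length strings; the name score_regions and the ValueError on a length mismatch are assumptions made for illustration.

```python
def score_regions(stored_seqs, derived_seqs):
    """Return (matches, total) over all paired region sequences."""
    matches = 0
    total = 0
    for stored, derived in zip(stored_seqs, derived_seqs):
        if len(stored) != len(derived):
            # assumption: a length mismatch is treated as an error here
            raise ValueError('region sequences differ in length')
        matches += sum(a == b for a, b in zip(stored, derived))
        total += len(stored)
    return matches, total


print(score_regions(['MKTA', 'RQISF'], ['MKTA', 'RQLSF']))
# → (8, 9)
```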

Translation – the Translation this domain belongs to

class mavis.annotate.protein.DomainRegion(start, end, seq=None, domain=None, name=None)[source]

Bases: mavis.annotate.base.BioInterval

class mavis.annotate.protein.Translation(start, end, transcript=None, domains=None, seq=None, name=None)[source]

Bases: mavis.annotate.base.BioInterval

describes the splicing pattern and cds start and end with reference to a particular transcript

Parameters:
  • start (int) – start of the coding sequence (cds) relative to the start of the first exon in the transcript
  • end (int) – end of the coding sequence (cds) relative to the start of the first exon in the transcript
  • transcript (Transcript) – the transcript this is a Translation of
  • domains (list of Domain) – a list of the domains on this translation
  • seq (str) – the cds sequence

Parameters:pos (int) – the amino acid position
Returns:the cdna equivalent (with CODON_SIZE uncertainty)
Return type:Interval

Parameters:pos (int) – the cdna position
Returns:the protein/amino-acid position
Return type:int
Raises:AttributeError – the cdna position is not translated
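
The two conversions above are inverse views of the same codon arithmetic. A sketch under stated assumptions: positions are 1-based, cds_start and cds_end are cdna coordinates of the first and last coding bases, and a plain tuple stands in for the Interval return type. The function names are hypothetical.

```python
CODON_SIZE = 3


def aa_to_cdna(aa_pos, cds_start):
    """Return the (first, last) cdna positions of the codon for an amino acid."""
    first = cds_start + (aa_pos - 1) * CODON_SIZE
    return first, first + CODON_SIZE - 1


def cdna_to_aa(cdna_pos, cds_start, cds_end):
    """Return the amino acid position whose codon covers cdna_pos."""
    if not cds_start <= cdna_pos <= cds_end:
        raise AttributeError('the cdna position is not translated')
    return (cdna_pos - cds_start) // CODON_SIZE + 1


print(aa_to_cdna(5, cds_start=101))
# → (113, 115)
print(cdna_to_aa(115, cds_start=101, cds_end=400))
# → 5
```

An amino acid maps to a three-base interval rather than a single position, which is the CODON_SIZE uncertainty mentioned in the return description.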

converts a genomic position to its cds (coding sequence) equivalent

Parameters:pos (int) – the genomic position
Returns:the cds position (negative if before the initiation start site)
Return type:int

converts a genomic position to its cds (coding sequence) equivalent using hgvs cds notation

Parameters:pos (int) – the genomic position
Returns:the cds position notation
Return type:str


>>> tl = Translation(...)
# a position before the translation start
>>> tl.convert_genomic_to_cds_notation(1010)
# a position after the translation end
>>> tl.convert_genomic_to_cds_notation(2031)
# an intronic position
>>> tl.convert_genomic_to_cds_notation(1542)
>>> tl.convert_genomic_to_cds_notation(1589)
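
For orientation, the sketch below formats a cds position and intronic shift in an HGVS-like style: a leading minus for positions before the translation start, a leading asterisk for positions after the translation end, and a signed offset for intronic positions. The function cds_notation and its arguments are hypothetical, and the strings it produces are not claimed to match the return values of the method above.

```python
def cds_notation(cds_pos, shift, cds_length):
    """Format (cds_pos, shift) as an HGVS-like position string.

    Assumption: cds_pos is negative before the start codon and exceeds
    cds_length after the stop codon; shift is 0 for exonic positions.
    """
    if cds_pos > cds_length:
        base = '*{}'.format(cds_pos - cds_length)  # past the translation end
    else:
        base = str(cds_pos)  # negative values keep their minus sign
    if shift > 0:
        return '{}+{}'.format(base, shift)
    if shift < 0:
        return '{}{}'.format(base, shift)  # shift already carries its minus sign
    return base


for args in [(-50, 0, 300), (372, 0, 300), (50, 10, 300), (51, -14, 300)]:
    print(cds_notation(*args))
```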

converts a genomic position to its cds equivalent or (if intronic) the nearest cds and shift

Parameters:pos (int) – the genomic position
Returns:tuple contains
  • int: the cds position
  • int: the intronic shift
Return type:tuple of int and int
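
A sketch of how a genomic position might be mapped to the nearest cds position and shift. It assumes a forward-strand transcript, exons given as sorted 1-based inclusive (start, end) pairs, cds_start expressed as a cdna position, and no position zero (the base before the start codon is -1). The name genomic_to_nearest_cds is hypothetical and reverse-strand handling is omitted.

```python
def genomic_to_nearest_cds(pos, exons, cds_start):
    """Return (cds_pos, shift) for a genomic position on a forward-strand transcript."""

    def to_cds(cdna_pos):
        cds = cdna_pos - cds_start
        return cds + 1 if cds >= 0 else cds  # skip zero, as assumed above

    exonic_before = 0  # exonic bases in the exons already passed
    prev_end = None
    for start, end in exons:
        if pos < start:
            if prev_end is None:
                raise ValueError('position is upstream of the transcript')
            dist_prev = pos - prev_end  # distance to the last base of the previous exon
            dist_next = start - pos     # distance to the first base of this exon
            if dist_prev <= dist_next:
                return to_cds(exonic_before), dist_prev
            return to_cds(exonic_before + 1), -dist_next
        if pos <= end:
            return to_cds(exonic_before + pos - start + 1), 0
        exonic_before += end - start + 1
        prev_end = end
    raise ValueError('position is downstream of the transcript')


exons = [(1001, 1100), (1501, 1600)]
print(genomic_to_nearest_cds(1110, exons, cds_start=51))
# → (50, 10)
print(genomic_to_nearest_cds(1490, exons, cds_start=51))
# → (51, -11)
```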
get_aa_seq(reference_genome=None, ignore_cache=False)[source]
Parameters:reference_genome (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name
Returns:the amino acid sequence
Return type:str
Raises:AttributeError – if the reference sequence has not been given and is not set
get_cds_seq(reference_genome=None, ignore_cache=False)[source]
Parameters:reference_genome (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name
Returns:the cds sequence
Return type:str
Raises:AttributeError – if the reference sequence has not been given and is not set
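
The relationship between the cds sequence and the amino acid sequence can be sketched with the standard genetic code. This is an illustration only: it assumes the spliced cdna is already available as a string, uses 1-based inclusive cds coordinates, and keeps the stop codon as '*'. The helper names are hypothetical and reference lookup and caching are not modelled.

```python
from itertools import product

# standard genetic code, codons enumerated in TCAG order
BASES = 'TCAG'
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {
    ''.join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def cds_from_cdna(cdna_seq, cds_start, cds_end):
    """Slice the coding sequence out of a spliced cdna (1-based inclusive)."""
    return cdna_seq[cds_start - 1:cds_end]


def translate(cds_seq):
    """Translate complete codons; trailing bases that do not fill a codon are ignored."""
    cds_seq = cds_seq.upper()
    usable = len(cds_seq) - len(cds_seq) % 3
    return ''.join(CODON_TABLE[cds_seq[i:i + 3]] for i in range(0, usable, 3))


cds = cds_from_cdna('GGCATGGCCAAGTAACC', 4, 15)
print(cds, translate(cds))
# → ATGGCCAAGTAA MAK*
```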
get_seq(reference_genome=None, ignore_cache=False)[source]

wrapper for the sequence method

Parameters:reference_genome (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name

see mavis.annotate.base.BioInterval.key()


Transcript – the spliced transcript this translation belongs to

mavis.annotate.protein.calculate_orf(spliced_cdna_sequence, min_orf_size=None)[source]

calculate all possible open reading frames given a spliced cdna sequence (no introns)

Parameters:spliced_cdna_sequence (str) – the sequence
Returns:list of open reading frame positions on the input sequence
Return type:list of Interval
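
A minimal open reading frame scan, given as a sketch rather than the mavis implementation. It assumes an ORF runs from an ATG to the first in-frame stop codon, reports 1-based inclusive (start, end) tuples in place of Interval objects, and interprets min_orf_size as a length in nucleotides. The name find_orfs is hypothetical.

```python
STOP_CODONS = {'TAA', 'TAG', 'TGA'}


def find_orfs(seq, min_orf_size=None):
    """Return 1-based inclusive (start, end) pairs for each ATG..stop frame."""
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != 'ATG':
            continue
        # walk codon by codon in the frame set by this ATG
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                end = pos + 3  # exclusive end, so also the 1-based inclusive end
                if min_orf_size is None or end - start >= min_orf_size:
                    orfs.append((start + 1, end))
                break
    return orfs


print(find_orfs('CCATGAAATAGGATGCCCTGA'))
# → [(3, 11), (13, 21)]
```

Every ATG is treated as a candidate start, so nested ORFs sharing a stop codon are all reported; an ATG with no in-frame stop before the end of the sequence is skipped.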