breakpoint module

class mavis.breakpoint.Breakpoint(chr, start, end=None, orient='?', strand='?', seq=None)

Bases: mavis.interval.Interval

class for storing information about an SV breakpoint; coordinates are given as 1-indexed

Parameters:
  • chr (str) – the chromosome
  • start (int) – the genomic position of the breakpoint
  • end (int) – if the breakpoint is uncertain (a range) then specify the end of the range here
  • strand (STRAND) – the strand
  • orient (ORIENT) – the orientation (which side is retained at the break)

Examples

>>> Breakpoint('1', 1, 2)
>>> Breakpoint('1', 1)
>>> Breakpoint('1', 1, 2, orient='R', strand='+')
>>> Breakpoint('1', 1, orient='R')
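
The 1-indexed convention and the optional end coordinate can be illustrated with a small stand-in class. This is only a sketch: `SketchBreakpoint` is a hypothetical name, not the MAVIS implementation, and it assumes that the range is inclusive and that an omitted end defaults to start.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchBreakpoint:
    """Hypothetical stand-in for illustration; not mavis.breakpoint.Breakpoint."""

    chr: str
    start: int                 # 1-indexed genomic position
    end: Optional[int] = None  # assumption: defaults to start when omitted
    orient: str = '?'
    strand: str = '?'

    def __post_init__(self):
        if self.end is None:
            self.end = self.start
        if self.start < 1 or self.end < self.start:
            raise ValueError('expected 1 <= start <= end')

    def __len__(self):
        # inclusive 1-indexed range: positions 1..2 cover two bases
        return self.end - self.start + 1


exact = SketchBreakpoint('1', 1)         # a specific breakpoint
uncertain = SketchBreakpoint('1', 1, 2)  # a breakpoint somewhere in 1..2
print(len(exact), len(uncertain))
```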
key
to_dict()
class mavis.breakpoint.BreakpointPair(b1, b2, stranded=False, opposing_strands=None, untemplated_seq=None, data={})

Bases: object

Parameters:
  • b1 (Breakpoint) – the first breakpoint
  • b2 (Breakpoint) – the second breakpoint
  • stranded (bool) – if not stranded then +/- is equivalent to -/+
  • opposing_strands (bool) – are the strands at the breakpoint opposite? i.e. +/- instead of +/+
  • untemplated_seq (str) – seq between the breakpoints that is not part of either breakpoint
  • data (dict) – optional dictionary of attributes associated with this pair

Note

untemplated_seq should always be given with respect to the positive/forward reference strand
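
As an illustration of this note, a sequence observed on the reverse strand would need to be reverse-complemented before it is passed as untemplated_seq. The helper below is a minimal sketch; `reverse_complement` is a hypothetical name introduced here and is not taken from the MAVIS API.

```python
# Translation table for DNA bases (N is kept as N); hypothetical helper, not part of MAVIS
_COMPLEMENT = str.maketrans('ACGTNacgtn', 'TGCANtgcan')


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# a sequence read off the negative strand, converted to the forward strand
observed_on_reverse_strand = 'AAC'
untemplated_seq = reverse_complement(observed_on_reverse_strand)
print(untemplated_seq)
```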

Example

>>> BreakpointPair(Breakpoint('1', 1), Breakpoint('1', 9999), opposing_strands=True)
>>> BreakpointPair(Breakpoint('1', 1, strand='+'), Breakpoint('1', 9999, strand='-'))
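
The relationship between the strand arguments can be sketched as follows. Both helpers are hypothetical and only restate the parameter descriptions above: opposing strands means the two strands differ, and for an unstranded pair +/- is treated the same as -/+. How MAVIS itself resolves these cases is not shown here.

```python
def infer_opposing_strands(strand1, strand2):
    """Return True/False when both strands are known, otherwise None (hypothetical helper)."""
    known = {'+', '-'}
    if strand1 not in known or strand2 not in known:
        return None
    return strand1 != strand2


def strand_key(strand1, strand2, stranded=False):
    """Return a comparison key for a strand pair (hypothetical helper).

    When not stranded, the order of the two strands is ignored,
    so +/- and -/+ produce the same key.
    """
    if stranded:
        return (strand1, strand2)
    return tuple(sorted((strand1, strand2)))


print(infer_opposing_strands('+', '-'))
print(strand_key('+', '-') == strand_key('-', '+'))
```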
breakpoint_sequence_homology(REFERENCE_GENOME)

for a given set of breakpoints, matches the sequence opposite the partner breakpoint. This sequence comparison is done with reference to a reference genome and does not use novel or untemplated sequence in the comparison. For this reason, insertions will never return any homologous sequence

small duplication event CTT => CTTCTT

GATACATTTCTTCTTGAAAA reference
---------<========== first breakpoint
===========>-------- second breakpoint
---------CT-CT------ first break homology
-------TT-TT-------- second break homology
Parameters:REFERENCE_GENOME (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name
Returns:
  • str - homologous sequence at the first breakpoint
  • str - homologous sequence at the second breakpoint
Return type:tuple
Raises:AttributeError – for non specific breakpoints
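
The general idea of comparing reference sequence on either side of a junction can be sketched for the simpler case of a deletion. This is not the MAVIS algorithm and its return value differs from the method above: `deletion_homology` is a hypothetical function that takes a plain string reference and two specific 1-indexed breakpoints (the last retained base on the left and the first retained base on the right) and reports how far the junction could slide in each direction without changing the resulting sequence.

```python
def deletion_homology(ref, b1, b2):
    """Sketch of junction homology for a deletion (hypothetical, not MAVIS code).

    ref: reference sequence as a plain string
    b1:  1-indexed position of the last retained base on the left
    b2:  1-indexed position of the first retained base on the right
    Returns (forward, backward): the bases by which the junction can be
    shifted right and left, respectively, while giving the same product.
    """
    if not 1 <= b1 < b2 <= len(ref):
        raise ValueError('expected 1 <= b1 < b2 <= len(ref)')

    # forward: first deleted bases vs. the bases starting at the second break
    forward = []
    i, j = b1, b2 - 1  # 0-based indices of positions b1 + 1 and b2
    while j < len(ref) and i < b2 - 1 and ref[i] == ref[j]:
        forward.append(ref[i])
        i += 1
        j += 1

    # backward: bases ending at the first break vs. the last deleted bases
    backward = []
    i, j = b1 - 1, b2 - 2  # 0-based indices of positions b1 and b2 - 1
    while i >= 0 and j > b1 - 1 and ref[i] == ref[j]:
        backward.append(ref[i])
        i -= 1
        j -= 1

    return ''.join(forward), ''.join(reversed(backward))


# deleting positions 3-8 or positions 6-11 both yield GGACTCC
print(deletion_homology('GGACTTTTACTCC', 2, 9))
```

In the usage line, the junction can slide three bases to the right (ACT), which is why two different coordinate pairs describe the same deletion.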
classmethod call_breakpoint_pair(read1, read2=None, REFERENCE_GENOME=None)

calls a set of breakpoints from a single pysam-style read or from a pair of pysam-style reads

Parameters:
Returns:the newly called breakpoint pair from the contig
Return type:BreakpointPair

Todo

return multiple events not just the major event

classmethod classify(pair)

uses the chr, orientations and strands to determine the possible structural_variant types that this pair could support

Parameters:pair (BreakpointPair) – the pair to classify
Returns:a list of possible SVTYPE
Return type:list of SVTYPE

Example

>>> bpp = BreakpointPair(Breakpoint('1', 1), Breakpoint('1', 9999), opposing_strands=True)
>>> BreakpointPair.classify(bpp)
['inversion']
>>> bpp = BreakpointPair(Breakpoint('1', 1, orient='L'), Breakpoint('1', 9999, orient='R'), opposing_strands=False)
>>> BreakpointPair.classify(bpp)
['deletion', 'insertion']

see related theory documentation
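
A rough decision table for this kind of classification might look like the sketch below. Only the two cases shown in the example above come from this documentation; the remaining branches and their type names (duplication, translocation, inverted translocation) are assumptions made for illustration, and `sketch_classify` is a hypothetical function rather than the MAVIS implementation.

```python
def sketch_classify(chr1, chr2, orient1, orient2, opposing_strands):
    """Hypothetical classification sketch; not BreakpointPair.classify."""
    if chr1 != chr2:
        # assumption: events between chromosomes are translocations
        return ['inverted translocation'] if opposing_strands else ['translocation']
    if opposing_strands:
        # documented case: same chromosome, opposing strands
        return ['inversion']
    if (orient1, orient2) == ('L', 'R'):
        # documented case: left side retained at the first break, right at the second
        return ['deletion', 'insertion']
    if (orient1, orient2) == ('R', 'L'):
        # assumption: the reverse orientation pattern indicates a duplication
        return ['duplication']
    # assumption: with unknown orientation nothing can be ruled out
    return ['deletion', 'duplication', 'insertion']


print(sketch_classify('1', '1', '?', '?', True))
print(sketch_classify('1', '1', 'L', 'R', False))
```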

copy()
flatten()

returns the key-value pairs for the breakpoint pair information, as can be written directly as a TSV row
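
Writing such a flat dictionary as a TSV row could look like the following sketch, which uses only the standard library. The column names in `row` are hypothetical placeholders; the actual keys returned by flatten() are not listed in this documentation.

```python
import csv
import io

# hypothetical flattened pair; the real column names may differ
row = {
    'break1_chromosome': '1',
    'break1_position_start': 1,
    'break2_chromosome': '1',
    'break2_position_start': 9999,
    'opposing_strands': True,
}

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=list(row), delimiter='\t', lineterminator='\n'
)
writer.writeheader()   # header line built from the dictionary keys
writer.writerow(row)   # one TSV row built from the dictionary values
tsv_text = buffer.getvalue()
print(tsv_text)
```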

get_bed_repesentation()
interchromosomal

bool – True if the breakpoints are on different chromosomes, False otherwise

mavis.breakpoint.read_bpp_from_input_file(filename, expand_ns=True, force_stranded=False, **kwargs)

reads a file using the TSV module. Each row is converted to a breakpoint pair and other column data is stored in the data attribute

Parameters:filename (str) – path to the input file
Returns:a list of pairs
Return type:list of BreakpointPair

Example

>>> read_bpp_from_input_file('filename')
[BreakpointPair(), BreakpointPair(), ...]

One can also validate other expected columns that will go in the data attribute using the usual arguments to the TSV.read_file function

>>> read_bpp_from_input_file('filename', cast={'index': int})
[BreakpointPair(), BreakpointPair(), ...]
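
The effect of a cast argument can be approximated with the standard library, as in the sketch below. `read_rows` and the column names are hypothetical and stand in for the TSV module, whose exact behaviour is not described here; the sketch only shows the idea of converting named columns while reading tab-separated rows.

```python
import csv
import io


def read_rows(handle, cast=None):
    """Read tab-separated rows and convert selected columns (hypothetical helper)."""
    cast = cast or {}
    rows = []
    for row in csv.DictReader(handle, delimiter='\t'):
        for column, func in cast.items():
            # a missing column raises KeyError, which acts as a basic validation
            row[column] = func(row[column])
        rows.append(row)
    return rows


# an in-memory file stands in for the input path
example = io.StringIO(
    'break1_chromosome\tbreak1_position_start\tindex\n'
    '1\t1\t7\n'
)
rows = read_rows(example, cast={'index': int})
print(rows[0]['index'] + 1)
```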