breakpoint module¶

class mavis.breakpoint.Breakpoint(chr, start, end=None, orient='?', strand='?', seq=None)[source]¶

Bases: mavis.interval.Interval

class for storing information about an SV breakpoint. Coordinates are 1-indexed.

Parameters:

chr (str) – the chromosome
start (int) – the genomic position of the breakpoint
end (int) – if the breakpoint is uncertain (a range) then specify the end of the range here
strand (STRAND) – the strand
orient (ORIENT) – the orientation (which side is retained at the break)

Examples

>>> Breakpoint('1', 1, 2)
>>> Breakpoint('1', 1)
>>> Breakpoint('1', 1, 2, orient='R', strand='+')
>>> Breakpoint('1', 1, orient='R')
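
The coordinate conventions above (1-indexed positions, with an optional end for an uncertain breakpoint) can be illustrated with a small stand-alone sketch. This is not the mavis implementation: `SketchBreakpoint` is a hypothetical class, and the choices that `end` defaults to `start` and that the length counts both ends inclusively are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchBreakpoint:
    """Hypothetical stand-in for a breakpoint; not mavis.breakpoint.Breakpoint."""
    chr: str
    start: int
    end: Optional[int] = None
    orient: str = '?'
    strand: str = '?'

    def __post_init__(self):
        # Assumption: a breakpoint with no explicit end is a single position.
        if self.end is None:
            object.__setattr__(self, 'end', self.start)
        # Coordinates are 1-indexed, so 0 or negative positions are rejected.
        if self.start < 1 or self.end < self.start:
            raise ValueError('expected 1 <= start <= end')

    def __len__(self):
        # Assumption: the range is inclusive at both ends.
        return self.end - self.start + 1


exact = SketchBreakpoint('1', 1, orient='R')
uncertain = SketchBreakpoint('1', 1, 2, orient='R', strand='+')
```

Passing `orient` and `strand` by keyword, as here, avoids depending on their positional order in the signature.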

key¶

to_dict()[source]¶

class mavis.breakpoint.BreakpointPair(b1, b2, stranded=False, opposing_strands=None, untemplated_seq=None, data={})[source]¶

Bases: object

Parameters:

b1 (Breakpoint) – the first breakpoint
b2 (Breakpoint) – the second breakpoint
stranded (bool) – if not stranded then +/- is equivalent to -/+
opposing_strands (bool) – are the strands at the breakpoint opposite? i.e. +/- instead of +/+
untemplated_seq (str) – seq between the breakpoints that is not part of either breakpoint
data (dict) – optional dictionary of attributes associated with this pair

Note

untemplated_seq should always be given with respect to the positive/forward reference strand

Example

>>> BreakpointPair(Breakpoint('1', 1), Breakpoint('1', 9999), opposing_strands=True)
>>> BreakpointPair(Breakpoint('1', 1, strand='+'), Breakpoint('1', 9999, strand='-'))

breakpoint_sequence_homology(REFERENCE_GENOME)[source]¶

for a given pair of breakpoints, matches the sequence opposite the partner breakpoint. This comparison is made against the reference genome only and does not use novel or untemplated sequence. For this reason, insertions will never return any homologous sequence

small duplication event CTT => CTTCTT

GATACATTTCTTCTTGAAAA reference
---------<========== first breakpoint
===========>-------- second breakpoint
---------CT-CT------ first break homology
-------TT-TT-------- second break homology

Parameters:	REFERENCE_GENOME (`dict` of `Bio.SeqRecord` by `str`) – dict of reference sequence by template/chr name
Returns:	a pair of `str`: the homologous sequence at the first breakpoint, then the homologous sequence at the second breakpoint
Return type:	tuple
Raises:	`AttributeError` – for non-specific breakpoints
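
The duplication diagram above can be reproduced with a short stand-alone sketch. It is not the mavis implementation: `homology_sketch` is a hypothetical helper that works on a plain string, handles only the case drawn in the diagram (same chromosome, first breakpoint retaining its right side, second breakpoint retaining its left side), and ignores strand. The walking directions and the rule that a walk may not cross the partner breakpoint are assumptions inferred from the diagram.

```python
def homology_sketch(ref, pos1, pos2):
    """Return (first, second) homologous sequences for 1-indexed pos1 < pos2.

    Hypothetical helper for illustration only.
    """
    seq = ref.upper()

    # First-break homology: walk right from the first breakpoint while the
    # bases match those just past the second breakpoint.
    first = []
    i, j = pos1, pos2 + 1
    while i < pos2 and j <= len(seq) and seq[i - 1] == seq[j - 1]:
        first.append(seq[i - 1])
        i += 1
        j += 1

    # Second-break homology: walk left from the second breakpoint while the
    # bases match those just before the first breakpoint.
    second = []
    i, j = pos1 - 1, pos2
    while i >= 1 and j > pos1 and seq[i - 1] == seq[j - 1]:
        second.append(seq[j - 1])
        i -= 1
        j -= 1
    second.reverse()  # collected right-to-left, so restore reference order

    return ''.join(first), ''.join(second)


reference = 'GATACATTTCTTCTTGAAAA'
first_homology, second_homology = homology_sketch(reference, 10, 12)
# first_homology == 'CT', second_homology == 'TT', as in the diagram
```

Because only reference bases are compared, untemplated sequence never enters the result, which is consistent with the statement that insertions return no homologous sequence.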

classmethod call_breakpoint_pair(read1, read2=None, REFERENCE_GENOME=None)[source]¶

calls a breakpoint pair from a single pysam-style read or from a pair of reads

Parameters:

read1 (pysam.AlignedSegment) – the first read
read2 (pysam.AlignedSegment) – the second read
Returns:	the newly called breakpoint pair from the contig
Return type:	BreakpointPair

Todo

return multiple events not just the major event

classmethod classify(pair)[source]¶

uses the chr, orientations and strands to determine the possible structural_variant types that this pair could support

Parameters:	pair (BreakpointPair) – the pair to classify
Returns:	a list of possible SVTYPE
Return type:	`list` of `SVTYPE`

Example

>>> bpp = BreakpointPair(Breakpoint('1', 1), Breakpoint('1', 9999), opposing_strands=True)
>>> BreakpointPair.classify(bpp)
['inversion']
>>> bpp = BreakpointPair(Breakpoint('1', 1, orient='L'), Breakpoint('1', 9999, orient='R'), opposing_strands=False)
>>> BreakpointPair.classify(bpp)
['deletion', 'insertion']

see related theory documentation
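
As a rough illustration of how chromosome, orientation and strand information can narrow down the event type, the following stand-alone sketch applies a simplified rule table. It is not the mavis implementation: `classify_sketch` is a hypothetical function, the rule table and the names 'duplication', 'translocation' and 'inverted translocation' are assumptions, and the first breakpoint is assumed to be the one at the lower position. Only the two results shown in the example above come from this documentation.

```python
def classify_sketch(chr1, chr2, orient1='?', orient2='?', opposing_strands=False):
    """Return a sorted list of candidate event names (simplified, assumed rules)."""

    def compatible(orient, wanted):
        # '?' means the orientation is unspecified, so it matches anything.
        return orient in (wanted, '?')

    if chr1 != chr2:
        return ['inverted translocation'] if opposing_strands else ['translocation']
    if opposing_strands:
        return ['inversion']

    candidates = set()
    # Outer sides retained (L then R): sequence between the breaks is lost or replaced.
    if compatible(orient1, 'L') and compatible(orient2, 'R'):
        candidates.update(['deletion', 'insertion'])
    # Inner sides retained (R then L): sequence between the breaks is repeated.
    if compatible(orient1, 'R') and compatible(orient2, 'L'):
        candidates.add('duplication')
    if not candidates:
        raise ValueError('orientations are inconsistent with same-strand breakpoints')
    return sorted(candidates)


inversion = classify_sketch('1', '1', opposing_strands=True)
deletion_or_insertion = classify_sketch('1', '1', 'L', 'R', opposing_strands=False)
```

With these inputs the sketch returns `['inversion']` and `['deletion', 'insertion']`, matching the two documented results.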

copy()[source]¶

flatten()[source]¶

returns the breakpoint pair information as key-value pairs that can be written directly as a TSV row

get_bed_repesentation()[source]¶

interchromosomal¶

bool – True if the breakpoints are on different chromosomes, False otherwise

mavis.breakpoint.read_bpp_from_input_file(filename, expand_ns=True, force_stranded=False, **kwargs)[source]¶

reads a file using the TSV module. Each row is converted to a breakpoint pair and other column data is stored in the data attribute

Parameters:	filename (str) – path to the input file
Returns:	a list of pairs
Return type:	`list` of `BreakpointPair`

Example

>>> read_bpp_from_input_file('filename')
[BreakpointPair(), BreakpointPair(), ...]

One can also validate other expected columns that will go in the data attribute using the usual arguments to the TSV.read_file function

>>> read_bpp_from_input_file('filename', cast={'index': int})
[BreakpointPair(), BreakpointPair(), ...]
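
The row-to-pair conversion described above can be sketched without mavis using only the standard library. This is an illustration, not the library's code: `read_pairs_sketch` is a hypothetical function, the four breakpoint column names are assumed for the example, and plain dicts stand in for BreakpointPair objects. The `cast` argument mirrors the idea shown above of converting extra columns before they are stored in the data attribute.

```python
import csv
import io

# Assumed column names, used only in this sketch.
CORE_COLUMNS = (
    'break1_chromosome', 'break1_position',
    'break2_chromosome', 'break2_position',
)


def read_pairs_sketch(handle, cast=None):
    """Read tab-separated rows; core columns become breakpoints, the rest go to 'data'."""
    cast = cast or {}
    pairs = []
    for row in csv.DictReader(handle, delimiter='\t'):
        # Apply the requested conversions to the extra columns.
        for column, func in cast.items():
            row[column] = func(row[column])
        pairs.append({
            'break1': (row['break1_chromosome'], int(row['break1_position'])),
            'break2': (row['break2_chromosome'], int(row['break2_position'])),
            # Every non-breakpoint column is kept as an attribute of the pair.
            'data': {k: v for k, v in row.items() if k not in CORE_COLUMNS},
        })
    return pairs


example_file = io.StringIO(
    'break1_chromosome\tbreak1_position\tbreak2_chromosome\tbreak2_position\tindex\n'
    '1\t100\t1\t9999\t7\n'
)
pairs = read_pairs_sketch(example_file, cast={'index': int})
```

Here the `index` column ends up in `pairs[0]['data']` as the integer 7 rather than the string `'7'`.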