“In general the coordinates in psl files are “zero based half open.” The first base in a sequence is numbered zero rather than one. When representing a range the end coordinate is not included in the range. Thus the first 100 bases of a sequence are represented as 0-100, and the second 100 bases are represented as 100-200. There is another little unusual feature in the .psl format. It has to do with how coordinates are handled on the negative strand. In the qStart/qEnd fields the coordinates are where it matches from the point of view of the forward strand (even when the match is on the reverse strand). However on the qStarts[] list, the coordinates are reversed.” —
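The coordinate convention quoted above can be illustrated with a short Python sketch. Python slices are also zero-based and half-open, so the first and second 100 bases map directly onto `seq[0:100]` and `seq[100:200]`. The helper `flip_interval` is a hypothetical name used only for illustration and is not part of `mavis.blat`. It assumes that a reverse-strand interval is converted to forward-strand coordinates by subtracting both ends from the sequence size and swapping them.

```python
def flip_interval(start, end, size):
    """Map a zero-based half-open interval to the opposite strand.

    Hypothetical helper for illustration only (not part of mavis.blat).
    Assumes the conversion is: new_start = size - end, new_end = size - start.
    """
    return size - end, size - start


# A toy 300-base sequence; the content is arbitrary.
seq = "ACGT" * 75

# Zero-based half-open: the end coordinate is excluded from the range.
first_100 = seq[0:100]
second_100 = seq[100:200]

# A block at 0-100 on the reverse strand of a 300-base query
# corresponds to 200-300 from the forward strand's point of view.
forward_start, forward_end = flip_interval(0, 100, len(seq))
```

Applying `flip_interval` twice returns the original interval, so the same function converts in either direction.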

class mavis.blat.Blat[source]

Bases: object

static millibad(row, is_protein=False, is_mrna=True)[source]

This function is used in calculating percent identity. It is a direct translation of the Perl code.

static percent_identity(row, is_protein=False, is_mrna=True)[source]
static pslx_row_to_pysam(row, bam_cache, reference_genome)[source]

Given a ‘row’ read from a pslx file, converts the row to a BlatAlignedSegment object.

  • row (dict of str) – a row object from the ‘read_pslx’ method
  • bam_cache (BamCache) – the bam file/cache to use as a template for creating reference_id from chr name
  • reference_genome (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name
static read_pslx(filename, seqid_to_sequence_mapping, is_protein=False, verbose=True)[source]
static score(row, is_protein=False)[source]

Direct translation of the UCSC guidelines on replicating the web BLAT score.

Below are lines from the Perl code that I’ve rewritten in Python:

my $sizeMul = pslIsProtein($blockCount, $strand, $tStart, $tEnd, $tSize, $tStarts, $blockSizes);
sizeMul = 1 for DNA
my $pslScore = $sizeMul * ($matches + ($repMatches >> 1) ) - $sizeMul * $misMatches - $qNumInsert - $tNumIns
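As a rough illustration, the Perl scoring line above could be written in Python along the following lines. This is a minimal sketch and not the actual body of `Blat.score`. It makes these assumptions: the row is a dict keyed by the standard PSL column names (`matches`, `misMatches`, `repMatches`, `qNumInsert`, `tNumInsert`), which may differ from the keys MAVIS uses; the truncated final term is the target insert count; and the size multiplier is 3 for protein alignments (the text here only states that it is 1 for DNA). The name `psl_score` is hypothetical.

```python
def psl_score(row, is_protein=False):
    """Sketch of the web BLAT score for a single PSL row.

    Assumptions (not confirmed by the mavis source shown here):
    - `row` is a dict keyed by the standard PSL column names
    - the final term of the formula is the target insert count
    - the size multiplier is 3 for protein, 1 for DNA
    """
    size_mul = 3 if is_protein else 1
    return (
        # repMatches >> 1 halves the repeat matches, rounding down
        size_mul * (row["matches"] + (row["repMatches"] >> 1))
        - size_mul * row["misMatches"]
        - row["qNumInsert"]
        - row["tNumInsert"]
    )


example_row = {
    "matches": 90,
    "repMatches": 5,
    "misMatches": 3,
    "qNumInsert": 1,
    "tNumInsert": 2,
}
dna_score = psl_score(example_row)  # 1 * (90 + 2) - 3 - 1 - 2
```

Repeat matches count for half as much as unique matches, and each insert in the query or target costs one point regardless of its length.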