Bioinformatics Methods Lauren Coombe Rene L Warren October 10th, 2017 ASSEMBLY ======== ARCS scaffolding ---------------- The ARCS scaffolder was used to further contiguate the assembly using 10x genomics linked read (version 1.0.1 with ARCS parameters -s 98 -c 5 -l 0 -d 0 -r 0.05 -e 30000 -v 1 -m 20-100000 -z 500 and LINKS parameters –l 5 –a 0.3) (Yeo et al. 2017). Only the NID genotype benefited from ARCS scaffolding. Each draft assembly was compared to the reference three-spine stickleback genome reference (Peichel et al. 2017) using a circos (Krzywinski et al. 2009) pipeline adapted for whole-genome comparisons (https://github.com/JustinChu/JupiterPlot/). The draft sequence of each genotype was ordered and oriented to match the reference genome. Gap-filling ----------- Sealer (version 2.0.1, with parameters -L 150, -j 48, -P 10, -b 50G, -k 115-75 step 10) (Paulino et al., 2015) was used to automatically close gaps in each genotype draft assembly, using corresponding 10x Genomics linked reads. COMPARATIVE GENOMICS ==================== Breakpoint analysis ------------------- Abyss-samtobreak (Jackman et al., 2017) was used to assess the number of breakpoints observed in each assembly prior to gap-filling as compared to the three-spine stickleback genome reference (Peichel et al., 2017). Briefly, the assemblies were first broken at "N"s to generate scaftigs, then the scaftigs were aligned to the stickleback reference genome using BWA mem (version 0.7.15, with parameters -x intractg, -t 4) (Li and Durbin, 2010). Next, the abyss-samtobreak script (-G 400000000) parses the resulting alignments to tabulate contig and scaffold breakpoints. Contig breakpoints occur when a given scaftig does not align to the reference in a single block, while scaffold breakpoints are identified when scaftigs do not align colinearly. Scaffold breakpoints were further categorized by identifying when the scaftigs flanking a scaffold breakpoint mapped to the same chromosome (intra-chromosomal), different chromosomes (inter-chromosomal) and different strands of the reference genome (inversion). Sequence identity analysis -------------------------- To estimate the sequence identity between pairs of stickleback assemblies, we used a k-mer based approach, as previously described (Warren et al., 2015). The approach uses the number of k-mers that are present in both genomes to estimate the sequence identity metric. Briefly, the genome sequences are broken into k-mers, which are loaded into separate Bloom filters. Then, the set k-mer bit intersection of these Bloom filters is measured and used to estimate sequence identity. This analysis was done for each pair of the four draft assemblies, as well as for each draft assembly compared to the stickleback reference. Phylogenetic analyses --------------------- The sequencing methodology employed here was designed at isolating high molecular weight DNA (10x Genomics protocol) and we do not find evidence of mitochondrial sequences in our assembly nor sequencing reads. In absence of a mitochondrion genome, we used eight markers described in the literature (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1838417/) (Li et al., 2007) to build a phylogeny for the four genotypes described herein. We first collected the set of eight markers [GenBank: EF032909-EF032921; EF032935-EF033012; EF033026-EF033038] using entrez and aligned the G. aculeatus sequences against each stickleback draft genome using BLAST (Altschul et al., 1990). We then proceeded to extract the nucleotide sequences and build phylogenies (MUSCLE [Edgar, 2004], Mega7 [Kumar et al., 2016]) with that of other species, including outgroups Danio rerio, Morone chrysops and Lycodes atlanticus. BUSCO ----- BUSCO (version 2.0.1, with parameter -m genome) (Simao et al., 2015) was run to assess the completeness of the four draft stickleback assemblies. Using actinopterygii_odb9 as the reference lineage, the completeness of 4584 single-copy nuclear genes in the assemblies was evaluated.