RLW/LC Oct. 10, 2017

Each genotype's sequence assembly was subjected to ARCS scaffolding (Yeo et al. 2017), re-using the 10x Genomics linked reads to further order and orient sequences. However, only NID (Atlantic freshwater, see below) benefited from this scaffolding: its scaffold N50 length increased from ~1.5 to ~2.2 Mbp. It is unclear why the other assemblies did not improve, but in our experience, limited ARCS scaffolding is often attributable to:
1) the size of the original molecules, the quality of the HMW DNA, or the quality of the 10x libraries;
2) the contiguity of the draft assembly prior to scaffolding;
3) the limited room to further improve a Supernova assembly that has already made optimal use of the long-range information.

We ordered and oriented each sequence relative to the Peichel et al. (2017) reference genome. Note that we opted to list the sequences in the order they align, as distinct sequences (instead of concatenating them into each chromosome), because the order/orientation is relative to a draft and may not reflect the actual genome structure of each individual genotype. Accordingly, the sequences are arranged by chromosome and listed in order (5'->3'). We named each sequence as follows: chromosome_scaffoldID (e.g. >I_scaffold6386). Unplaced sequences are named chrUn_scaffoldXXXX.

Note: The mitochondrial genomes of the four genotypes are not found in the individual Supernova assemblies. We looked for read evidence by targeted assembly, scanning the sequencing reads for Mt sequences, but in vain.

Re-arranged* drafts

*Sequences from assemblies arranged in order and orientation to match the most recent (Peichel et al.
2017) reference genome

Stats (all sequences)

n      n:500  L50  min      N80      N50      N20      E-size   max      sum      name
25381  25379  227  511      50443    427495   1155216  694160   4339705  412.8e6  BAMfinal.fa (Pacific marine)
18457  18457  332  508      65381    302384   758318   486187   3808012  415.5e6  BOTfinal.fa (Pacific freshwater)
13880  13879  82   636      409487   1461796  3079872  1884232  7213217  424.1e6  NID (Atlantic freshwater, BEFORE ARCS)
13758  13757  50   636      517052   2169369  5629607  3082520  11.43e6  424.9e6  NIDfinal.fa (Atlantic freshwater, AFTER ARCS)
14497  14496  141  768      203203   835017   1756536  1043468  4398663  421.2e6  SYLfinal.fa (Atlantic marine)
23     23     10   15742    17.25e6  19.91e6  28.99e6  21.36e6  33.31e6  446.6e6  Gac-HiC_revised_genome_assembly.fa (Peichel et al. 2017)

Stats (sequences assigned to chromosomes only)

n      n:500  L50  min      N80      N50      N20      E-size   max      sum      name
21536  21534  216  511      71507    450240   1172930  710760   4339705  403.1e6  BAM_NG120final_unmerged_assigned.fa
14859  14859  316  508      76613    320467   765286   497710   3808012  405.8e6  BOT_NG120final_unmerged_assigned.fa
10673  10672  48   636      598958   2195626  5629607  3138186  11.43e6  417.3e6  NID_NG120final_unmerged_assigned.fa
11429  11428  136  768      233721   906474   1768447  1064422  4398663  412.8e6  SYL_NG120final_unmerged_assigned.fa
21     21     9    15.11e6  17.25e6  19.91e6  28.99e6  21.39e6  33.31e6  426e6    Gac-HiC_revised_genome_assembly_chromosomesOnly.fa

Stats (unassigned sequences)

n      n:500  L50  min      N80      N50      N20      E-size   max      sum      name
3845   3845   820  812      1493     3062     7104     5815     60504    9721049  BAM_NG120final_unmerged_unassigned.fa
3598   3598   725  866      1573     3411     8190     7105     64930    9760311  BOT_NG120final_unmerged_unassigned.fa
3085   3085   704  851      1500     2854     5676     10307    148057   7561048  NID_NG120final_unmerged_unassigned.fa
3068   3068   588  821      1585     3347     10249    8979     105175   8362262  SYL_NG120final_unmerged_unassigned.fa
1      1      1    20.6e6   20.6e6   20.6e6   20.6e6   20.6e6   20.6e6   20.6e6   Gac-HiC_revised_genome_assembly_unassigned.fa

Gap Sealing Results:

Genotype  # gaps pre-Sealer  # gaps closed
F1-BAM    20298              6703 (33.02%)
F1-SYL    20239              6967 (34.42%)
Wild-NID  20322              7431 (36.57%)
Wild-BOT  18817              5226 (27.77%)

ABySS-Bloom - Whole-genome comparison of sequence identity based on shared k-mers

Resulting sequence identities:

          F1-BAM  F1-SYL               Wild-BOT             Wild-NID
F1-BAM    -       99.26 +/- 3.114e-06  99.22 +/- 4.868e-06  99.25 +/- 2.490e-06
F1-SYL    -       -                    99.56 +/- 3.130e-06  99.79 +/- 8.367e-07
Wild-BOT  -       -                    -                    99.54 +/- 2.950e-06
Wild-NID  -       -                    -                    -

ABySS-samtobreak - Breakpoint analysis

Summary of breakpoints (bps = breakpoints):

Genotype  # contig bps  # scaffold bps  Total # bps  # intra-chrom bps  # inter-chrom bps  # inversions (intra-chrom)
F1-BAM    4416          113             4529         75                 38                 51
F1-SYL    4828          151             4979         81                 70                 58
Wild-BOT  4444          96              4540         54                 42                 28
Wild-NID  4842          315             5157         202                113                164

*Note: The intra-/inter-chromosomal counts categorize the scaffold breakpoints only.
*Note: Inversions were determined by looking at pairs of scaftigs that result in a scaffold breakpoint, and counting the breakpoint as an inversion if the scaftigs align to different strands of the reference.

Interesting inter-chromosomal breakpoints:

I parsed the abyss-samtobreak output to look for inter-chromosomal breakpoints that are common to the different genotype assemblies. I required that the positions of the breakpoints on each chromosome be within 500 bp of each other in the different genotypes.
NOTE: The coordinates listed correspond to the coordinate on the reference that is closest to the breakpoint between the scaftigs and is still a mapped position. (In many of the cases the scaftigs end in soft-clipping.)

Genotypes                   ChrA  ChrA pos                    ChrB  ChrB pos  Consecutive contigs?
Wild-BOT, Wild-NID, F1-SYL  III   3382821                     XII   20768631  no
Wild-NID, F1-SYL, F1-BAM    VIII  18886913                    II    21773900  yes
Wild-NID, F1-SYL            XII   20768011                    II    23605015  no
Wild-NID, F1-SYL            XII   366662 (NID), 366523 (SYL)  X     17550209  yes
Wild-NID, F1-SYL            XIV   150222 (NID), 150208 (SYL)  VI    1211916   no
Wild-NID, F1-SYL            XV    372884 (NID), 372997 (SYL)  IV    24676305  yes

samtobreak filters the alignments prior to calling breakpoints, so a scaffold breakpoint can be called based on scaftigs that were not originally consecutive. For example, if you have contig1_0, contig1_1 and contig1_2, and contig1_1's alignment is filtered out, then contig1_0 and contig1_2 are now considered to be a pair.

Interesting intra-chromosomal breakpoints:

Genotypes                   Chrom  Breakpoint Pos 1                                Breakpoint Pos 2                                 Inversion?
Wild-BOT, Wild-NID, F1-BAM  IX     19634361                                        19668228                                         Yes
Wild-NID, F1-BAM, F1-SYL    XX     1562850                                         1712884 (NID), 1712868 (BAM), 1712877 (SYL)      Yes
Wild-NID, F1-SYL, F1-BAM    XIII   16813415 (NID), 16813361 (SYL), 16813197 (BAM)  16887461 (NID), 16887037 (SYL), 16886943 (BAM)*  Yes
Wild-BOT, F1-SYL            XVII   1187872 (BOT), 1187550 (SYL)                    1311629 (BOT), 1311786 (SYL)                     Yes
Wild-NID, F1-SYL            IV     1069874                                         1175530                                          Yes
Wild-NID, F1-SYL            XIII   20494941 (NID), 20494820 (SYL)                  20497350 (NID), 20497467 (SYL)                   Yes

*: For this intra-chromosomal breakpoint, F1-BAM and Wild-NID differ by 518 bp in the coordinate of the second breakpoint position, just outside the accepted 500 bp range for pairwise comparisons.

Interestingly, each of these common intra-chromosomal breakpoints was an inversion, and again we see that Wild-NID and F1-SYL share the most intra-chromosomal breakpoints.

BUSCO:

BUSCO can be used to assess assembly completeness by checking how well a set of expected single-copy genes (based on near-universal single-copy orthologs) is assembled.
The reference lineage used was: actinopterygii

C = Complete BUSCOs; S = Complete and single-copy BUSCOs; D = Complete and duplicated BUSCOs; F = Fragmented BUSCOs; M = Missing BUSCOs; Total = Total BUSCO groups searched

Genotype                         C     S     D    F    M    Total
F1-BAM                           3952  3844  108  357  275  4584
F1-SYL                           4274  4179  95   191  119  4584
Wild-BOT                         4260  4179  81   188  136  4584
Wild-NID                         4158  4059  99   263  163  4584
Reference (Peichel et al. 2017)  4387  4288  99   90   107  4584
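
The BUSCO counts above are easier to compare as percentages of the 4584 groups searched. The following is a minimal sketch, not part of the original analysis: the counts are copied from the table, while the names `BUSCO_COUNTS` and `busco_summary` are hypothetical and introduced only for illustration. The output string imitates the one-line C/S/D/F/M notation commonly seen in BUSCO short summaries. The function also checks that C = S + D and that C + F + M equals the total, which holds for every row of the table.

```python
# BUSCO counts copied from the table above, in the order (C, S, D, F, M, Total).
# The dictionary and function names are illustrative assumptions.
BUSCO_COUNTS = {
    "F1-BAM": (3952, 3844, 108, 357, 275, 4584),
    "F1-SYL": (4274, 4179, 95, 191, 119, 4584),
    "Wild-BOT": (4260, 4179, 81, 188, 136, 4584),
    "Wild-NID": (4158, 4059, 99, 263, 163, 4584),
    "Reference (Peichel et al. 2017)": (4387, 4288, 99, 90, 107, 4584),
}


def busco_summary(complete, single, duplicated, fragmented, missing, total):
    """Return a one-line summary with each category as a percentage of total."""
    # Consistency checks: complete BUSCOs split into single-copy and duplicated,
    # and every searched group is either complete, fragmented or missing.
    if complete != single + duplicated:
        raise ValueError("C must equal S + D")
    if complete + fragmented + missing != total:
        raise ValueError("C + F + M must equal the total searched")

    def pct(count):
        return 100.0 * count / total

    return (
        f"C:{pct(complete):.1f}%[S:{pct(single):.1f}%,D:{pct(duplicated):.1f}%],"
        f"F:{pct(fragmented):.1f}%,M:{pct(missing):.1f}%,n:{total}"
    )


if __name__ == "__main__":
    for genotype, counts in BUSCO_COUNTS.items():
        print(f"{genotype}\t{busco_summary(*counts)}")
```

Expressed this way, the four assemblies can be ranked directly against the reference on the same scale.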