11 January 2013
Rene L. Warren

-A summary of the TCR/BCR sequence analysis can be found in the tcr and bcr directories, respectively.
-Data files have been placed in subdirectories whose names match patient samples. For sample breakdown/information, please refer to samples.tsv
-In each, there are two files:
 1) trackContigsBCRCANDIDATESwithJ.csv (or trackContigsTCRCANDIDATESwithJ.csv for TCR analyses)
 2) insert_contigBCRCANDIDATESwithJ.sql (or insert_contigTCRCANDIDATESwithJ.sql for TCR analyses)

The first file, trackContigs*CRCANDIDATESwithJ.csv, is a comma-separated (csv) file that summarizes the TCR or BCR annotation for EACH INDIVIDUAL AMPLICON. Hence, each record (line) represents a unique BCR/TCR amplicon sequence. (Amplicon sequences were determined by Illumina MiSeq sequencing, with paired reads co-assembled.)

The second file, insert_contig*CRCANDIDATESwithJ.sql, is a structured query language (sql) file whose format is somewhat similar to csv. It is a higher-level analysis of the information found in the first (trackContigs*.csv) file, and each line of the .sql file represents a unique rearrangement. I have not filtered out any information, so out-of-frame, low-depth and ambiguous rearrangements are all listed in the file. See below on how to filter your dataset.

The TCR/BCR clonotype data is organized in MySQL (.sql) format.
Each line corresponds to a unique, unfiltered clonotype.
The columns of the sql file are:

+---------------+-----------------------+------+-----+---------+----------------+
| Field         | Type                  | Null | Key | Default | Extra          |
+---------------+-----------------------+------+-----+---------+----------------+
| id            | int(10) unsigned      |      | PRI | NULL    | auto_increment |
| FK_run__id    | int(10) unsigned      |      | MUL | 0       |                |
| ntSeq         | varchar(200)          | YES  | MUL | NULL    |                |
| depth         | mediumint(8) unsigned |      |     | 0       |                |
| rearrangement | varchar(200)          | YES  | MUL | NULL    |                |
| ntCDR3        | varchar(100)          |      |     |         |                |
| aaCDR3        | varchar(100)          |      |     |         |                |
| ntSeqShort    | varchar(200)          | YES  |     | NULL    |                |
| aaSeqShort    | varchar(200)          | YES  |     | NULL    |                |
| ntSeqLong     | varchar(250)          | YES  |     | NULL    |                |
| aaSeqLong     | varchar(250)          | YES  |     | NULL    |                |
| vName         | varchar(15)           | YES  |     | NULL    |                |
| vDeleted      | smallint(5) unsigned  | YES  |     | NULL    |                |
| vEnd          | smallint(5) unsigned  | YES  |     | NULL    |                |
| vFrame        | smallint(5) unsigned  | YES  |     | NULL    |                |
| jName         | varchar(15)           | YES  |     | NULL    |                |
| jDeleted      | smallint(5) unsigned  | YES  |     | NULL    |                |
| jRestSeq      | varchar(200)          | YES  |     | NULL    |                |
| frameCheck    | smallint(5) unsigned  |      |     | 0       |                |
| vPossible     | mediumint(8) unsigned | YES  |     | NULL    |                |
| ntTCRB        | text                  |      |     |         |                |
| aaTCRB        | text                  |      |     |         |                |
+---------------+-----------------------+------+-----+---------+----------------+

Columns are numbered from 0: the first field in the .sql file (id) is column 0.

A vName containing the pipe separator "|" indicates that more than one possible V could be assigned, because the amplicon did not capture V unambiguously. This is reflected in the "rearrangement" field as well.

For unambiguous V assignments, make sure you search for vPossible==1 (column 19).
If you are interested only in in-frame rearrangements, look for frameCheck==1 (column 18).
If you are interested in rearrangements having a depth of 2 or more (depth being the number of amplicons that harbour the same rearrangement/clonotype), set depth >= 2 (column 3).
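
The three filters above (unambiguous V, in-frame, minimum depth) can be combined in a few lines of Python. This is only a minimal sketch and not part of the original analysis. It assumes each line of the .sql file has already been split into its 22 fields, in the column order listed above; how to split a line depends on the exact layout of the file, which is not specified here. The names filter_clonotypes and to_int are hypothetical helpers introduced for illustration.

```python
# Column positions (0-based), taken from the field list in this README.
DEPTH_COL = 3
FRAME_CHECK_COL = 18
V_POSSIBLE_COL = 19


def to_int(value):
    """Return value as an int, or None if it is missing or not numeric (e.g. NULL)."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def filter_clonotypes(rows, min_depth=2, in_frame_only=True, unambiguous_v_only=True):
    """Yield the rows that pass the depth, frame and V-assignment filters.

    rows is an iterable of already-parsed records (lists of field values).
    Parsing the .sql lines into fields is assumed to happen elsewhere.
    """
    for row in rows:
        depth = to_int(row[DEPTH_COL])
        if depth is None or depth < min_depth:
            continue  # too few amplicons support this rearrangement
        if in_frame_only and to_int(row[FRAME_CHECK_COL]) != 1:
            continue  # out-of-frame rearrangement
        if unambiguous_v_only and to_int(row[V_POSSIBLE_COL]) != 1:
            continue  # more than one candidate V, or V not assigned
        yield row
```

Calling list(filter_clonotypes(rows)) keeps only in-frame rearrangements with a single possible V and a depth of at least 2. Each filter can be relaxed independently, for example filter_clonotypes(rows, min_depth=1, unambiguous_v_only=False) to keep all in-frame rearrangements regardless of depth or V ambiguity.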