Detailed methods for the ALEXA custom microarray design platform

Array Design Method

Following is a brief description of the process by which the ALEXA platform is used to create a custom array design. For a more detailed description of how to use ALEXA, refer to the user manual or the platform code itself.

Database population

A single database instance is created for each species and initially populated with information from EnsEMBL (using the EnsEMBL API). This database will store all probe sequences and the results of all quality tests for these probes. Each probe will also be annotated by creating associations with genes, transcripts, exons, protein motifs, transmembrane domains, etc. For each probe, coordinates are provided to describe the position of the probe relative to the gene sequence as well as the chromosome it resides on. Refer to the database schema for more details. Once the database is populated with gene models, basic statistics are generated to describe the number of transcripts per gene, the size of transcripts and exons, the number of known exon-skipping events, etc. Refer to the pre-computed designs section to see examples of this.

Repeat Masking

To help avoid probes that correspond to repetitive sequences, the complete gene sequence of every imported gene is repeat masked with updated repeat libraries from the Genetic Information Research Institute. A probe is not allowed if more than 25% of its sequence is masked.
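The 25% rule can be illustrated with a minimal Python sketch. It assumes that masked bases appear in the sequence as 'N' (hard masking) or as lowercase letters (soft masking); the function names are hypothetical and are not taken from the ALEXA code.

```python
def masked_fraction(probe_seq: str) -> float:
    """Return the fraction of bases that are repeat masked.

    Assumption: masked bases are written as 'N' or as lowercase letters.
    """
    if not probe_seq:
        raise ValueError("empty probe sequence")
    masked = sum(1 for base in probe_seq if base == "N" or base.islower())
    return masked / len(probe_seq)


def passes_repeat_mask(probe_seq: str, max_masked: float = 0.25) -> bool:
    """A probe is rejected only if MORE than max_masked of it is masked."""
    return masked_fraction(probe_seq) <= max_masked


if __name__ == "__main__":
    for seq in ("ACGTACGN", "ACGTACNN", "ACNNacgt"):
        print(seq, masked_fraction(seq), passes_repeat_mask(seq))
```

Under this reading, a probe that is exactly 25% masked is still allowed, because the text excludes only probes with more than 25% masked sequence.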

Probe Sequence Extraction

Probes are extracted for all known, non-pseudo EnsEMBL genes. Five probe types are generated: Exon, Intron, Exon-Junction, Exon-Boundary and Random Negative control (see the Intro section for descriptions of these). Probe extraction can also be specified for predicted genes; this is recommended for genomes with preliminary levels of annotation, where most genes are still considered to be predicted. The user can also choose between an isothermal and a fixed-length design. In an isothermal design, the length of each probe is allowed to vary to achieve a specified target Tm. This approach is recommended for users who will have their arrays manufactured by NimbleGen Systems Inc. Balancing the Tm of probes across the array potentially reduces the variable performance of probes with dramatically different GC content, all of which are hybridized at a single hybridization stringency. As probes are extracted, the Tm of each probe is calculated by a nearest-neighbor method and simple tests are conducted to avoid ambiguous bases and repetitive elements.
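The idea of an isothermal design can be sketched as follows: for a given start position, try each allowed probe length and keep the one whose Tm is closest to the target. This is only an illustration. The Tm function used here is a rough GC-content estimate standing in for the nearest-neighbor method, and the length limits (24 to 48 bases) and function names are hypothetical.

```python
def simple_tm(seq: str) -> float:
    """Rough GC-based Tm estimate (a stand-in, NOT the nearest-neighbor method)."""
    upper = seq.upper()
    gc = upper.count("G") + upper.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def isothermal_probe(gene_seq, start, target_tm,
                     min_len=24, max_len=48, tm_func=simple_tm):
    """Return the probe starting at `start` whose Tm is closest to target_tm.

    min_len and max_len are hypothetical limits. Returns None if even the
    shortest probe would run past the end of the sequence.
    """
    best = None  # (absolute Tm difference, probe sequence)
    for length in range(min_len, max_len + 1):
        end = start + length
        if end > len(gene_seq):
            break
        candidate = gene_seq[start:end]
        diff = abs(tm_func(candidate) - target_tm)
        if best is None or diff < best[0]:
            best = (diff, candidate)
    return None if best is None else best[1]


if __name__ == "__main__":
    gene = "ATGGCGTACGTTAGCCGATCGATCGGCTAGCTAGGATCCGATCGTACGATCGATCGGCTA"
    probe = isothermal_probe(gene, start=0, target_tm=67.0)
    print(probe, len(probe), round(simple_tm(probe), 2))
```

In a fixed-length design the loop disappears: every probe has the same length and the Tm is simply recorded for later filtering.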
Exon and Intron probes are selected by tiling at 5 bp intervals. Exon-Junction probes are generated for every possible valid exon-exon connection, and Exon-Boundary probes are generated for all exons. Random Negative control probes are randomly generated sequences, selected to uniformly cover the range of probe lengths and Tm values of the experimental probes (Exon, Intron, Exon-Junction and Exon-Boundary).
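Tiling and junction probe generation can be sketched in Python as shown here. Several details are assumptions made for illustration and are not taken from the ALEXA code: a fixed probe length of 36 bases, junction probes centered on the junction, a "valid" connection meaning any exon joined to any exon downstream of it, and exons too short to supply their half of the probe being skipped.

```python
from itertools import combinations


def tile_probes(region_seq, probe_len=36, step=5):
    """Return (offset, probe) pairs tiled every `step` bases along a region."""
    last_start = len(region_seq) - probe_len
    return [(i, region_seq[i:i + probe_len])
            for i in range(0, last_start + 1, step)]


def junction_probes(exons, probe_len=36):
    """Build one probe per exon-exon connection.

    `exons` is a list of exon sequences in transcript order. Each exon is
    paired with every downstream exon (an assumed definition of "valid").
    Returns a dict keyed by (upstream_index, downstream_index).
    """
    if probe_len < 2:
        raise ValueError("probe_len must be at least 2")
    half = probe_len // 2
    probes = {}
    for i, j in combinations(range(len(exons)), 2):
        left, right = exons[i], exons[j]
        if len(left) < half or len(right) < probe_len - half:
            continue  # exon too short to supply its side of the probe
        probes[(i, j)] = left[-half:] + right[:probe_len - half]
    return probes


if __name__ == "__main__":
    print(tile_probes("A" * 46))
    print(junction_probes(["AAAA", "CCCC", "GGGG"], probe_len=4))
```

For a gene with k exons, pairing every exon with every downstream exon gives k(k-1)/2 candidate junctions, which is one reason array density becomes limiting for genes with many exons.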

Probe Quality Testing

All probe sequences are subjected to the following quality tests in an effort to identify unsuitable probes:
(1) Probe Tm. The melting temperature of each probe sequence is calculated using a 'Nearest Neighbor' method. For an isothermal design, probes whose Tm differs from the target Tm by more than 3.5 degrees C will not be used.
(2) Low complexity test. The 'mdust' algorithm is used to identify low complexity regions (homopolymers, for example). Generally, probes with a low complexity region longer than 7-9 bp are considered unsuitable.
(3) Folding. All probe sequences are folded using the Simfold and PairFold algorithms from the RNAsoft package. Simfold is used to identify probes which form strong hairpins. A probe with a hairpin minimum free energy of less than -7.0 kcal/mol will not be used. PairFold is used to identify probes which form strong duplexes with themselves (this interferes with their ability to hybridize with their target). Probes with a PairFold minimum free energy of less than -19.5 kcal/mol will not be used.
(4) Specificity. All probe sequences are tested for specificity to the target gene by searching for sequence similarity in appropriate sequence databases. To ensure that each probe is likely to bind only to sequences corresponding to its target sequence, probes are tested for similarity to all available mRNAs, ESTs, EnsEMBL transcripts, other probe sequences within the design, other regions within the target gene and the entire genome sequence. Generally, a probe will be failed if a significant BLAST hit longer than 50-85% of its length is found (the allowed length varies depending on probe type, sequence database and species).
Note: the quality thresholds described above are the defaults used for pre-computed designs, but each can be modified by the user when creating a custom design.
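The four tests can be pictured as a set of threshold checks applied to metrics that have already been computed by the external tools. The Python sketch below is an illustration only. The class and field names are hypothetical, the Tm test is assumed to be a two-sided window around the target, and the low complexity and specificity defaults (7 bp and 50%) are simply picked from the ranges quoted above.

```python
from dataclasses import dataclass


@dataclass
class ProbeMetrics:
    tm: float                # melting temperature, degrees C
    low_complexity_len: int  # longest low complexity run, bp
    hairpin_mfe: float       # Simfold minimum free energy, kcal/mol
    duplex_mfe: float        # PairFold minimum free energy
    max_hit_fraction: float  # longest non-target hit / probe length


@dataclass
class Thresholds:
    target_tm: float                 # chosen by the user for the design
    tm_window: float = 3.5
    max_low_complexity: int = 7      # assumed; the text gives 7-9 bp
    min_hairpin_mfe: float = -7.0
    min_duplex_mfe: float = -19.5
    max_hit_fraction: float = 0.5    # assumed; the text gives 50-85%


def failed_tests(m: ProbeMetrics, t: Thresholds) -> list:
    """Return the names of the quality tests that a probe fails."""
    failures = []
    if abs(m.tm - t.target_tm) > t.tm_window:
        failures.append("tm")
    if m.low_complexity_len > t.max_low_complexity:
        failures.append("low_complexity")
    if m.hairpin_mfe < t.min_hairpin_mfe:
        failures.append("hairpin")
    if m.duplex_mfe < t.min_duplex_mfe:
        failures.append("self_duplex")
    if m.max_hit_fraction > t.max_hit_fraction:
        failures.append("specificity")
    return failures


if __name__ == "__main__":
    limits = Thresholds(target_tm=67.0)  # hypothetical target Tm
    probe = ProbeMetrics(tm=72.0, low_complexity_len=3, hairpin_mfe=-2.0,
                         duplex_mfe=-10.0, max_hit_fraction=0.2)
    print(failed_tests(probe, limits))
```

Returning the list of failed tests, rather than a single pass/fail flag, makes it easy to see which threshold to relax when too few probes survive for a given gene.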

Probe Filtering

Using the quality metrics described in the previous steps, a filtered probe set is selected from the complete probe population. Probes which fail the quality tests above are excluded, and the 'best' n probes are then selected for each exon or intron (where n is the desired probeset size). In the pre-computed designs, 3 probes are selected for each probeset.
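The selection step amounts to grouping the surviving probes by exon or intron and keeping the top n in each group. The Python sketch below assumes each probe carries a precomputed 'score' where lower is better (for example, distance from the target Tm). That scoring rule, the dictionary keys and the function name are hypothetical; how ALEXA actually ranks probes is not described here.

```python
from collections import defaultdict


def select_probesets(probes, n=3):
    """Keep the n best passing probes for each region (exon or intron).

    `probes` is an iterable of dicts with the keys 'id', 'region',
    'passed' (bool) and 'score' (lower is better; an assumed convention).
    """
    by_region = defaultdict(list)
    for probe in probes:
        if probe["passed"]:
            by_region[probe["region"]].append(probe)
    return {region: sorted(cands, key=lambda p: p["score"])[:n]
            for region, cands in by_region.items()}


if __name__ == "__main__":
    demo = [
        {"id": "p1", "region": "exon1", "passed": True, "score": 0.5},
        {"id": "p2", "region": "exon1", "passed": True, "score": 0.1},
        {"id": "p3", "region": "exon1", "passed": False, "score": 0.0},
        {"id": "p4", "region": "intron1", "passed": True, "score": 0.2},
    ]
    for region, chosen in select_probesets(demo, n=3).items():
        print(region, [p["id"] for p in chosen])
```

A region in which every probe fails is absent from the result, and a region with fewer than n passing probes yields a smaller probeset.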

Summarizing and Visualizing an ALEXA Design

Once the filtering and selection of probes for each gene is complete, the entire design is summarized. Various statistics are generated describing the thermodynamic properties and specificity of the probes as well as the success of probe design for each gene (probe coverage %), etc. An annotation file summarizing the features of each gene such as the number of transcripts, exons, probes selected, etc. is also generated and provides links to custom UCSC tracks which display the genomic position of all probes for a particular gene.

Design Validation and Submission

A brief manual validation of ALEXA designs can be accomplished by visualizing probe coordinates and sequences in the custom UCSC tracks described above for a small number of genes of interest. Once comfortable with the design, the user generates a submission file and sends it to an array synthesis facility. Synthesis can be done in-house in the case of spotted oligo arrays, or by a custom array service such as those provided by NimbleGen.
Note: for complex species such as Human, current custom array densities do not allow the inclusion of all possible probes for all known genes. Increased densities are expected in the near future; until then, you may have to limit your design to genes with fewer exons or to a functional subset of genes.