Obi L. Griffith(1), Erin D. Pleasance(1), Debra L. Fulton(2), Mehrdad Oveisi(1), Martin Ester(3), Asim Siddiqui(1) and Steven J.M. Jones(1)
1. Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada, V5Z 4E6
2. Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada, V5A 1S6
3. School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada, V5A 1S6
I) Supplementary Materials
Supplementary methods, results, and figures referred to in the manuscript:
Suppl. materials (WORD doc)
Supplementary methods, results, and figure titles without actual figures inserted (see below for separate figure files):
Suppl. materials WORD doc (text only)
A) Affymetrix HG-U133A oligonucleotide array
- Gene Expression Omnibus (GEO)
- GEO Platform: GPL96
- Full sample set as of April 07, 2004.
- 889 samples (667 with p-value or detection calls)
- 22215 probes
- Affymetrix annotation file used in analysis
- Complete data (log transformed)
- Processed data
- Gene Pair Pearson Correlations
Affymetrix probe intensities were converted to natural log values. Only probe intensities with a 'P' (present) detection call (p-value < 0.04) were considered; intensities with 'A' (absent) or 'M' (marginal) calls were set to null. All ln(intensity) values were then normalized by subtracting the median and dividing by the inter-quartile range for the experiment (Davidson et al. 2001). Affymetrix probe IDs were mapped to LocusLink IDs using the most current Affymetrix annotation file for the HG-U133A chip (www.affymetrix.com). Probes with ambiguous mapping to LocusLink (see SAGE processing below) were discarded, resulting in a final set of 8106 genes from the Affymetrix dataset. Genes not common to all three datasets (Affymetrix, cDNA and SAGE) were removed, resulting in a final gene set of 5881 genes.
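The per-experiment normalization described above can be sketched as follows. This is a minimal illustration, not the code used in the analysis: the function name `normalize_experiment` and the input layout (parallel lists of intensities and detection calls) are hypothetical, and the quartile convention (the default of Python's `statistics.quantiles`) is an assumption, since the convention actually used is not stated here.

```python
import math
import statistics


def normalize_experiment(intensities, calls):
    """Normalize one experiment's probe intensities (hypothetical helper).

    intensities: raw probe intensities for a single array
    calls: matching detection calls ('P', 'A' or 'M')
    Returns a list in which non-'P' probes are None and the remaining
    values are (ln(intensity) - median) / inter-quartile range.
    """
    # Keep only 'P' calls; everything else becomes null (None).
    logged = [
        math.log(value) if call == "P" and value > 0 else None
        for value, call in zip(intensities, calls)
    ]
    present = [v for v in logged if v is not None]
    if len(present) < 2:
        return [None] * len(logged)

    median = statistics.median(present)
    # Assumption: quartiles follow statistics.quantiles' default method.
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    if iqr == 0:
        return [None] * len(logged)

    return [None if v is None else (v - median) / iqr for v in logged]
```

Because the median and inter-quartile range are computed only from the 'P' values of the same experiment, each array is scaled independently of the others.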
B) cDNA microarray
cDNA microarray data were used as provided by Stuart et al. (2003) except for minor formatting changes. Genes not common to all three datasets (Affymetrix, cDNA and SAGE) were removed, resulting in a final gene set of 5881 genes.
C) SAGE
- Gene Expression Omnibus (GEO)
- GEO Platform: GPL4
- Full sample set as of March 31, 2004.
- 242 samples
- 609,224 tags
SAGE data were first filtered to remove tags present in fewer than 10 libraries, reducing the number of unique tags from 609,224 to 87,521. Next, SAGE tags were mapped to genes by the lowest sense tag predicted from RefSeq or MGC sequences and then mapped to LocusLink IDs using DISCOVERYspace, reducing the unique tag set further to 47,263. In the event of a discrepancy between RefSeq and MGC, the former was taken as correct. If a tag mapped to more than one LocusLink ID, or more than one tag mapped to the same LocusLink ID, it was discarded, resulting in a final set of 15,426 unique tags confidently mapped to LocusLink IDs. SAGE tag counts of zero were converted to nulls. Non-zero SAGE tag counts were converted to log frequency as follows:
Tag frequency = ln((tag count x 10000)/total tags in library)
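As an illustration of the library filter and the tag frequency formula above, a minimal sketch is given below. The function names and the dictionary layout (tag sequence mapped to a list of per-library counts) are hypothetical and are not taken from the analysis code.

```python
import math


def filter_rare_tags(counts_by_tag, min_libraries=10):
    """Keep tags observed (count > 0) in at least min_libraries libraries.

    counts_by_tag: dict mapping a tag to its list of counts, one per library.
    """
    return {
        tag: counts
        for tag, counts in counts_by_tag.items()
        if sum(1 for c in counts if c > 0) >= min_libraries
    }


def tag_log_frequency(tag_count, total_tags_in_library):
    """Return ln((tag count x 10000) / total tags in library).

    Zero counts are returned as None (null), as described in the text.
    """
    if tag_count == 0:
        return None
    return math.log(tag_count * 10000 / total_tags_in_library)
```

For example, a tag counted 5 times in a library of 50,000 tags has a scaled count of 1 per 10,000 tags, so its log frequency is ln(1) = 0.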
III) Distance Calculations and Modifications to C clustering library
A Pearson correlation coefficient was calculated for all possible gene pairs for each platform as a measure of expression similarity. These calculations were performed by a modified version of the C clustering library (de Hoon et al. 2004) on 64-bit Opteron Linux machines with 8-32 GB of memory (code available upon request). In all platforms, genes are represented by a vector of expression values for all the experiments in the data set. In each case, a gene has a null value if it is not represented on that array (cDNA), if no tags were observed (SAGE), or if its intensity was not significantly detected (Affymetrix). Thus, when calculating Pearson distances between gene pairs, the number of shared data points varied from zero to the total number of experiments. A minimum number of common experiments (MCE) was required for each gene pair to provide some confidence in the value calculated (e.g. a Pearson distance based on observations from only two experiments is meaningless). This MCE was 95 for Affymetrix, 28 for cDNA and 23 for SAGE.
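The handling of null values and the MCE threshold can be illustrated with a short sketch. This is not the modified C clustering library code; the function name `pearson_with_mce` and the use of `None` for nulls are assumptions made for the example.

```python
import math


def pearson_with_mce(x, y, min_common):
    """Pearson correlation over experiments where both genes have values.

    x, y: expression vectors for two genes, with None marking null values.
    min_common: minimum number of common experiments (MCE) required.
    Returns None when too few experiments are shared or a vector is constant.
    """
    # Use only experiments in which both genes have a non-null value.
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < min_common or n < 2:
        return None

    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant vector
    return sxy / math.sqrt(sxx * syy)
```

With the thresholds given above, a call such as `pearson_with_mce(gene_a, gene_b, 23)` would correspond to the SAGE setting; gene pairs sharing fewer experiments simply receive no value.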
Open Source Clustering Software Webpage
M.J.L. de Hoon, S. Imoto, J. Nolan, and S. Miyano: Open Source Clustering Software. Bioinformatics, 20 (9): 1453--1454 (2004).
Document explaining modifications
IV) GO Analysis
V) Other recently published coexpression datasets
B) Jensen LJ, Lagarde J, von Mering C, Bork P. 2004. ArrayProspector: a web resource of functional associations inferred from microarray expression data. Nucleic Acids Res. 32(Web Server issue):W445-8.
VI) High-confidence coexpression set
A set of high-confidence coexpressed gene pairs was chosen from the three datasets using the following criteria. These coexpression links are being used by the Genome Sciences Centre Gene Regulation (Informatics) Team to predict human regulatory elements as part of the cisRED project (www.cisred.org). Note: a gene pair may be present in the list multiple times if it passes more than one of the following criteria.
High-confidence Coexpression Criteria (version 1):
- Two-platform combined (2PC) method:
  - Minimum Common Experiments (MCE): Affymetrix and cDNA: 100; SAGE: 25
  - 2PC average Pearson: r_avg >= 0.65
- TMM method: TMM >= 7
- ArrayProspector method: AP >= 0.7
Downloads: 13145 co-expressed gene pairs for 2979 genes
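The version 1 criteria can be expressed as a simple filter, sketched below under several assumptions that are not confirmed by this page: that the 2PC score is the mean of the Pearson correlations from two platforms in which the gene pair meets the MCE, that the MCE values are 100 for Affymetrix and cDNA and 25 for SAGE, and that TMM and ArrayProspector scores are supplied as plain numbers. All function, parameter, and platform key names are hypothetical.

```python
from itertools import combinations

# Assumed reading of the version 1 MCE thresholds (platform keys are made up).
MCE_V1 = {"affy": 100, "cdna": 100, "sage": 25}


def passed_criteria(platform_stats, tmm=None, ap=None,
                    r_avg_min=0.65, tmm_min=7, ap_min=0.7):
    """List the high-confidence criteria that one gene pair passes.

    platform_stats: dict mapping platform -> (pearson_r, n_common_experiments)
    tmm, ap: optional TMM and ArrayProspector scores for the pair
    A pair can pass several criteria, so a list of labels is returned.
    """
    # Keep platforms where the pair has a correlation and meets the MCE.
    usable = {
        name: r
        for name, (r, n_common) in platform_stats.items()
        if r is not None and n_common >= MCE_V1[name]
    }
    passed = []
    # Assumption: 2PC averages r over each pair of usable platforms.
    for a, b in combinations(sorted(usable), 2):
        if (usable[a] + usable[b]) / 2 >= r_avg_min:
            passed.append(f"2PC:{a}+{b}")
    if tmm is not None and tmm >= tmm_min:
        passed.append("TMM")
    if ap is not None and ap >= ap_min:
        passed.append("AP")
    return passed
```

Returning one label per criterion mirrors the note above that a gene pair may appear in the list more than once.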
VII) Updates to high-confidence coexpression set
As part of our studies of gene regulation, the analysis presented above will be updated and expanded to include new expression data, new species, and new analysis methods. Updated high-confidence coexpression sets will be provided on our coexpression resources page.
Davidson GS, Wylie BN, Boyack KW. 2001. Cluster stability and the use of noise in interpretation of clustering. Proc. IEEE Information Visualization 2001, 23-30.
de Hoon MJL, Imoto S, Nolan J, Miyano S. 2004. Open Source Clustering Software. Bioinformatics. 20(9):1453-1454.
Jensen LJ, Lagarde J, von Mering C, Bork P. 2004. ArrayProspector: a web resource of functional associations inferred from microarray expression data. Nucleic Acids Res. 32(Web Server issue):W445-8.
Lee HK, Hsu AK, Sajdak J, Qin J, Pavlidis P. 2004. Coexpression analysis of human genes across many microarray data sets. Genome Res. 14(6):1085-94.
Stuart JM, Segal E, Koller D, Kim SK. 2003. A gene-coexpression network for global discovery of conserved genetic modules. Science. 302(5643):249-55.