Identifying statistically significant patterns in gene expression data
Motivation: Clustering techniques are routinely applied to identify patterns of co-expression in gene expression data. Co-regulation, and involvement of genes in similar cellular function, is subsequently inferred from the resulting clusters. Increasingly sophisticated algorithms have been applied to microarray data; however, less attention has been given to the statistical significance of the results of clustering studies. We present a technique, Significance Analysis of Linkage Trees (SALT), for the analysis of commonly used hierarchical linkage-based clustering.

Results: The statistical significance of pairwise similarity levels between gene expression profiles, a measure of co-expression, is established using a surrogate data analysis method. We find that a modified version of the standard linkage technique, complete-linkage, must be used to generate hierarchical linkage trees with the appropriate properties. The approach is illustrated using synthetic data generated from a novel model of gene expression profiles and is then applied to previously analysed microarray data on the transcriptional response of human fibroblasts to serum stimulation.
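
To make the idea of surrogate-based significance concrete, the following is a minimal sketch of a generic surrogate data test for the similarity of one pair of expression profiles. It is not the SALT procedure itself. It assumes Pearson correlation as the similarity measure and random shuffling of time points as the surrogate scheme, neither of which is specified in this abstract, and the function names `pearson` and `surrogate_p_value` are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return sxy / math.sqrt(sxx * syy)


def surrogate_p_value(x, y, n_surrogates=2000, seed=0):
    """Estimate how often shuffled (surrogate) data reach the observed similarity.

    Assumption: surrogates are built by shuffling the time points of y,
    which destroys any temporal alignment with x.
    """
    rng = random.Random(seed)
    observed = pearson(x, y)
    shuffled = list(y)
    exceed = 0
    for _ in range(n_surrogates):
        rng.shuffle(shuffled)
        if pearson(x, shuffled) >= observed:
            exceed += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (exceed + 1) / (n_surrogates + 1)


# Two made-up profiles with a similar rise and fall over ten time points.
gene_a = [0.1, 0.5, 1.2, 1.9, 2.4, 2.2, 1.5, 0.9, 0.4, 0.2]
gene_b = [0.2, 0.6, 1.0, 2.1, 2.3, 2.0, 1.6, 1.0, 0.3, 0.1]
similarity, p_value = surrogate_p_value(gene_a, gene_b)
```

A small `p_value` indicates that the observed similarity is rarely matched by surrogate data. In a clustering setting, a similarity threshold derived from such a null distribution could then be used to decide which levels of a linkage tree are statistically meaningful.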