What is GS-LAGE?

GS-LAGE is a global analysis strategy for large-scale microarray datasets integrated from public databases such as NCBI GEO. The large collection of microarray data available from public databases provides a rich resource for the physiome-wide analysis of gene expression on a genome scale. However, these datasets have been contributed by many different research groups using a variety of experimental and data-processing methods. Thus, heterogeneity among groups of microarray datasets should be minimized by additional standardization before the integrated datasets are used.

We have rescaled and integrated heterogeneous microarray datasets from the NCBI GEO by using a two-step gene-oriented standardization procedure. GS-LAGE thus provides insights into gene-specific global expression trends across various microarray samples. It also improves the statistical confidence of marginal changes in gene expression between experiments by prioritizing genes whose expression is more selective for specific tissues or experimental conditions over the many other differentially expressed genes.


Currently available microarray datasets from GS-LAGE

Microarray datasets sharing the same platform were collected from the NCBI GEO. Currently, we have constructed two independent pools of datasets, consisting of sample arrays generated with Affymetrix GeneChip Human HG-U133A Array chips (i.e., GPL96) and Affymetrix GeneChip Human HG-U133 Plus 2.0 Array chips (i.e., GPL570). We are expanding the integrated microarray datasets to generalize the "gene-specific interpretation of transcriptome data" to diverse organisms and experimental conditions. Please revisit our webserver in the near future to explore more diverse gene-specific transcriptome datasets.


Rescaling and integration of large-scale transcriptome data

It is assumed that each sample array in the GEO was submitted as normalized data. However, to integrate large-scale heterogeneous microarray datasets from the same platform, each sample array must be re-standardized to a common scale. Thus, each sample was first z-transformed using the average and standard deviation of that sample array,

ui = (xi - µGSM) / σGSM

where xi represents the expression value of the ith probe in the given sample array, µGSM represents the average expression value of all probes in the given sample array, and σGSM represents the standard deviation of all probes in the given sample array. ui is the standardized expression value of the ith probe in the given sample array.
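
The per-sample z-transformation can be sketched as follows. This is a minimal illustration, not the GS-LAGE implementation: the function name `z_transform` and the toy values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which form was used.

```python
from statistics import fmean, pstdev

def z_transform(sample):
    """Standardize one sample array: u_i = (x_i - mean) / sd over all probes."""
    mu = fmean(sample)
    # Assumption: population standard deviation (divide by N).
    sigma = pstdev(sample)
    if sigma == 0:
        raise ValueError("sample array has zero variance")
    return [(x - mu) / sigma for x in sample]

# Toy probe values for a single sample array (not real GEO data).
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
u = z_transform(x)
```

After the transformation, every sample array has an average of 0 and a standard deviation of 1, so arrays submitted on different scales become directly comparable.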

To minimize the heterogeneous properties within each collection of microarray datasets due to systematic bias in diverse experimental conditions, we applied the quantile normalization technique, which creates equal distributions of probe intensities for all sample arrays within the same pool of datasets.
(Find details in Bolstad et al., Bioinformatics, 2003)
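
As a rough illustration of quantile normalization, the sketch below gives every sample array the same distribution by replacing each value with the mean of the values that share its rank across arrays. It is a simplified version under stated assumptions: the function name `quantile_normalize` is hypothetical, all arrays are assumed to hold the same probes in the same order, and ties are broken by position rather than averaged.

```python
def quantile_normalize(samples):
    """samples: list of equal-length lists, one list of probe values per sample array."""
    n_probes = len(samples[0])
    if any(len(s) != n_probes for s in samples):
        raise ValueError("all sample arrays must contain the same number of probes")

    # Reference distribution: the mean across arrays of the values at each rank.
    sorted_samples = [sorted(s) for s in samples]
    reference = [
        sum(s[rank] for s in sorted_samples) / len(samples)
        for rank in range(n_probes)
    ]

    normalized = []
    for s in samples:
        # Probe indices ordered from lowest to highest value in this array.
        order = sorted(range(n_probes), key=lambda i: s[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized

# Two toy sample arrays with three probes each (not real data).
normalized = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

Each probe keeps its rank within its own array, but the set of values is now identical across arrays.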


Intrinsic properties of gene expression

Based on the two collections of large-scale integrated datasets (i.e., the GPL96 and GPL570 datasets), we observed that the average and standard deviation of expression calculated for individual genes are highly correlated between the GPL96 and GPL570 datasets (r=0.94 and 0.84, respectively). This implies that the DB-wide average and standard deviation of a gene's expression are intrinsic properties of that gene (i.e., the gene-specific average expression and standard deviation).




Figure. Relationships between average gene expression and standard deviation among different pools of datasets. (A) The average expression of individual genes for the GPL96 and GPL570 datasets. (B) The standard deviation of gene expression for the GPL96 and GPL570 datasets. (Kim et al., 2009, Submitted)
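
The comparison between pools can be sketched as below: compute each gene's DB-wide average and standard deviation within a pool, then correlate those values across the two pools. The names `gene_stats` and `pearson`, the gene labels, and all numbers are hypothetical toy values; the use of Pearson's r and of the population standard deviation are assumptions, and the output does not reproduce the reported correlations.

```python
from math import sqrt
from statistics import fmean, pstdev

def gene_stats(pool):
    """pool: dict mapping gene -> expression values across all samples in the pool."""
    return {gene: (fmean(values), pstdev(values)) for gene, values in pool.items()}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy pools standing in for two platforms (values are illustrative only).
pool_a = {"G1": [0.1, 0.3, 0.2], "G2": [1.0, 1.4, 1.2], "G3": [-0.5, -0.9, -0.7]}
pool_b = {"G1": [0.2, 0.2, 0.5], "G2": [1.1, 1.5], "G3": [-0.6, -0.8, -1.0]}

stats_a, stats_b = gene_stats(pool_a), gene_stats(pool_b)
genes = sorted(stats_a.keys() & stats_b.keys())  # genes present in both pools

r_mean = pearson([stats_a[g][0] for g in genes], [stats_b[g][0] for g in genes])
r_sd = pearson([stats_a[g][1] for g in genes], [stats_b[g][1] for g in genes])
```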



GS-LAGE score and gene-specific DEG

Since different genes have intrinsically different expression levels and variability according to their biological functions, gene-specific behavior in the expression data should be considered when prioritizing genes of biological significance. Thus, we rescaled the expression signal (ui) of the ith gene by using the DB-wide average expression (µi) and standard deviation (σi) of that gene,

vi = (ui - µi) / σi

where vi is the rescaled expression level, i.e., the GS-LAGE score. With GS-LAGE scores, similar expression values from different genes are represented differently depending on each gene's average expression level and variability.
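
The effect of the gene-specific rescaling can be illustrated with a small sketch. The function name `gs_lage_score` and the numbers are hypothetical; the only thing taken from the text is the form of the rescaling by the gene's DB-wide average and standard deviation.

```python
def gs_lage_score(u, mu_gene, sigma_gene):
    """Rescale a standardized expression value by the gene's DB-wide mean and SD."""
    if sigma_gene <= 0:
        raise ValueError("gene-specific standard deviation must be positive")
    return (u - mu_gene) / sigma_gene

# Two hypothetical genes with the same observed value and the same DB-wide
# average, but different intrinsic variability.
v_stable = gs_lage_score(u=1.0, mu_gene=0.5, sigma_gene=0.1)    # tightly regulated gene
v_variable = gs_lage_score(u=1.0, mu_gene=0.5, sigma_gene=1.0)  # highly variable gene
```

The same observed value yields a much larger score for the tightly regulated gene, which is the sense in which similar expression data are represented differently across genes.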


In addition, it is possible to compute a gene-specific DEG (Differential Expression of Gene) value from vi values of two compared samples,



ui+: the expression level of the ith gene in the control sample
ui-: the expression level of the ith gene in the target sample


Conventional DEG analyses focus only on interpreting the observed change in mRNA expression levels between control and target samples. Gene-specific DEG, in contrast, incorporates the gene-specific behavior of transcriptional activity into the analysis to emphasize the biological significance of a gene's transcriptional activity in the target samples. It is based on the notion that a small DEG value for a gene with a low σ may be more biologically significant than a large DEG value for a gene with a large σ, given that σ represents the intrinsic variability of the gene's expression.
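
The prioritization described above can be sketched as follows. The exact gene-specific DEG formula is not reproduced in this text, so the sketch assumes it is the difference between the GS-LAGE scores of the target and control samples, in which case the gene's DB-wide average cancels and only σ remains. The sign convention (target minus control), the function name `gene_specific_deg`, and the gene labels and values are all assumptions for illustration.

```python
def gene_specific_deg(u_control, u_target, sigma_gene):
    """Assumed form: v_target - v_control = (u_target - u_control) / sigma_gene."""
    if sigma_gene <= 0:
        raise ValueError("gene-specific standard deviation must be positive")
    return (u_target - u_control) / sigma_gene

# Hypothetical genes: a small change in a stable gene versus a large change
# in a highly variable gene.
genes = {
    "stable_gene": {"u_control": 0.20, "u_target": 0.50, "sigma_gene": 0.10},
    "variable_gene": {"u_control": 0.20, "u_target": 1.70, "sigma_gene": 1.50},
}

deg = {name: gene_specific_deg(**values) for name, values in genes.items()}

# Rank genes by the magnitude of their gene-specific DEG, largest first.
ranked = sorted(deg, key=lambda name: abs(deg[name]), reverse=True)
```

Here the stable gene changes by only 0.3 but ranks above the variable gene, which changes by 1.5, because the change is large relative to its own intrinsic variability.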

