Graphite Web - Tutorial

Summary

Introduction
Gene Set Analysis Background
A brief introduction of Graphite web implemented tests

Enrichment Analysis
Global test for group of genes
Gene Set Enrichment Analysis (GSEA)
Signalling Pathway Impact Analysis (SPIA)
Pathway analysis through Gaussian Graphical Models (CliPPER)

Sample Analyses

Pathway visualization and Browse
Analysis

Download and results session
Some importatant references for Gene Set Analyses

Introduction

Graphite web is a public web server for the analysis and visualization of biological pathways using high-throughput gene expression data. Specifically, Graphite web implement five different gene set analyses

Hypergeometric enrichment analysis
Gene Set Enrichment Analysis (GSEA) as proposed by Tian et al. (2005)
Signalling Pathway Impact Analysis (SPIA) (Tarca et al. 2009)
Global Test for groups of genes (Goeman et al. 2004)
Pathway analysis through Gaussian Graphical Models (CliPPER) (Martini et al. 2012)

on three organisms:

Homo sapiens
Mus musculus
Drosophila melanogaster

and two pathway databases:

Reactome
KEGG

Gene set analysis background

The grouping of genes into functionally related entities (gene sets) is of great help for interpreting the results obtained from gene expression experiments (microarray or RNA-seq technologies). Then, in recent years, the interest has moved from the study of individual genes to that of groups of genes (defined for instance by biological pathways). The aim of these methods is to identify groups of genes with possibly moderated but coordinated expression changes in different biological conditions, unravelling the complexity of cellular regulatory processes.

Competitive vs. Self-contained
Several gene set methods have been recently developed, both in the univariate and multivariate context. These methods can be divided into two broad categories: (1) methods based on enrichment analysis performed on a list of genes selected through a gene-level test; (2) methods based on global and multivariate approaches that define a model on the whole gene set. These two approaches have been termed competitive and self-contained, respectively (Goeman and Buhlmann, 2007).

In general these two approaches are based on two fundamentally different null hypotheses: the first type hypothesizes the same level of association of a gene set with the given phenotype as the complement of the gene set, the second type only considers the genes within a gene set and hypothesizes that there is no gene in the gene set associated with the phenotype (Tian et al, 2005).

The main drawbacks with competitive methods are (i) the assumption that genes are independent; and (ii) the use of a cut-off threshold for the selection of differentially expressed genes. In this way, many genes with moderate but meaningful expression changes are discarded by the strict cut-off value, which leads to a reduction in statistical power. On the other hand, global and multivariate approaches relax the assumption of independence among genes belonging to the same gene sets and identify possibly moderate, but coordinated, expression changes that cannot be detected by the previous approach without depending from any arbitrary cut-offs. Given the different null hypothesis, in general competitive methods results in smaller lists of significant pathways with respect to self-contained methods.

Topological vs. Non-topological
On a different perspective, gene set analysis can also be divided into topological and non-topological methods. Biological pathways are structured in a network with explicit biological interactions and can be represented through a graph where genes and their connections are, respectively, nodes and edges of the graph. Most of pathway analyses consider pathways as a simple list of genes belonging to the pathway, missing the relevant topological information contained, and are called non-topological. On the other side, topological analyses exploit the structure of connections among genes to test the association between the genes belonging to the pathway and the phenotype.

The reason why the large majority of methods proposed for pathway analyses are non-topological is based on the difficulties in retrieving pathway annotation and converting pathway topology to gene network. Pathway annotation comprises chemical compounds mediating interactions and different types of gene groups (e.g. protein complexes or gene families) that are usually represented as single nodes but whose measures are not available using microarray or RNA-seq data. Then, a process of conversion from a pathway format to a gene-gene network is required to perform topological analysis.

Our group has recently developed graphite (Sales, Calura et al. 2012), a Bioconductor package that taking the information from different databases, interprets pathway format and reconstruct the correspondent gene-gene networks following specific biologically-driven rules. Graphite web exploit the functions of graphite package to perform topological pathway analysis on its networks.

For more details on graphite pathway conversion see Sales, Calura et al. (2012) and click here.

Methods implemented in Graphite web

1. Enrichment Analysis - competitive and non-topological

Enrichment analysis (Fisher Exact test) estimates the chance probability of observing a given number of genes from a particular pathway among the selected differentially expressed genes (DEG). For each pathway a two-way contingency table is generated as follow:

	DEG	EEG	tot
∈ G	n_G,deg	n_G,eeg	N_G
∉ G	n_GC,deg	n_GC,eeg	N_GC
tot	N_deg	N_eeg	N

Where EEG means equally expressed genes, N is the total number of genes in the platform, G is the pathway and GC is the complement of G. N_i and n_i are the frequencies of genes belonging to each table cell.
Then, the probability P of observing at least nG,deg genes of a pathway within a group of N_deg genes is given by:

formula 1

Then, raw p-values are adjusted using Benjamini-Hockberg method (Benjamini and Hockberg 2005).

2. Global test for group of genes - self-contained and non-topological

Global test is based on a penalised logistic regression model. The idea is to find those genes within a pathway whose combination of expression profiles best predicts clinical data (samples class). In this model the dependent variable is the vector of classes and the covariates are the expression profiles of genes belonging to the pathway. In this model the total number of parameters to be estimated is equal to the number of genes in the pathways. Typically with gene expression data the number of samples is small and often happens that the number of samples would be less then the number of genes in a pathway. In this case the model is called non-identifiable, that means that we do not have a sufficient number of replicate (samples) to efficiently estimate all the parameters. To cope with this unbalance structure of the data, Goeman et al. (2004) proposed penalised regression models. In this type of models the coefficient of some genes are shrinked to zero while the remaining ones are estimated.

3. Gene Set Enrichment Analysis (GSEA) - self-contained or independent and non-topological

Gene set enrichment method (GSEA) was originally proposed by (Subramanian et al 2005). The procedure is based on the following steps:

Select a statistic to compare groups of samples (for instance t-test),
Rank the entire list of genes according to the value of the statistic,
Given a pathway G, compare the distribution of the statistic of G and G complement.

The original version of GSEA proposed to compare G and G complement distributions using a weighted version Kolmogorov-Smirnov (K-S) test, where the weights were given by the absolute value of the statistic. The significance of weighted K-S test was estimated through a permutational approach. The Authors proposed a permutation on samples in case of large sample size, and a permutation on genes otherwise. It is worth noting that the use of a permutation approach on samples or on genes changes the null hypothesis leading respectively to a self-contained or to a competitive method. Finally they normalized the weighted K-S test in order to take into account the pathway dimension. However, it was shown (Tian et al. 2005) that the differences in the correlation structure of each pathway could lead to a biased comparison among gene sets unless a normalization procedure is applied. To cope with this problem Tian et al. (2005) proposes the use of the standard statistical approach for comparing mean shift of the G and the G complement distributions: a one sample z-test with a permutational approach.

Given t_i the statistic of the gene i with i=1… N where N is the number of genes in the platform, the two statistics proposed by Tian et al. (2005) are:

formula 2

with its null distributions generated by permuting t_i,…, t_n; and

formula 3

with its null distributions generated by permuting samples.
It is important to note that although the formula for Ek is the same as that of Tk their probability interpretations and hence their testing procedures are quite different. In Tk, t_i is deterministic and the gene set structure is random; in Ek, the opposite is true.

Then after a process of standardization Ek and Tk become z-test and are called NTk and NEk statistics. However the correlation structure in gene sets can still give false positives in NTk; conversely, NEk can still be influenced by the gene set size. Then Tian et al. (2005) suggest taking as good candidates pathways significant either for NTk or NEk.

Graphite web implements Tian et al (2005) GSEA statistics.

4. Signalling Pathway Impact Analysis (SPIA) - mixed and topological

The method proposed by Tarca et al. (2009) calculates a score through the combination of several aspects of the data: the fold change of DEGs, the pathway enrichment score and the topology of signaling pathways. Specifically, from a topological point of view SPIA enhances the impact of a pathway if the DEGs tend to lie near its entry points of a pathway (an entry point id a gene upstream of the pathway).

SPIA needs as input the list of differentially expressed genes with their log fold changes and the complete list of gene names in the platform. Then, SPIA calculates 1) the hypergeometric enrichement p-values, pNDE, (as described before), and 2) a perturbation factor as a linear function of the perturbation factors of all genes in a given pathway, whose significance is calculated through a bootstrap approach, pPERT (note: the parameter of the linear combination are not estimated from the data but are fixed as +/-1 according to the interactions pathway annotations) and 3) the combination of the two independent p-values, pNDE and pPERT, called pG. Adjusted pG were calculated using popular FDR algorithm (Benjamini and Yekutieli, 2001).

According to the sign of the perturbation score the pathway is defined as activated (positive perturbation score = positively perturbed) otherwise the pathway is inhibited (or negatively perturbed).

5. Pathway analysis through Gaussian Graphical Models (CliPPER) - self-contained and topological

CliPPER (Martini et al. 2012) is an empirical method based on Gaussian graphical models that (i) selects pathways with covariance matrices or means significantly different between experimental conditions; and (ii) on such pathways, identifies the portions of the pathways, called signal paths, mostly associated to the phenotype.

Different experimental conditions are usually compared in terms of their gene expression mean differences. However, the difference in mean expression levels does not necessarily result in a change of the interaction strength among genes. Suppose to have in condition 1 transcript A and B, and their mean expression, respectively x_a and x_b. Suppose, now, that in a condition 2 the expression of A and B are both 10 fold increased (10*x_a and 10*x_b). Then, we have a proportional increase of the expression of both A and B. In this case, the mean expressions of A and B will be significantly different between the conditions 1 and 2 but the correlation strength between A and B won't change. In this case, we will have pathways with significant altered mean expression levels but unaltered biological interactions.

On the contrary, if transcripts abundances ratios are altered, we expect a significant alteration not only of their mean expression levels, but also of the strength of their connections, resulting in pathways with completely corrupted functionality. Suppose that in condition 3 the expression of A and B increase without proportionality (say respectively 10 fold and 2 fold), then, we will have that not only the mean but also the correlation will be significantly different in the condition 1 and 3.

Therefore, to look for pathways strongly involved in a biological process, CliPPER is the first method that 1) selects pathways with mean or variance significantly altered and then 2) highlights the portions, called signal paths, of a pathway mostly correlated with the phenotype.

Sample Analyses

Graphite web has two sections. In the Browse section the user can visualize and map a list of genes (possibly coloured according to a fold change), in the Analyse section the user can perform pathway analysis using gene expression matrices.

"Browse" session for gene mapping and pathway visualization

A list of genes provided by the user will be mapped on all the pathways of the selected database.

Step 1: Select the organism
Step 2: Select the database
Step 3: Input files. The list of the genes of interest can be submitted as a file or directly pasted in the text box. The genes in the list will be mapped on all the pathways of the selected database.
In case of file submission, the file must be tab-delimited with two columns: the first column (mandatory) is the gene ID, the second column (optional) is the log Fold Change associated to each gene. If only the expression directions are available (over/under expression) the user can associate to each gene either a +1 or -1. If no information is available, the second column can be omitted and the expression directions of all genes will be automatically set to +1.
In case the list is pasted in the text box, only gene IDs are accepted (one per row) and the expression direction of all genes are set to +1.

The results

The results are divided into two web pages reporting:

the table of all the pathways with al least one gene mapped (coloured according the fold change if provided);
for each pathway the interactive network-based visualization is provided with nodes coloured according to fold change if provided.

Analysis

Step 1: Select the type of analysis. Note that, given the complexity of the analysis, CliPPER is computational intensive, thus it would require several minutes (at last one hour).
Step 2: Select the Organism.
Step 3: Select the database. In this step, the user has the possibility to set the size of the overlap between the expression data and the pathway to be analyzed. by default is set to 10 genes, this means that only pathways with at least 10 genes in common with expression data have taken into consideration for the analysis.
Step 4/5/6: Input files. The input file format will change according to the analysis selected.

For Enrichment score and SPIA, which require Differentially Expressed Genes (DEGs), two procedures are possible, depending if the user has already computed the Differentially Expressed Genes (DEGs):

NO, the user do not have precomputed the DEGs and want automatically compute the list of differentially expressed gene. The normalized expression matrix must be stored in a tab delimited text file; the first row must contain sample names, the sample name represents the sample class; the first column must contain gene IDs. If the matrix contains not available expressions (NAs), those values will be automatically imputed using the K-nearest neighbur algorithm (impute Bioconductor package). The data can be either Microarray or RNA-seq data, the switch "Experimental data source" allows the selection of the method for DEG computation. In case of microarray DEGs are calculated with empirical Bayes test from limma Bioconductor package, otherwise, for RNA-seq, using negative binomial test from edgeR Bioconductor package.
For both these tests, the "Significance level" box sets the False Discovery Rate threshold for the identification of DEGs. Note: In case of RNA-seq count data, only for enrichment analysis, the statistical test widely used to identify differentially expressed genes is based on the negative binomial distribution (accounting for a quadratic dependence between mean and variance). In this case the reads count defines the power of the test and given the strict dependence between read count and gene length, longer genes are characterised by a higher statistical power to be detected as differentially expressed. It has been shown that this differences in length/power if not properly assessed can introduced some bias in the final results. Graphite web allows an optional accounting for this bias using the p-value correction for the gene length as implemented in goseq Bioconductor package.
YES, the user has the list of DEGs and want submit that list for the analyses. The complete list of the genes in the platform must be a plain text file with as many rows as the number of genes. Each line is a gene ID. The list of differentially expressed genes must be a tab-delimited file with two columns: Gene ID; Change in expression level of a gene - log(Fold Change).

Global test, GSEA and CliPPER require a single expression normalized matrix:

Normalized expression matrix. A tab delimited text file; the first row must contain sample names, the sample name represents the sample class; the first column must contain ENTREZGENE IDs. Both microarray and RNAseq data can be submitted

Download sample files for Enrichment score and SPIA and for Global test, GSEA and CliPPER.

Sometime the analysis could take several minutes to complete. Thus, we suggest the user to give us his email to receive an alert when the analysis is completed.

The results

The results are divided into two web pages reporting:

The table of all the significant pathways according to the test selected with their significant p-values;
The interactive network-based visualization of the significant pathways with nodes coloured according to their contribution to the analysis.

In the following we report a detailed description of the results for each method.

Enrichment Analysis

Pathways with adjusted p (qvalues) less than 0.05 are reported in the first table. Then clicking on a pathway, the network-based visualization of the pathway is available as an interactive tool. Grey nodes are genes not available in the platform, white nodes correspond to genes not differentially expressed, and coloured nodes are genes differentially expressed (as reported by the user) with colour proportional to the log fold change, as shown in palette.

Global test for group of genes

A table with all the significant pathways is reported (a significant pathway has qvalue<= 0.05). In the following a detailed description of each column:

qvalue = adjusted p-values of the pathway
statistic = the observed value of the test statistic
expected = the expected value of the test statistic under the null hypothesis (where all the genes do not contribute to the prediction of the clinical classification)
Stdev = the standard deviation of the test statistic under the null hypothesis
geneNum = the total number of covariates (genes) in the model

Then clicking on a pathway, the network-based visualization of the pathway is available as an interactive tool. Grey nodes are genes not available in the platform, colour nodes are the subset of genes that is most responsible for the significant test result, and white nodes are genes whose contribution is negligible. Since the global test statistic on a collection of genes can be seen as a weighted average of the global test statistics for each individual genes, the contribution of each gene is estimated using global test on each individual gene. In Graphite web only genes with individual global test p-value <= 0.05 are coloured proportionally to the value of its statistic, as shown in palette.

Gene Set Enrichment Analysis (GSEA)

A table with all the significant pathways is reported (qvalue<= 0.05). In the following a detailed description of each column:

Set Size = the number of genes in the pathway
Percent up = Percentage of up-regulated genes
NTk Stat = the value of the Ntk statistic (see above for more details)
NTk q-value = the adjusted p-value of the NTk statistic
NTk rank = the ranking of the pathways according to the Ntk statistic
NEK Stat = the value of the NEk statistic (see above for more details)
NEK q-value = the adjusted p-value of the NEk statistic
NEK Rank = the ranking of the pathways according to the Ntk statistic

Then clicking on a pathway, the network-based visualization of the pathway is available as an interactive tool. Grey nodes are genes not available in the platform, coloured nodes represent individual genes with p-value of the t-test statistic <= 0.05 (that contribute mostly to the significance of the pathway) and the colour is proportional to the value of the statistic, as shown in palette.

Signalling Pathway Impact Analysis (SPIA)

A table with all the significant pathways is reported (qG<= 0.05). In the following a detailed description of each column:

pSize = the number of genes in the pathway.
NDE = number of differentially expressed genes in the pathway (as uploaded by the user).
pNDE = hypergeometric probability of observing Nde differentially expressed genes in the pathway by chance.
tA = observed value of the perturbation score.
pPERT = bootstrap probability associated to tA.
pG = combined probability of pNDE and pPERT.
pFDR = adjusted pG using False Discovery Rate correction.
pGFWER = adjusted pG using Family Wise Error Rate (Bonferroni).
STATUS = Inhibition/Activation according to the negative/positive sign of tA.

Then clicking on a pathway, the network-based visualization of the pathway is available as an interactive tool. Grey nodes are genes not available in the platform, white nodes correspond to genes not differentially expressed, and coloured nodes are genes differentially expressed (as reported by the user) with colour proportional to the log fold change uploaded, as shown in palette.

Pathway analysis through Gaussian Graphical Models (CliPPER)

A table with all the significant pathways for the mean and for the concentration matrices is reported (significant pathway must have both p-values <= 0.05). In the following a detailed description of each column:

alphaVar = adjusted p-value of the test on the concentration matrices of the pathway between groups.
alphaMean = adjusted p-values of the test on the means of the pathways.

between groups. Note that according to the value of alphaVar the test on the means is different; if alphaVar is less than 0.05 the test on the means does not assume homoscedasticity.

CliPPER is the first method that given a pathway is able to select the portions of the pathway mostly associated to the phenotype. Then clicking on a pathway Graphite web returns:

The network-based visualization of the pathway as an interactive tool;
A table with the genes belonging to the highest scored paths (defined as connected portions of pathway) mostly associated to the clinical classification.

Clicking on each paths, the genes belonging to the path are highlighted according to their log fold changes, as shown in palette.

Download sessions

For each significant pathway Graphite web allows the download of:

The pdf image of the network with nodes coloured according to their contribution to the analysis.
A txt file with the list of genes (EntrezGene, Symbol, Description and score) belonging to the pathway and used for the analysis (present in the platform).
A txt file with the list of genes (EntrezGene, Symbol, Description) belonging to the pathway but not used for the analysis (not present in the platform).
A SIF file that specifies nodes and interactions. Typically, a SIF format is used to import a network in Cytoscape.

Furthermore the complete results of Graphite web for all the significant pathways can be downloaded as a single zip file. In case of large pathways, the visualization with Cytoscape web will be too heavy, then Graphite web provide the sif file necessary for a in-house visualization.

Some important references

Pathways conversion to gene-network
Sales G, Calura E, Cavalieri D, Romualdi C. (2012) graphite - a Bioconductor package to convert pathway topology to gene network. BMC Bioinformatics;13:20. PMID:22292714

SPIA
Draghici S, Khatri P, Tarca AL, Amin K, Done A, Voichita C, Georgescu C, Romero R. A systems biology approach for pathway level analysis. (2007) Genome Res.;17:1537-1545. PMID:17785539
Tarca AL, Draghici S, Khatri P, Hassan SS, Mittal P, Kim J-S, Kim CJ, Kusanovic JP, Romero R. (2009) A novel signaling pathway impact analysis. Bioinformatics;25:75-82. PMID:18990722

GSEA
Mootha, V. K., Lindgren, C. M., Eriksson, K. F., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstrale, M., Laurila, E., et al. (2003) PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet. 34 , 267-273. PMID:12808457
Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ. (2005) Discovering statistically significant pathways in expression profiling studies. Proc. Natl. Acad. Sci. USA;102:13544-13549. PMID:16174746
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al. Gene set enrichment analysis: aknowledge-based approach for interpreting genome-wide expression profiles. (2005) Proc. Natl. Acad. Sci. USA;102:15545-15550. PMID:16199517

Global TEST
Goeman JJ, van de Geer SA, de Kort F, van Houwelingen HC. (2004) A global test for groups of genes: testing association with a clinical outcome. Bioinformatics. 20(1):93-9. PMID:14693814
Mansmann U, Meister R. (2005) Testing differential gene expression in functional groups. Goeman's global test versus an ANCOVA approach. Methods Inf Med. 44(3):449-53. PMID:16113772

CliPPER
Martini P, Sales G, Massa MS, Chiogna M, Romualdi C. Along signal paths: an empirical gene set approach exploiting pathway topology. Nucleic Acids Res. 2013 Jan 1;41(1):e19. doi: 10.1093/nar/gks866. Epub 2012 Sep 21. PMID:23002139
Massa MS, Chiogna M, Romualdi C. (2010) Gene set analysis exploiting the topology of a pathway. BMC Syst. Biol.;4:121. PMID:20809931

REVIEW AND DISCUSSION
Tamayo P, Steinhardt G, Liberzon A, Mesirov JP. (2012) The limitations of simple gene set enrichment analysis assuming gene independence. Stat Methods Med Res. Oct 14. [Epub ahead of print]. PMID:23070592
Irizarry RA, Wang C, Zhou Y, Speed TP. Gene set enrichment analysis made simple. (2009) Stat Methods Med Res.18(6):565-75. PMID:20048385
Ackermann M, Strimmer K. (2009) A general modular framework for gene set enrichment analysis. BMC Bioinformatics;10:47. PMID:19192285
Goeman JJ, Buhlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. (2007) Bioinformatics;23:980-987. PMID:17303618
Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y. (2009) Gene-set analysis and reduction. Brief Bioinform. 10(1):24-34. PMID:18836208