01 July 2015
Performing an analysis of large-scale expression data can be a daunting task, as any relevant information on the effects of the monitored treatment is diluted by tens of thousands of profiles that are left unaffected by the condition change. The Warwick team of CyVerse UK set out to create a number of apps that can get you from normalised expression time course data to concise biological hypotheses on regulatory functionality. It should be noted that these tools were created with array data in mind, but you are welcome to use count data as well. If you do, remember to log-transform your data before feeding it into the apps.
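As an illustration of the log-transformation step for count data, here is a minimal Python sketch. The base (log2), the pseudocount of 1 used to avoid taking the log of zero, the function name and the example values are all assumptions made for illustration; the apps themselves do not prescribe a particular transform.

```python
import math


def log2_transform(counts, pseudocount=1.0):
    """Return log2(count + pseudocount) for each value in a genes x time points table.

    The pseudocount is an assumption that keeps zero counts finite.
    """
    return [[math.log2(value + pseudocount) for value in row] for row in counts]


# Hypothetical counts for two genes measured at three time points.
counts = [
    [0, 3, 7],
    [15, 31, 63],
]
transformed = log2_transform(counts)
```

The transformed table keeps the same layout as the input, so it can be saved in whatever format the chosen app expects.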
The first step of an analysis is typically the identification of differentially expressed genes (DEGs), often coupled with temporal deconstruction to help establish the order of relevant events. This goes a long way towards reducing the size of the data and helping you focus on the genes that carry the signal you are trying to decipher. GP2S is an algorithm that detects differential expression between two conditions of a time course, and also features an extension that detects the time at which a gene becomes differentially expressed. If your time course data only features one condition, then the gradient tool is a more fitting method of identifying the timing of events, as it was created with a single-condition time course dataset in mind. It can also potentially be used for DEG identification, by treating any gene that the method flags as changing at some point in time as differentially expressed.
Once in possession of a differentially expressed gene list, further size reduction can be done by performing clustering and biclustering. Algorithms belonging to those families aim to split the data up into groups exhibiting related behaviour. BHC is a hierarchical clustering algorithm that requires minimal user input, and can be used successfully on both time course and static data. TCAP allows a different clustering experience, as the method uses a more complex similarity measure and produces intricate clusters that can feature regulatory interactions extending beyond mere co-regulation. If you’re in possession of multiple time course datasets (such as different conditions), you can try Wigwams, which will mine them for modules of genes co-regulated across at least two of the provided datasets.
Gene groups obtained from the above methods can then be mined for relevant biological information, putting the detected expression trends into context. Two common forms of enrichment analysis are GO terms and transcription factor binding sites. In terms of GO term analysis, all the clustering/biclustering methods support the Cytoscape plugin BiNGO, with one of the output files serving as direct input for BiNGO’s overrepresentation analysis. Two CyVerse apps allow for the analysis of transcription factor binding sites – MEME-LaB performs de novo mining, detecting novel overrepresented motifs, while HMT screens promoters for known transcription factor binding sites. Once again, input files compatible with the apps are provided on output from the clustering/biclustering apps.
Another potential analysis is the identification of underlying regulatory networks. Such models, inferred from transcription factor expression levels, show the signalling chains of transcription factors, helping put the observed downstream co-regulatory events captured by clusters/biclusters into context. CSI proposes a model based on how well upstream transcription factors’ expression profiles explain the downstream transcription factors’ expression profiles at a later time point, and the resulting model can be turned into a Cytoscape-friendly network by applying a stringency threshold on the confidence in each edge. Extensions of the algorithm exist: hCSI can infer related networks across multiple datasets, while oCSI can work with data captured from different species.
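The thresholding step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the edge list, the confidence values, the threshold of 0.5 and the "regulates" interaction label are hypothetical, and the sketch does not read the actual CSI output files. The lines it produces follow Cytoscape's simple interaction format (SIF): source, interaction type, target.

```python
def threshold_edges(edges, threshold):
    """Keep (parent, child, confidence) edges whose confidence meets the threshold."""
    return [(parent, child, conf) for parent, child, conf in edges if conf >= threshold]


def to_sif(edges, interaction="regulates"):
    """Format edges as tab-separated SIF lines: source, interaction type, target."""
    return [f"{parent}\t{interaction}\t{child}" for parent, child, _ in edges]


# Hypothetical edges: (upstream TF, downstream TF, confidence in the edge).
edges = [
    ("TF1", "TF2", 0.92),
    ("TF1", "TF3", 0.10),
    ("TF2", "TF3", 0.55),
]
kept = threshold_edges(edges, threshold=0.5)
sif_lines = to_sif(kept)
```

Raising the threshold gives a sparser, more conservative network; lowering it keeps more edges at the cost of including less well supported ones.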
29 June 2015
The following set of tools was integrated into the Warwick Discovery Environment hosted at Warwick for testing.
Details of these tools can be found at our GitHub repositories (link).
The tools we are focusing on at Warwick’s Systems Biology Department identify de novo regulatory networks from genetic datasets. The tools provided were developed within the department and aim to extract information relevant to plant biologists from modern large genomic datasets.
Gene Expression Timeseries
One set of tools aims to deal with gene expression profiles over time, with each tool useful for inferring different aspects of the underlying biology. These tools can be composed into useful workflows, allowing large amounts of less informative data to be reduced before being fed into more complex analyses and models.
The first of these tools is affectionately known as “Gradient Tool”. It attempts to determine whether anything interesting is happening in the data, that is, whether the data show any dynamic changes in expression over time. The tool is quick to run, so it can be applied to many thousands of genes in a short time, which makes it well suited to initial filtering of the data before they are fed into more complicated models. Scientific details are available here: http://www.plantcell.org/content/23/3/873.long
The second commonly used tool is known as “GP2S”. It detects differential gene expression: given a control and a treatment, do they respond “differently”, and if so, is the response temporally isolated? This tool takes more resources to run and its output requires more human judgement, so we tend to run it after Gradient Tool. Scientific details are available here: doi:10.1089/cmb.2009.0175
The interesting genes discovered by Gradient Tool and GP2S can then be fed into a tool called CSI (for Causal Structural Inference) or HCSI (a hierarchical variant when one has data from more than one related species) in order to discover causal relationships between genes present in the data. This is a very computationally intensive method and can therefore only be applied to a few hundred genes. Scientific details of CSI can be found in doi:10.1098/rsfs.2011.0053 and for HCSI in doi:10.1093/bioinformatics/bts222
An alternative approach to inferring causal relationships can be found in the Wigwams tool. It is less computationally intensive and can therefore be run on much larger numbers of genes, meaning less filtering needs to be performed a priori. Scientific details are available here: doi:10.1093/bioinformatics/btt728
Tools for Integration
The current list of tools is as follows:
- VBSSM: a Bayesian approach to reconstructing genetic regulatory networks with hidden factors.
- MVBSSM: a randomised variant of the above, for performing sensitivity analysis.
- MemeLab: motif analysis in clusters.
- Wellington: a novel method for the accurate identification of digital genomic footprints from DNase-seq data.
Depending on how long these tools take to convert, other tools may be incorporated into iPlant.
The current deadlines/milestones are as follows:
| Tool | Status, developer | Target date |
|---|---|---|
| Gradient Tool | In Progress, SM | Jul ’15 |
| Apples | In Progress, BG | Nov ’15 |
| CSI | In Progress, SM | Feb ’16 |
The developer responsible for each tool is indicated by their initials: SM = Sam Mason, KP = Krzysztof Polanski, BG = Bo Gao.