Exploratory Data Analysis
This section guides you through cohort comparison analysis, also called exploratory data analysis, for RNA sequencing pipelines.
Results of your Cohort Comparison are consolidated into a single analytics interface, where users can visualise their tertiary data analyses.
The Comparison window displays exploratory data analyses in the following tabs.
Overlap
The Overlap tab is the first tab. It shows a Venn diagram of the number of samples shared between the two cohorts. Ensure that the cohorts do not overlap before proceeding to principal component analysis.
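The overlap check can be sketched in a few lines of Python. This is a minimal illustration, not Quark's implementation: it assumes each cohort is a list of sample IDs, and the function name `cohort_overlap` and the IDs are hypothetical.

```python
def cohort_overlap(cohort_a, cohort_b):
    """Return Venn-diagram counts for two collections of sample IDs."""
    a, b = set(cohort_a), set(cohort_b)
    shared = a & b
    return {
        "only_a": len(a - b),        # samples found only in the first cohort
        "only_b": len(b - a),        # samples found only in the second cohort
        "shared": len(shared),       # samples found in both cohorts
        "shared_ids": sorted(shared),
    }


# Hypothetical sample IDs, for illustration only.
test_cohort = ["S01", "S02", "S03", "S04"]
control_cohort = ["S04", "S05", "S06"]

counts = cohort_overlap(test_cohort, control_cohort)
print(counts)

if counts["shared"] > 0:
    print("Resolve shared samples before running PCA:", counts["shared_ids"])
```

A non-zero `shared` count means a sample would contribute to both sides of the comparison, which is why the overlap should be resolved first.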
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality-reduction technique, commonly used in machine learning, that projects high-dimensional datasets onto a small number of dimensions (the principal components), so that patterns hidden in multidimensional data become visible.
PCA plots are a crucial data analysis step. They summarise the variance between samples and highlight outliers, data clusters and other unusual trends that might otherwise be lost in large datasets.
PCA plots are built to account for sample variance: resolving the data into 2-3 principal components (PCs) should generally explain at least 50-70% of the variance. Adding more PCs tends to add noise rather than signal (which, in RNA-Seq analysis, correlates with underestimated differential gene expression, or DGE).
For example, gene expression data can form clusters that are not visible in a cohort comparison spanning tens of thousands of genes. Resolving the data into its principal components concentrates the main sources of sample variance into a few axes, so that ‘true’ differences in expression become visible.
Data clusters are one such pattern. Others, such as outliers and jumps, also become visible in the PCA plot.
On Quark, the number of PCs available ranges from 2 to 6. Ideally, the selected PCs should account for 50-70% of the variance, so that true data trends become more visible. In the example above, 2-3 principal components appear best suited to the data, accounting for 70-80% of sample variance.
The number of PCs can later be adjusted based on the DGE plots (volcano plots), through a visual examination of data clustering and outliers.
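As a rough sketch of how the number of PCs relates to explained variance, the snippet below picks the smallest number of components whose cumulative explained variance reaches a target. It assumes the per-component variances (the eigenvalues of the covariance matrix, sorted from largest to smallest) are already available. The function names, the 70% default and the example values are illustrative assumptions, not Quark settings.

```python
def explained_variance_ratios(component_variances):
    """Fraction of total variance captured by each principal component."""
    total = sum(component_variances)
    return [v / total for v in component_variances]


def choose_n_components(component_variances, target=0.70, min_pcs=2, max_pcs=6):
    """Smallest number of PCs in [min_pcs, max_pcs] reaching the target.

    Expects variances sorted in descending order.
    Returns (number_of_pcs, cumulative_explained_variance). If the target
    is never reached, the maximum allowed number of PCs is returned.
    """
    ratios = explained_variance_ratios(component_variances)
    cumulative = 0.0
    for n, ratio in enumerate(ratios[:max_pcs], start=1):
        cumulative += ratio
        if n >= min_pcs and cumulative >= target:
            return n, cumulative
    return min(max_pcs, len(ratios)), cumulative


# Hypothetical component variances, for illustration only.
variances = [5.0, 2.5, 1.0, 0.7, 0.5, 0.3]

n_pcs, covered = choose_n_components(variances, target=0.70)
print(f"{n_pcs} PCs explain {covered:.0%} of the variance")
```

Raising `target` selects more components, at the cost of including axes that mostly capture noise.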
Differential Gene Expression (DGE)
The crux of RNA-Seq analytical pipelines is DGE and Enrichment Analysis, where researchers draw insights from how expression patterns vary between two cohorts. DGE accomplishes two goals:
- identifies the gene(s) that are differentially expressed and associated with the disease condition;
- provides a measure of statistical significance to validate whether the gene(s) identified are relevant.
On Quark, the comparative distribution of gene expression between two cohorts is visualised as a volcano plot. A volcano plot shows, for each gene, the fold change in expression (the logarithmic ratio of abundance) between two cohorts or samples, alongside its statistical significance.
Significance values (q-values) are calculated from the distribution. The log fold changes are then calculated and plotted against the q-values to visualise the gene expression data as a volcano plot.
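The two quantities behind a volcano plot can be sketched as follows. This is an illustration under stated assumptions, not Quark's method: the fold change is taken as the log2 ratio of mean expression with a pseudocount, p-values are assumed to come from an upstream statistical test, and q-values are derived with the Benjamini-Hochberg procedure (the source does not say which correction Quark uses). Plotting -log10(q) on the y-axis is the common convention and is also an assumption here.

```python
import math
from statistics import mean


def log2_fold_change(test_values, control_values, pseudocount=1.0):
    """log2 ratio of mean expression (test / control).

    The pseudocount avoids taking the log of zero for unexpressed genes.
    """
    return math.log2(
        (mean(test_values) + pseudocount) / (mean(control_values) + pseudocount)
    )


def benjamini_hochberg(p_values):
    """Convert p-values to q-values with the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


def volcano_point(log2fc, q_value):
    """(x, y) position on a volcano plot: x = log2 fold change, y = -log10(q)."""
    return log2fc, -math.log10(q_value)


# Hypothetical expression values and p-values, for illustration only.
lfc = log2_fold_change([7, 7, 7], [1, 1, 1])
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
x, y = volcano_point(lfc, q_values[0])
print(lfc, q_values, (x, y))
```

Genes far from zero on the x-axis and high on the y-axis are those with both a large change in expression and strong statistical support.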
Quark lets you adjust the log2 fold change thresholds dynamically. A table lists the genes that are significantly differentially expressed in the test cohort compared with the control arm.
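The effect of moving the threshold can be illustrated with a small filter. This sketch assumes each result row is a `(gene, log2_fold_change, q_value)` tuple. The function name, the default cut-offs and the gene names are hypothetical and do not reflect Quark's defaults.

```python
def significant_genes(results, log2fc_threshold=1.0, q_threshold=0.05):
    """Keep genes passing both the fold-change and the significance cut-off.

    results: iterable of (gene, log2_fold_change, q_value) tuples.
    Returns rows of (gene, log2_fold_change, q_value, direction),
    sorted by absolute fold change, largest first.
    """
    table = []
    for gene, log2fc, q_value in results:
        if abs(log2fc) >= log2fc_threshold and q_value <= q_threshold:
            direction = "up" if log2fc > 0 else "down"
            table.append((gene, log2fc, q_value, direction))
    return sorted(table, key=lambda row: abs(row[1]), reverse=True)


# Hypothetical results, for illustration only.
results = [
    ("GENE_A", 2.4, 0.001),
    ("GENE_B", -1.6, 0.02),
    ("GENE_C", 0.4, 0.0005),  # significant, but fold change too small
    ("GENE_D", 3.1, 0.30),    # large fold change, but not significant
]

default_table = significant_genes(results)
strict_table = significant_genes(results, log2fc_threshold=2.0)
for row in default_table:
    print(row)
```

Tightening `log2fc_threshold` shrinks the table to the genes with the strongest changes, which mirrors what happens when the threshold is moved on the plot.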
This functional profiling analysis gives researchers early insights by:
- identifying target genes/biomarkers of interest;
- quantifying the magnitude of their fold change; and
- assessing the significance of the differences.
Heatmap
The Heatmap tab lists genes and depicts their expression levels in the individual samples of each cohort. This allows a direct comparison of expression levels across the genes of interest.
Since the number of genes examined runs into the thousands, the search tab lets users filter to the genes with the highest variance (for example, the top 30 or 50 genes with significant differences in heatmap distribution between cohorts).
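Selecting the most variable genes for a heatmap can be sketched as below. The snippet assumes expression data is held as a mapping from gene name to a list of per-sample values. The function name and data are hypothetical, and population variance is used only as one reasonable choice of variance measure.

```python
from statistics import pvariance


def top_variable_genes(expression, n=30):
    """Return the n gene names with the highest variance across samples.

    expression: dict mapping gene name -> list of per-sample expression values.
    """
    ranked = sorted(
        expression,
        key=lambda gene: pvariance(expression[gene]),
        reverse=True,
    )
    return ranked[:n]


# Hypothetical expression values across four samples, for illustration only.
expression = {
    "GENE_A": [1, 1, 1, 1],    # flat across samples
    "GENE_B": [0, 10, 0, 10],  # strongly variable
    "GENE_C": [4, 6, 4, 6],    # mildly variable
}

top_genes = top_variable_genes(expression, n=2)
print(top_genes)
```

Genes with near-constant expression add little contrast to a heatmap, so ranking by variance keeps the rows most likely to separate the cohorts.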
Enrichment Analysis
The ‘Enrichment Analysis’ tab allows researchers to identify the diseases, molecular pathways and Gene Ontology (GO) terms that are enriched in their differentially expressed gene sets.
For example, if a specific signalling pathway associated with inflammation is enriched, or genes related to the transcription of a specific protein associated with tumorigenesis are over-represented, researchers can easily visualise and download the enriched GO terms to draw further insights about their samples.
This tab integrates data from different databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathway database and the Gene Set Enrichment Analysis (GSEA) database.
Genes are clustered according to their ontologies and the enrichment analysis displays significant differences in gene expression classifications between the two cohorts.
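One common way to test whether a pathway or GO term is over-represented in a gene list is a one-sided hypergeometric test. The source does not state which statistic Quark uses, so the sketch below is an assumption offered only to make the idea of enrichment concrete. The function name and the counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_in_term, n_selected, n_overlap):
    """One-sided hypergeometric p-value for over-representation.

    n_background: total number of genes considered
    n_in_term:    background genes annotated to the pathway / GO term
    n_selected:   differentially expressed genes
    n_overlap:    differentially expressed genes annotated to the term

    Returns the probability of seeing n_overlap or more annotated genes
    in a random draw of n_selected genes from the background.
    """
    upper = min(n_in_term, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only: 20 background genes,
# 5 annotated to the term, 4 differentially expressed, 3 of them annotated.
p_value = enrichment_p_value(20, 5, 4, 3)
print(f"enrichment p-value: {p_value:.4f}")
```

In practice, one p-value is computed per term and the set is then corrected for multiple testing before terms are reported as enriched.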