Exploratory Data Analysis
This section guides you through cohort comparison analysis, or exploratory data analysis, for RNA sequencing pipelines.
Results of your Cohort Comparison are consolidated into a single analytics interface, where users can visualise their tertiary data analyses.
The Comparison window displays exploratory data analyses in the following tabs.
Overlap
The first tab, Overlap, is a Venn diagram that shows the number of samples shared between the two cohorts. Ensure that there are no overlaps between cohorts before proceeding to principal component analysis.
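The overlap check can be pictured as a set intersection over sample identifiers. The sketch below is illustrative only: the function name and sample IDs are hypothetical, and it assumes that an overlap means a sample ID present in both cohorts.

```python
def cohort_overlap(cohort_a, cohort_b):
    """Return the counts a two-set Venn diagram would display."""
    a, b = set(cohort_a), set(cohort_b)
    shared = a & b
    return {
        "only_a": len(a - b),         # samples found only in the first cohort
        "only_b": len(b - a),         # samples found only in the second cohort
        "shared": len(shared),        # samples present in both cohorts
        "shared_ids": sorted(shared),
    }


# Hypothetical sample IDs, for illustration only.
test_cohort = ["S01", "S02", "S03", "S04"]
control_cohort = ["S04", "S05", "S06"]

counts = cohort_overlap(test_cohort, control_cohort)
print(counts)
```

A non-zero `shared` count indicates that at least one sample sits in both cohorts and should be resolved before moving on to PCA.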
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality-reduction technique, widely used in machine learning, that projects high-dimensional datasets onto a smaller number of dimensions (or principal components), so that patterns lost in multidimensional datasets are made visible.
PCA plots are a crucial data analysis step that summarises the variance between samples, highlighting outliers, data clusters and other unusual data trends that might otherwise be lost in large datasets.
PCA plots are made to account for sample variance, where resolving data into 2-3 principal components should generally explain more than 50-70% of the data variance. Increasing the number of PCs increases the noise in identifying variance (which, in RNA-Seq analysis, can correlate with underestimated DGE).
For example, gene expression data can form clusters that will not be visible in a cohort comparison that includes tens of thousands of genes. Resolving the data into its principal components concentrates the dominant sources of sample variance into a few axes, so that 'true' differential expression becomes visible.
Besides data clusters, other unusual data patterns (such as outliers and jumps) are also made visible in the PCA plot.
On Quark, the number of PCs available ranges from 2 to 6. Ideally, the selected PCs should account for 50-70% of the variance, so that true data trends become more visible. In the example above, 2-3 principal components appear best suited for this data, accounting for 70-80% of sample variance.
The number of PCs can later be adjusted based on the DGE plots (volcano plots), through a visual examination of data clustering and outliers.
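As a rough illustration of choosing how many PCs to keep, the sketch below computes the cumulative share of variance explained by the leading components and returns the smallest number that reaches a target. The function names and the per-component variances are hypothetical; in practice these values would come from whichever PCA implementation produced the plot.

```python
def explained_variance_ratios(component_variances):
    """Share of the total variance carried by each principal component."""
    total = sum(component_variances)
    return [v / total for v in component_variances]


def components_needed(component_variances, target=0.7):
    """Smallest number of leading PCs whose cumulative share reaches target."""
    cumulative = 0.0
    ratios = explained_variance_ratios(component_variances)
    for k, ratio in enumerate(ratios, start=1):
        cumulative += ratio
        if cumulative >= target:
            return k, cumulative
    # Target not reachable: report all components and their total share.
    return len(component_variances), cumulative


# Hypothetical variances for six PCs, sorted from largest to smallest.
variances = [40.0, 22.0, 14.0, 10.0, 8.0, 6.0]

k, share = components_needed(variances, target=0.7)
print(f"{k} PCs explain {share:.0%} of the variance")
```

With these made-up numbers, two PCs fall short of a 70% target and a third is needed, which mirrors the kind of judgement described above.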
Differential Gene Expression (DGE)
The crux of RNA-Seq analytical pipelines is DGE and Enrichment Analysis, where researchers can draw insights from how expression patterns vary between two different cohorts. DGE accomplishes two goals:
- identifies the gene(s) that are differentially expressed and associated with the disease condition;
- provides a measure of statistical significance to assess whether each identified gene is relevant.
On Quark, the comparative distribution of gene expression between two cohorts is visualised as a volcano plot. A volcano plot is a scatter plot that shows, for each gene, the expression fold change (the logarithmic ratio of abundance) between two cohorts/samples against its statistical significance.
Significance values, or q-values (p-values adjusted for multiple testing), are calculated based on the distribution. The log fold changes are calculated, then plotted against the q-values (conventionally on a -log10 scale) to visualise gene expression data as a volcano plot.
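The two quantities behind each point on a volcano plot can be sketched as follows. This text does not state which multiple-testing correction Quark applies; the Benjamini-Hochberg procedure used here is one common choice and is an assumption, as are the function names, the pseudocount, and the example numbers.

```python
import math


def log2_fold_change(mean_test, mean_control, pseudocount=1.0):
    """log2 ratio of mean expression; the pseudocount avoids division by zero."""
    return math.log2((mean_test + pseudocount) / (mean_control + pseudocount))


def benjamini_hochberg(p_values):
    """Convert raw p-values to Benjamini-Hochberg q-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical per-gene mean expression and raw p-values.
mean_test = [7.0, 1.0, 3.0, 3.5]
mean_control = [1.0, 7.0, 3.0, 3.0]
p_values = [0.01, 0.04, 0.03, 0.20]

q_values = benjamini_hochberg(p_values)
# Each volcano point is (log2 fold change, -log10 q-value).
points = [
    (log2_fold_change(t, c), -math.log10(q))
    for t, c, q in zip(mean_test, mean_control, q_values)
]
print(points)
```

Genes far from zero on the x-axis and high on the y-axis are the ones that combine a large fold change with strong significance.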
Quark lets users adjust the log2 fold change thresholds dynamically. A table lists the genes that are significantly differentially expressed in the test cohort, compared against the control arm.
This functional profiling analysis gives researchers early insights by:
- identifying target genes/biomarkers of interest;
- quantifying the magnitude of their fold change; and
- assessing the significance of the differences.
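The effect of moving the log2 fold change threshold can be illustrated with a small classifier. This is a sketch under stated assumptions: the function name, the default cut-offs (an absolute log2 fold change of 1 and a q-value of 0.05) and the gene names are hypothetical and are not taken from Quark.

```python
def classify_gene(log2_fc, q_value, lfc_threshold=1.0, q_threshold=0.05):
    """Label a gene by fold-change direction if it passes both cut-offs."""
    if q_value >= q_threshold or abs(log2_fc) < lfc_threshold:
        return "not significant"
    return "upregulated" if log2_fc > 0 else "downregulated"


# Hypothetical (log2 fold change, q-value) pairs.
genes = {
    "GENE_A": (2.3, 0.001),
    "GENE_B": (-1.8, 0.01),
    "GENE_C": (0.4, 0.0005),  # significant, but the fold change is small
    "GENE_D": (3.0, 0.2),     # large fold change, but not significant
}

labels = {name: classify_gene(fc, q) for name, (fc, q) in genes.items()}
stricter = {
    name: classify_gene(fc, q, lfc_threshold=2.0) for name, (fc, q) in genes.items()
}
print(labels)
print(stricter)
```

Raising the threshold from 1 to 2 drops GENE_B from the significant set in this example, which is the kind of change the gene table would reflect.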
Heatmap
The Heatmap tab lists genes and depicts their expression levels in the individual samples of each cohort. This allows a direct comparison of expression levels between the genes of interest.
Since the number of genes examined runs into the thousands, the search tab allows users to filter to the top genes with the highest variance (for example, the top 30 or 50 genes with significant differences in heatmap distribution between cohorts).
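Filtering to the most variable genes amounts to ranking genes by the variance of their expression across samples. The sketch below assumes sample variance as the ranking measure; the function name and expression values are hypothetical.

```python
from statistics import variance


def top_variable_genes(expression, n=30):
    """Return the n genes whose expression varies most across samples.

    expression maps each gene name to a list of per-sample expression values
    (at least two samples per gene).
    """
    ranked = sorted(expression, key=lambda g: variance(expression[g]), reverse=True)
    return ranked[:n]


# Hypothetical expression values across four samples.
expression = {
    "GENE_A": [1.0, 1.0, 1.0, 1.0],  # flat across samples
    "GENE_B": [1.0, 5.0, 1.0, 5.0],  # strongly variable
    "GENE_C": [2.0, 3.0, 2.0, 3.0],  # mildly variable
}

selected = top_variable_genes(expression, n=2)
print(selected)
```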
GenAI-Powered Enrichment Analysis
The Enrichment Analysis tab allows researchers to identify the top diseases, molecular pathways and Gene Ontology (GO) terms that are enriched in their differentially expressed gene sets.
The Enrichment Analysis tab integrates data from different databases, including GO, Reactome, and DisGeNET.
Select a database from the dropdown menu to view the top enriched terms for each database (e.g. upregulated or downregulated terms for DisGeNET).
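To give a sense of what "enriched" means, the sketch below computes a one-sided hypergeometric p-value, a common test for over-representation of a term in a gene list. This text does not describe the statistical method Quark uses, so this is an assumption for illustration; the function name and the counts are hypothetical.

```python
from math import comb


def enrichment_p_value(universe_size, term_size, gene_list_size, overlap):
    """Probability of seeing at least `overlap` term genes in the gene list by chance.

    universe_size:  number of genes considered in total
    term_size:      genes annotated to the term (e.g. a pathway)
    gene_list_size: differentially expressed genes submitted
    overlap:        submitted genes that are annotated to the term
    """
    upper = min(term_size, gene_list_size)
    total = comb(universe_size, gene_list_size)
    tail = sum(
        comb(term_size, k) * comb(universe_size - term_size, gene_list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 genes overall, 5 in the term, 4 submitted, 3 overlapping.
p = enrichment_p_value(universe_size=20, term_size=5, gene_list_size=4, overlap=3)
print(f"p = {p:.4f}")
```

A small p-value suggests the term appears in the gene list more often than chance alone would explain; across many terms, such p-values would normally be corrected for multiple testing.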
Click AI Insights in the top-right corner to instantly access a GenAI-powered summary of all research findings for the enrichment analysis. Select Download Insights to download a text summary, which comprises the following sections:
- Introduction
- Pathways and Function Enrichment
- Gene Sets with Strongest Significance
- Key Genes
- Disease Enrichment
- Key Findings
- Conclusion
GenAI-powered insights speed up the review of results by instantly highlighting the key findings from a cohort comparison.
GenAI-Powered Network Graphs
Network Graphs, or Knowledge Graphs, are graph-structured data stores that represent data as an interconnected network of nodes (entities) joined by edges (relationships).
On Quark, knowledge graphs link genes, proteins, drugs, diseases, pathways and more in a single interconnected network. By transforming multi-dimensional biological data into a cluster representation of nodes, knowledge graphs provide a comprehensive context to interpret potential biomarkers for drug discovery.
On the Network Graph tab, researchers can select between different knowledge graph layouts to visualise the top 5, 10, or 20 genes with the highest log2 fold changes, and their disease, pathway, and drug associations.
The right pane allows researchers to select between Upregulated and Downregulated gene sets. In this pane, users can further select different layout types from a drop-down menu (Cose, Breadthfirst, Concentric, Circle, Random).
Users can also select their genes of interest from the drop-down menu under Genes. Additionally, users can choose which Pathways, Molecular Functions, Drugs, and Cellular Components they would like to visualise in their knowledge graphs from individual drop-down menus.
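The selections described above (direction, number of top genes, and association types) can be thought of as filters over a list of graph edges. The sketch below is a simplified picture, not Quark's data model: the function names, the edge tuples and all gene, pathway, drug and disease names are hypothetical.

```python
def top_genes(log2_fc, n=5, direction="up"):
    """Pick the n genes with the largest log2 fold change in one direction."""
    if direction == "up":
        candidates = {g: v for g, v in log2_fc.items() if v > 0}
    elif direction == "down":
        candidates = {g: v for g, v in log2_fc.items() if v < 0}
    else:
        raise ValueError("direction must be 'up' or 'down'")
    return sorted(candidates, key=lambda g: abs(candidates[g]), reverse=True)[:n]


def subgraph_edges(edges, genes, node_types):
    """Keep edges that start at a selected gene and end at a selected node type."""
    selected = set(genes)
    return [
        (gene, target, kind)
        for gene, target, kind in edges
        if gene in selected and kind in node_types
    ]


# Hypothetical fold changes and (gene, associated node, node type) edges.
log2_fc = {"GENE_A": 3.1, "GENE_B": -2.4, "GENE_C": 1.2, "GENE_D": 2.0}
edges = [
    ("GENE_A", "Pathway X", "pathway"),
    ("GENE_A", "Drug Y", "drug"),
    ("GENE_C", "Pathway X", "pathway"),
    ("GENE_D", "Disease Z", "disease"),
    ("GENE_B", "Drug Y", "drug"),
]

upregulated = top_genes(log2_fc, n=2, direction="up")
shown = subgraph_edges(edges, upregulated, {"pathway", "drug"})
print(upregulated)
print(shown)
```

Layout choices such as Cose or Concentric only change how the resulting nodes are positioned on screen, not which nodes and edges are included.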
Additional Knowledge Graph Features
The following Network Graph features are available on Quark.
- Bring Proprietary Data: Researchers can augment publicly available data with their own continuously updated proprietary data to maximise actionable insights.
- Generate AI-Powered Insights: Clicking the AI Insights tab instantly summarizes and contextualizes research findings.
- Query Network Graphs using Natural Language: Based on a specific area of research, scientists can query their knowledge graphs using natural language to streamline their findings. Click the Quark icon on the bottom right corner of the page to access an AI chatbot, and input your queries to get tailored responses for generating systematic hypotheses.
- Create New Cohorts by Searching for Genomic and Clinical Metadata: Scientists can now query their knowledge graphs on Quark to rapidly create cohorts for exploratory analysis from Quark's integrated chatbot. Using genomic and clinical metadata, researchers can identify patient samples that meet their inclusion criteria for addressing specific research questions.