Skip to content

Running Cohort Queries from Workstations

This guide will walk you through setting up your JupyterLab environment and running queries against your approved cohort tables using Amazon Athena.

Dataset Note: The code examples in this guide use mimic_fulldataset as the cohort name for illustration purposes. Replace it with the name of your own approved cohort database before running any queries.

On Quark, researchers can query OMOP-standardised tables from workstations. The following section provides a brief background about the OMOP Common Data Model.

About the OMOP Common Data Model

Cohort data in Quark TRE is structured according to the OMOP Common Data Model (CDM) — an open community standard designed to standardise the structure and content of observational health data, enabling reliable, reproducible analyses across institutions.

Key OMOP tables you will query include person, condition_occurrence, drug_exposure, visit_occurrence, death, and observation_period. Each table is available as a named database in Amazon Athena, scoped to your approved cohort.

Dataset Schema

The entity-relationship diagram below shows the full table schema used in this cohort dataset, including all primary and foreign key relationships between OMOP tables. Use this as a reference when writing JOINs or exploring how tables like person, visit_occurrence, and condition_occurrence relate to one another.

Note: The schema diagram is best viewed by opening it in a new browser tab (right-click → Open image in new tab), where you can zoom and pan freely.

OMOP Table Schema

Standardised Vocabularies

The OMOP CDM uses a set of standardised vocabularies to represent clinical concepts consistently across domains. Each domain (Condition, Drug, Procedure, Measurement, etc.) maps source codes to a preferred standard vocabulary — for example, SNOMED for conditions, RxNorm for drugs, and LOINC for measurements — while retaining the original source codes for traceability. The table below summarises which vocabularies are used for standard, source, and classification concepts in each domain.

Domain Standard Concepts Source Concepts Classification Concepts
Condition SNOMED, ICDO3 SNOMED Veterinary MedDRA
Procedure SNOMED, CPT4, HCPCS, ICD10PCS, ICD9Proc, OPCS4 SNOMED Veterinary, HemOnc, NAACCR None at this point
Measurement SNOMED, LOINC SNOMED Veterinary, NAACCR, CPT4, HCPCS, OPCS4, PPI None at this point
Drug RxNorm, RxNorm Extension, CVX HCPCS, CPT4, HemOnc, NAACCR ATC
Device SNOMED Others, currently not normalized None at this point
Observation SNOMED Others None at this point
Visit CMS Place of Service, ABMT, NUCC SNOMED, HCPCS, CPT4, UB04 None at this point

For a deeper understanding of how vocabularies, concepts, and mappings work together, refer to the Standardised Vocabularies chapter in The Book of OHDSI. For detailed field-level documentation of each OMOP table — including which fields are required, their data types, and what each column represents — see the CDM v5.4 specification.


Running Cohort Queries from Workstations

Pre-requisites

  • For running cohort queries, ensure you have access to an approved cohort (Login to Quark TRE → Navigate to Datasets → Click the Cohorts tab → the Cohort dashboard should show your Approved cohorts). If you do not have access to an approved cohort, follow this guide on Requesting Dataset Access.
  • Ensure you have created, connected, launched and logged in to your Ubuntu or Windows workstation, as shown in the Workstation Overview.

Section 1 — Setup and Installation

Note: This section must be run once per workstation session. No user inputs are required.

This block installs the required Python libraries, configures the AWS connection variables from the workstation's environment, verifies the active AWS identity, and defines the Athena helper functions used throughout the notebook.

Installing Dependencies

Install boto3 (AWS SDK for Python) and pandas. The install command differs by operating system:

Ubuntu / macOS:

!pip install boto3
!pip install pandas

Windows:

import sys
!{sys.executable} -m pip install boto3 pandas

AWS Configuration and Athena Helpers

The following code is identical across all platforms. Paste it into a cell after the install block and run it once.

import os
import time
import boto3
import pandas as pd
from datetime import datetime

# Setting up AWS configuration variables
# These are pre-populated from the workstation's environment — no manual input required
AWS_REGION   = os.getenv("AWS_REGION")      # Region where the AWS service is running
OMOP_CATALOG = os.getenv("OMOP_CATALOG")    # Athena catalog where OMOP tables exist
WORKGROUP    = os.getenv("OMOP_WORKGROUP")  # Athena workgroup — governs query permissions
                                            # and acts as a governance layer between Athena
                                            # and this notebook

# Verifying the active AWS identity
# Confirms which AWS Account, Role, and User identity this session is using
sts      = boto3.client("sts")
identity = sts.get_caller_identity()

# ──── Athena Helper Functions ────────────────────────────────────────────────

def start_query(athena, query, catalog, database):
    """Submit a query to Athena and return its execution ID."""
    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database, "Catalog": catalog},
        WorkGroup=WORKGROUP,
    )
    return response["QueryExecutionId"]


def wait_for_query(athena, execution_id):
    """Poll Athena until the query reaches a terminal state. Returns the final status string."""
    while True:
        response = athena.get_query_execution(QueryExecutionId=execution_id)
        state    = response["QueryExecution"]["Status"]["State"]
        if state == "FAILED":
            reason = response["QueryExecution"]["Status"].get("StateChangeReason", "unknown")
            raise RuntimeError(f"Query failed: {reason}")
        if state == "CANCELLED":
            raise RuntimeError("Query was cancelled.")
        if state == "SUCCEEDED":
            return state
        time.sleep(1)


def fetch_results(athena, execution_id):
    """Retrieve all paginated results from a completed Athena query and return a DataFrame."""
    response = athena.get_query_results(QueryExecutionId=execution_id)
    columns  = [col["Name"] for col in response["ResultSet"]["ResultSetMetadata"]["ColumnInfo"]]

    def parse_rows(rows):
        return [[field.get("VarCharValue", "") for field in row["Data"]] for row in rows]

    rows = parse_rows(response["ResultSet"]["Rows"][1:])  # Skip header row
    while "NextToken" in response:
        response = athena.get_query_results(
            QueryExecutionId=execution_id,
            NextToken=response["NextToken"],
        )
        rows.extend(parse_rows(response["ResultSet"]["Rows"]))
    return pd.DataFrame(rows, columns=columns)


def run_query(query, database, catalog=OMOP_CATALOG):
    """Execute an Athena query and return results as a pandas DataFrame."""
    athena       = boto3.client("athena", region_name=AWS_REGION)
    execution_id = start_query(athena, query, catalog, database)
    wait_for_query(athena, execution_id)
    df = fetch_results(athena, execution_id)
    print(f"\nReturned {len(df)} rows × {len(df.columns)} columns")
    return df

What this block does:

  • Reads three environment variables that are automatically injected into the workstation session: AWS_REGION, OMOP_CATALOG, and OMOP_WORKGROUP. These do not need to be set manually.
  • Calls sts.get_caller_identity() to confirm which AWS role the session is operating under — useful for troubleshooting access issues.
  • Defines three internal helper functions (start_query, wait_for_query, fetch_results) and one public function (run_query) that you will use throughout the notebook to execute SQL queries and retrieve results as DataFrames.

Section 2 — Querying Your Cohort

Note: Querying your cohort requires two user inputs before running: your Cohort Name and the SQL query you wish to execute.

Run queries by setting COHORT_NAME in the following code-chunk to the exact dataset name of your approved cohort (e.g., mimic_fulldataset).

Example Query 1 — Querying the Person Table

The person table contains one row per patient and includes demographic fields such as gender, year of birth, race, and ethnicity.

# ──── Update COHORT_NAME and QUERY below, then run this cell ─────────

# Replace with your approved cohort database name
COHORT_NAME = "mimic_fulldataset"

QUERY = """
SELECT * FROM person
"""

person = run_query(QUERY, database=COHORT_NAME)

person.head(5)

Person-query-output

Example Query 2 — Querying the Condition Occurrence Table

The condition_occurrence table records all clinical conditions associated with each patient, including condition name, start and end dates, and the source visit.

# ──── Update COHORT_NAME and QUERY below, then run this cell ─────────

# Replace with your approved cohort database name
COHORT_NAME = "mimic_fulldataset"

QUERY = """
SELECT * FROM condition_occurrence
"""

conditions = run_query(QUERY, database=COHORT_NAME)

conditions.head(5)

Condition-query-output

Writing Custom Queries

You can write any valid SQL query against any OMOP table in your cohort database. Common tables available in the OMOP CDM include:

Table Description
person One row per patient; demographics (gender, year of birth, race, ethnicity)
condition_occurrence Clinical conditions recorded per patient visit
drug_exposure Medication administration records
visit_occurrence Hospital and outpatient visit records
death Date and cause of death if present in patient records
observation_period The time span over which a patient's data is available

To query any of these tables, update the QUERY variable with your SQL and run the cell:

COHORT_NAME = "mimic_fulldataset"

QUERY = """
SELECT person_id, drug_exposure_start_date, drug
FROM drug_exposure
WHERE drug LIKE '%metoprolol%'
"""

drug_results = run_query(QUERY, database=COHORT_NAME)
drug_results.head(10)

Custom-query-output

Optional - Visualise Your Results

Install Visualisation Libraries

The install command differs by operating system:

Ubuntu / macOS:

!pip install matplotlib

Windows:

!{sys.executable} -m pip install matplotlib

Plotting and Saving Results

The plotting code is the same across platforms. The only difference is the file save path.

import pandas as pd
import matplotlib.pyplot as plt

# Query the person table
QUERY_person = """ SELECT * FROM person """
person = run_query(QUERY_person, database=COHORT_NAME)

# Plot gender distribution
person.gender.value_counts().plot(kind="bar")
plt.xticks(rotation=45, fontsize=8)

When saving the plot to disk, use the path format for your operating system:

Ubuntu / macOS:

plt.savefig("/home/ubuntu/Desktop/gender_profile.png", dpi=300)

Windows:

plt.savefig("C:\\Users\\wsadmin\\Desktop\\gender_profile.png", dpi=300)

What's Next

  • Download your plots and results — See Downloading Data Results for the file download workflow from the Results tab.
  • Run further exploratory data analysis — See a tutorial on Cohort Analysis via Ubuntu Workstations to learn how to perform cohort profiling, condition analysis, drug exposure comparison, and survival analysis on top of the data you have queried here.