
Running Cohort Queries from Workstations

This section walks through setting up your JupyterLab environment and running queries against your approved cohort tables using Amazon Athena.

Dataset Note: The code examples in this guide use mimic_fulldataset as the cohort name — this references the MIMIC demo dataset used for illustration purposes. You must replace the value of COHORT_NAME with the name of your own approved cohort database before running any queries. Your Project Administrator or the Quark TRE interface will provide you with the correct cohort database name.

About the OMOP Common Data Model

Cohort data in Quark TRE is structured according to the OMOP Common Data Model (CDM) — an open community standard designed to standardise the structure and content of observational health data, enabling reliable, reproducible analyses across institutions.

Key OMOP tables you will query include person, condition_occurrence, drug_exposure, visit_occurrence, death, and observation_period. Each table is available as a named database in Amazon Athena, scoped to your approved cohort.
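Every clinical event table links back to person through the person_id key, which makes cross-table analyses straightforward. A minimal sketch of such a query, using standard OMOP CDM column names (illustrative only; you would run it through the helper defined in Section 1):

```python
# Sketch: join person demographics onto condition records via person_id,
# the standard OMOP key shared by all clinical event tables
QUERY = """
SELECT p.person_id,
       p.year_of_birth,
       co.condition_start_date
FROM person p
JOIN condition_occurrence co
  ON p.person_id = co.person_id
"""
```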


Section 1 — Setup and Installation

Note: This section must be run once per workstation session. No user inputs are required.

This block installs the required Python libraries, configures the AWS connection variables from the workstation's environment, verifies the active AWS identity, and defines the Athena helper functions used throughout the notebook.

# ──── Setup (run once) ────────────

# Installing dependencies
!pip install boto3  # AWS SDK for Python — allows interaction with AWS services like Athena
!pip install pandas

import os
import time
import boto3
import pandas as pd
from datetime import datetime

# Setting up AWS configuration variables
# These are pre-populated from the workstation's environment — no manual input required
AWS_REGION   = os.getenv("AWS_REGION")      # Region where the AWS service is running
OMOP_CATALOG = os.getenv("OMOP_CATALOG")    # Athena catalog where OMOP tables exist
WORKGROUP    = os.getenv("OMOP_WORKGROUP")  # Athena workgroup — governs query permissions
                                            # and acts as a governance layer between Athena
                                            # and this notebook

# Verifying the active AWS identity
# Confirms which AWS Account, Role, and User identity this session is using
sts      = boto3.client("sts")
identity = sts.get_caller_identity()

# ──── Athena Helper Functions ────────────────────────────────────────────────

def start_query(athena, query, catalog, database):
    """Submit a query to Athena and return its execution ID."""
    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database, "Catalog": catalog},
        WorkGroup=WORKGROUP,
    )
    return response["QueryExecutionId"]


def wait_for_query(athena, execution_id):
    """Poll Athena until the query reaches a terminal state. Returns the final status string."""
    while True:
        response = athena.get_query_execution(QueryExecutionId=execution_id)
        state    = response["QueryExecution"]["Status"]["State"]
        if state == "FAILED":
            reason = response["QueryExecution"]["Status"].get("StateChangeReason", "unknown")
            raise RuntimeError(f"Query failed: {reason}")
        if state == "CANCELLED":
            raise RuntimeError("Query was cancelled.")
        if state == "SUCCEEDED":
            return state
        time.sleep(1)


def fetch_results(athena, execution_id):
    """Retrieve all paginated results from a completed Athena query and return a DataFrame."""
    response = athena.get_query_results(QueryExecutionId=execution_id)
    columns  = [col["Name"] for col in response["ResultSet"]["ResultSetMetadata"]["ColumnInfo"]]

    def parse_rows(rows):
        return [[field.get("VarCharValue", "") for field in row["Data"]] for row in rows]

    rows = parse_rows(response["ResultSet"]["Rows"][1:])  # Skip header row
    while "NextToken" in response:
        response = athena.get_query_results(
            QueryExecutionId=execution_id,
            NextToken=response["NextToken"],
        )
        rows.extend(parse_rows(response["ResultSet"]["Rows"]))
    return pd.DataFrame(rows, columns=columns)


def run_query(query, database, catalog=OMOP_CATALOG):
    """Execute an Athena query and return results as a pandas DataFrame."""
    athena       = boto3.client("athena", region_name=AWS_REGION)
    execution_id = start_query(athena, query, catalog, database)
    wait_for_query(athena, execution_id)
    df = fetch_results(athena, execution_id)
    print(f"\nReturned {len(df)} rows × {len(df.columns)} columns")
    return df

What this block does:

  • Installs boto3 (AWS SDK for Python) and pandas if not already present.
  • Reads three environment variables that are automatically injected into the workstation session: AWS_REGION, OMOP_CATALOG, and OMOP_WORKGROUP. These do not need to be set manually.
  • Calls sts.get_caller_identity() to confirm which AWS role the session is operating under — useful for troubleshooting access issues.
  • Defines three internal helper functions (start_query, wait_for_query, fetch_results) and one public function (run_query) that you will use throughout the notebook to execute SQL queries and retrieve results as DataFrames.

Setup cell executed successfully
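When troubleshooting access issues, you can print the fields of the get_caller_identity() response directly. A small sketch using the standard STS response keys (Account, Arn); the values below are made up for illustration:

```python
def format_identity(identity: dict) -> str:
    """Summarise an sts.get_caller_identity() response for display."""
    return f"Account: {identity['Account']}\nIdentity ARN: {identity['Arn']}"

# Example with a mock STS response (values are illustrative only)
mock_identity = {
    "UserId": "AROAEXAMPLEID:analyst-session",
    "Account": "123456789012",
    "Arn": "arn:aws:sts::123456789012:assumed-role/AnalystRole/analyst-session",
}
print(format_identity(mock_identity))
```

In the notebook, you would pass the identity variable defined in the setup cell instead of the mock dictionary.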


Section 2 — Accessing Approved Cohort Tables

Note: This section requires two user inputs before running: your Cohort Name and the SQL query you wish to execute.

Each approved cohort in Quark TRE corresponds to a named Athena database. You query it by setting COHORT_NAME to the exact database name shown in your TRE project, then writing your SQL against the standard OMOP table names.

Setting Your Cohort Name

Replace the value of COHORT_NAME with the name of your approved cohort database. You can find this name in your Quark TRE project under the cohort details view.

Cohort details view showing database name

Example Query 1 — Querying the Person Table

The person table contains one row per patient and includes demographic fields such as gender, year of birth, race, and ethnicity.

# ──── Update COHORT_NAME and QUERY below, then run this cell ─────────

# Replace with your approved cohort database name
COHORT_NAME = "your_cohort_database_name"

QUERY = """
SELECT * FROM person
"""

person = run_query(QUERY, database=COHORT_NAME)

person.head(5)

A successful run will print the number of rows and columns returned, and display the first five rows of the person table.

Person table query output
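Once the results are in a DataFrame, standard pandas operations apply. For example, counting patients per birth year (a sketch using a hypothetical extract with standard OMOP person columns):

```python
import pandas as pd

# Hypothetical person-table extract (columns follow the OMOP person table)
person = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "year_of_birth": [1950, 1962, 1950, 1978],
})

# Number of patients per birth year
counts = person.groupby("year_of_birth").size()
print(counts)
```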

Example Query 2 — Querying the Condition Occurrence Table

The condition_occurrence table records the clinical conditions associated with each patient, including the condition concept, start and end dates, and the visit in which the condition was recorded.

# ──── Update COHORT_NAME and QUERY below, then run this cell ─────────

# Replace with your approved cohort database name
COHORT_NAME = "your_cohort_database_name"

QUERY = """
SELECT * FROM condition_occurrence
"""

conditions = run_query(QUERY, database=COHORT_NAME)

conditions.head(5)

Condition occurrence table query output
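Note that fetch_results builds the DataFrame from Athena's VarCharValue fields, so every column arrives as a string regardless of its type in the underlying table. Convert columns explicitly before doing date arithmetic or numeric work (a sketch with illustrative values):

```python
import pandas as pd

# Simulated Athena output: all values come back as strings
conditions = pd.DataFrame({
    "person_id": ["101", "102"],
    "condition_start_date": ["2020-01-15", "2020-03-02"],
})

# Convert to proper dtypes before analysis
conditions["person_id"] = pd.to_numeric(conditions["person_id"])
conditions["condition_start_date"] = pd.to_datetime(conditions["condition_start_date"])
print(conditions.dtypes)
```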

Writing Custom Queries

You can write any valid SQL query against any OMOP table in your cohort database. Common tables available in the OMOP CDM include:

  • person: One row per patient; demographics (gender, year of birth, race, ethnicity)
  • condition_occurrence: Clinical conditions recorded per patient visit
  • drug_exposure: Medication administration records
  • visit_occurrence: Hospital and outpatient visit records
  • death: Date and cause of death, where recorded
  • observation_period: The time span over which a patient's data is available

To query any of these tables, update the QUERY variable with your SQL and run the cell:

COHORT_NAME = "your_cohort_database_name"

QUERY = """
SELECT person_id, drug_exposure_start_date, drug
FROM drug_exposure
WHERE drug LIKE '%metoprolol%'
"""

drug_results = run_query(QUERY, database=COHORT_NAME)
drug_results.head(10)

Tip: All queries run through the run_query() function are routed through the Athena workgroup configured for your TRE project, which enforces the access controls and audit logging defined by your Project Administrator. You cannot query tables outside your approved cohort.
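Because wait_for_query raises RuntimeError when Athena reports a FAILED or CANCELLED state, it can help to wrap calls while iterating on new SQL. A minimal sketch (the try_query wrapper is our own naming, not part of the helpers defined above):

```python
def try_query(run_fn, query, database):
    """Run a query via run_fn (e.g. run_query); print the error and
    return None instead of raising if Athena reports a failure."""
    try:
        return run_fn(query, database=database)
    except RuntimeError as err:
        print(f"Query did not complete: {err}")
        return None

# Usage in the notebook: results = try_query(run_query, QUERY, COHORT_NAME)
```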


What's Next