Running Data Analysis from Workstations

Overview

This guide walks through performing exploratory clinical data analysis inside a Quark TRE workstation using JupyterLab. It follows on from Running Cohort Queries from Workstations, which covers how to set up your environment and query approved cohort tables.

The analysis workflow demonstrated here is built around a clinical use case comparing two AFib treatment cohorts — patients receiving Metoprolol only versus patients receiving Metoprolol + Amiodarone — using data structured in the OMOP Common Data Model. The workflow covers cohort identification, demographic profiling, condition analysis, drug exposure comparison, length of stay, and Kaplan-Meier survival analysis.

Dataset Note: The code examples in this guide use test_12march_125pm_mimic_fulldataset as the cohort name — this references the MIMIC demo dataset used for illustration purposes only. The MIMIC demo dataset contains simulated electronic health records for 100 ICU patients, and any results or visualisations shown here should not be interpreted as clinically valid findings. Replace the value of COHORT_NAME with your own approved cohort database name before running any code.

Prerequisites

Before beginning this guide, ensure the following:

Your workstation is in a Running state and you are connected via the browser. See Running Cohort Queries from Workstations for setup instructions.
You have already run Section 1 (Setup & Installation) from the cohort queries guide in your current JupyterLab session. The run_query function and AWS environment variables must be active.
COHORT_NAME is set to your approved cohort database name.

OMOP Table Reference

The analyses in this guide draw on six OMOP tables. All are queried using run_query() against your approved cohort database.

Table	Description
`person`	One row per patient; demographics (gender, year of birth, race, ethnicity)
`condition_occurrence`	Clinical conditions recorded per patient visit
`drug_exposure`	Medication administration records
`visit_occurrence`	Hospital and outpatient visit records
`death`	Date and cause of death where recorded
`observation_period`	The time span over which a patient's data is available

Section 3.2 — Analysis Setup

Note: Run this block once before any other analysis cell. No user inputs are required.

This block installs the visualisation and statistics libraries used throughout the analysis, imports all required modules, and sets the default visual style.

# Installing packages
!pip install lifelines        # Survival analysis (Kaplan-Meier, log-rank test)
!pip install seaborn          # Statistical data visualisation
!pip install matplotlib-venn  # Venn diagram plotting

# Importing libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
import seaborn as sns
from matplotlib_venn import venn2
from scipy.stats import mannwhitneyu
from lifelines.statistics import logrank_test

# Set visual style
sns.set_theme(style="whitegrid")

[SCREENSHOT: JupyterLab showing the analysis setup cell after successful execution with no errors]

Section 3.3 — Reading the Required OMOP Tables

This block loads all six OMOP tables into pandas DataFrames in a single pass. It reuses the COHORT_NAME variable and the run_query() function defined in Section 1.

# Reading the clinical tables

QUERY_person = """ SELECT * FROM person """
person = run_query(QUERY_person, database=COHORT_NAME)

QUERY_condition_occ = """ SELECT * FROM condition_occurrence """
condition_occ = run_query(QUERY_condition_occ, database=COHORT_NAME)

QUERY_visit_occ = """ SELECT * FROM visit_occurrence """
visit_occurence = run_query(QUERY_visit_occ, database=COHORT_NAME)

QUERY_drug_exp = """ SELECT * FROM drug_exposure """
drug_exp = run_query(QUERY_drug_exp, database=COHORT_NAME)

QUERY_death = """ SELECT * FROM death """
death = run_query(QUERY_death, database=COHORT_NAME)

QUERY_obs_period = """ SELECT * FROM observation_period """
obs_period = run_query(QUERY_obs_period, database=COHORT_NAME)

Each call will print the number of rows and columns returned. Confirm all six tables load successfully before continuing.

[SCREENSHOT: JupyterLab output showing all six tables loaded with their respective row and column counts]

Section 3.4 — Identifying AFib Patients Receiving Metoprolol and Amiodarone

This section filters the data to identify patients with atrial fibrillation (AFib) and determines which of those patients received Metoprolol, Amiodarone, or both.

Group A: AFib patients who received Metoprolol

Group B: AFib patients who received Amiodarone

Step 1 — Filter AFib Patients and Define Groups

# Find patients with AFib (ECG: atrial fibrillation)
afib_patients = condition_occ[
    condition_occ['condition'] == "ECG: atrial fibrillation"
]['person_id'].unique()

# Define Group A — AFib patients who received Metoprolol
group_a_ids = drug_exp[
    (drug_exp['drug'] == "metoprolol tartrate 25 MG Oral Tablet") &
    (drug_exp['person_id'].isin(afib_patients))
]['person_id'].unique()

# Define Group B — AFib patients who received Amiodarone
group_b_ids = drug_exp[
    (drug_exp['drug'] == "3 ML amiodarone hydrochloride 50 MG/ML Injection") &
    (drug_exp['person_id'].isin(afib_patients))
]['person_id'].unique()

Step 2 — Visualise the Overlap Between Groups

# Convert to sets for set operations
set_a = set(group_a_ids)
set_b = set(group_b_ids)

# Calculate overlap counts
only_a  = len(set_a - set_b)          # Metoprolol only
only_b  = len(set_b - set_a)          # Amiodarone only
overlap = len(set_a & set_b)           # Both drugs

# Plot Venn diagram
plt.figure()
v = venn2(
    subsets=(only_a, only_b, overlap),
    alpha=0.7,
    set_labels=('AFib + Metoprolol', 'AFib + Amiodarone')
)

# Set region colours
v.get_patch_by_id('10').set_color('#1f77b4')  # Metoprolol only
v.get_patch_by_id('01').set_color('#FF7F7F')  # Amiodarone only
v.get_patch_by_id('11').set_color('#9C6ADE')  # Both

plt.title("Overlap of Patients Receiving Metoprolol and Amiodarone in AFib")

# Save and display
plt.savefig("/output/3.4_Overlap_of_Afib_Treatment_groups.png", dpi=300)
plt.show()

[SCREENSHOT: JupyterLab output showing the Venn diagram of AFib treatment group overlap]

Section 3.5 — Cohort Validation: Patient Overlap Between the Two Cohorts

This section defines the two final analysis cohorts and validates that they are mutually exclusive (no patient appears in both).

Cohort A: AFib + Metoprolol only (patients who received Metoprolol but not Amiodarone)
Cohort B: AFib + Metoprolol + Amiodarone (patients who received both drugs)

# Cohort A: patients who received Metoprolol but NOT Amiodarone
cohort_a_ids = set(group_a_ids) - set(group_b_ids)

# Cohort B: patients who received both Metoprolol AND Amiodarone
cohort_b_ids = set(group_a_ids).intersection(set(group_b_ids))

# Validate: plot Venn diagram to confirm no overlap between final cohorts
set_metoprolol = set(cohort_a_ids)
set_amiodarone = set(cohort_b_ids)

plt.figure()
venn2(
    [set_metoprolol, set_amiodarone],
    set_colors=('#1f77b4', '#d62728'),
    set_labels=('Cohort A: Metoprolol', 'Cohort B: Metoprolol + Amiodarone')
)
plt.title("Venn Diagram: Check overlap between Cohort A and Cohort B")

# Save and display
plt.savefig("/output/3.5_Patient_Overlap_for_two_cohorts.png", dpi=300)
plt.show()

# Print cohort sizes
print("Metoprolol only:", len(set_metoprolol - set_amiodarone))
print("Amiodarone only:", len(set_amiodarone - set_metoprolol))
print("Both drugs:", len(set_metoprolol.intersection(set_amiodarone)))

[SCREENSHOT: JupyterLab output showing the cohort validation Venn diagram confirming no overlap, with printed cohort sizes]

Section 3.6 — Cohort Profiling

This section characterises and compares the demographic profiles of Cohort A and Cohort B across three dimensions: gender, race, and ethnicity.

Checkpoint 1 — Build the Combined Cohort DataFrame

Run this cell first. It creates labelled DataFrames for each cohort and concatenates them into a single combined_df used by all subsequent profiling steps.

# Create labelled cohort DataFrames
cohort_a = person[person['person_id'].isin(cohort_a_ids)].copy()
cohort_b = person[person['person_id'].isin(cohort_b_ids)].copy()

cohort_a['Cohort'] = 'Metoprolol only'
cohort_b['Cohort'] = 'Metoprolol + Amiodarone'

# Combine into a single DataFrame
combined_df = pd.concat([cohort_a, cohort_b], ignore_index=True)

combined_df.head(2)

[SCREENSHOT: JupyterLab showing the head of combined_df with Cohort column visible]

Section 3.6.1 — Gender Distribution

# Count patients by gender and cohort
gender_counts = combined_df.groupby(
    ['Cohort', 'gender'], observed=True
).size().reset_index(name='Count')

# Define cohort display order and colour palette
cohort_order = ["Metoprolol only", "Metoprolol + Amiodarone"]

combined_df["Cohort"] = pd.Categorical(
    combined_df["Cohort"], categories=cohort_order, ordered=True
)

cohort_palette = {
    "Metoprolol only": "#1f77b4",         # blue
    "Metoprolol + Amiodarone": "#d62728"  # red
}

# Calculate percentage within each cohort
gender_counts["Percent"] = (
    gender_counts.groupby("Cohort", observed=True)["Count"]
    .transform(lambda x: x / x.sum() * 100)
)

sns.set_style("whitegrid")

# Create side-by-side plots (counts and percentages)
fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Plot 1: Gender Counts
ax1 = sns.barplot(
    data=gender_counts, x="gender", y="Count",
    hue="Cohort", hue_order=cohort_order, palette=cohort_palette, ax=axes[0]
)
ax1.set_title("Gender Distribution by Cohort (Counts)")
ax1.set_xlabel("Gender")
ax1.set_ylabel("Number of Patients")
for container in ax1.containers:
    ax1.bar_label(container, padding=3, fontsize=10, fontweight="bold")

# Plot 2: Gender Percentage
ax2 = sns.barplot(
    data=gender_counts, x="gender", y="Percent",
    hue="Cohort", hue_order=cohort_order, palette=cohort_palette, ax=axes[1]
)
ax2.set_title("Gender Distribution by Cohort (%)")
ax2.set_xlabel("Gender")
ax2.set_ylabel("Percentage")
for container in ax2.containers:
    ax2.bar_label(container, fmt="%.1f%%", padding=3, fontsize=10, fontweight="bold")

# Shared legend
ax1.legend_.remove()
ax2.legend_.remove()
handles, labels = ax1.get_legend_handles_labels()
fig.legend(handles, labels, title="Cohort", loc="center left", bbox_to_anchor=(1.02, 0.5))

plt.tight_layout()
plt.savefig("/output/3.6_1_Cohort_Profiling_Gender.png", dpi=300)
plt.show()

[SCREENSHOT: JupyterLab output showing side-by-side bar plots for gender distribution by cohort]

Statistical test — Fisher's exact test for gender:

from scipy.stats import fisher_exact

contingency = pd.crosstab(combined_df['Cohort'], combined_df['gender'])
print(contingency)

odds_ratio, p_value = fisher_exact(contingency)
print("Odds Ratio:", odds_ratio)
print("P-value:", p_value)

Section 3.6.2 — Race Distribution

# Count patients by race and cohort
race_counts = combined_df.groupby(
    ['Cohort', 'race'], observed=True
).size().reset_index(name='Count')

# Calculate percentage within each cohort
race_counts["Percent"] = (
    race_counts.groupby("Cohort", observed=True)["Count"]
    .transform(lambda x: x / x.sum() * 100)
)

sns.set_style("whitegrid")
fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Plot 1: Race Counts
ax1 = sns.barplot(
    data=race_counts, x="race", y="Count",
    hue="Cohort", hue_order=cohort_order, palette=cohort_palette, ax=axes[0]
)
ax1.set_title("Race Distribution by Cohort (Counts)")
ax1.set_xlabel("Race")
ax1.set_ylabel("Number of Patients")
for container in ax1.containers:
    ax1.bar_label(container, padding=3, fontsize=10, fontweight="bold")

# Plot 2: Race Percentage
ax2 = sns.barplot(
    data=race_counts, x="race", y="Percent",
    hue="Cohort", hue_order=cohort_order, palette=cohort_palette, ax=axes[1]
)
ax2.set_title("Race Distribution by Cohort (%)")
ax2.set_xlabel("Race")
ax2.set_ylabel("Percentage")
for container in ax2.containers:
    ax2.bar_label(container, fmt="%.1f%%", padding=3, fontsize=10, fontweight="bold")

# Shared legend
ax1.legend_.remove()
ax2.legend_.remove()
handles, labels = ax1.get_legend_handles_labels()
fig.legend(handles, labels, title="Cohort", loc="center left", bbox_to_anchor=(1.02, 0.5))

plt.tight_layout()
plt.savefig("/output/3.6_2_Cohort_Profiling_Race.png", dpi=300)
plt.show()

[SCREENSHOT: JupyterLab output showing side-by-side bar plots for race distribution by cohort]

Statistical test — Fisher's exact test for race:

contingency = pd.crosstab(combined_df['Cohort'], combined_df['race'])
print(contingency)

odds_ratio, p_value = fisher_exact(contingency)
print("Odds Ratio:", odds_ratio)
print("P-value:", p_value)

Section 3.6.3 — Ethnicity Distribution

# Count patients by ethnicity and cohort
ethnicity_counts = combined_df.groupby(
    ['Cohort', 'ethnicity'], observed=True
).size().reset_index(name='Count')

# Calculate percentage within each cohort
ethnicity_counts["Percent"] = (
    ethnicity_counts.groupby("Cohort", observed=True)["Count"]
    .transform(lambda x: x / x.sum() * 100)
)

fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Plot 1: Ethnicity Counts
ax1 = sns.barplot(
    data=ethnicity_counts, x="ethnicity", y="Count",
    hue="Cohort", hue_order=cohort_order, palette=cohort_palette, ax=axes[0]
)
ax1.set_title("Ethnicity Distribution by Cohort (Counts)")
ax1.set_xlabel("Ethnicity")
ax1.set_ylabel("Number of Patients")
for container in ax1.containers:
    ax1.bar_label(container, padding=3, fontsize=10, fontweight="bold")

# Plot 2: Ethnicity Percentage
ax2 = sns.barplot(
    data=ethnicity_counts, x="ethnicity", y="Percent",
    hue="Cohort", hue_order=cohort_order, palette=cohort_palette, ax=axes[1]
)
ax2.set_title("Ethnicity Distribution by Cohort (%)")
ax2.set_xlabel("Ethnicity")
ax2.set_ylabel("Percentage")
for container in ax2.containers:
    ax2.bar_label(container, fmt="%.1f%%", padding=3, fontsize=10, fontweight="bold")

# Shared legend
ax1.legend_.remove()
ax2.legend_.remove()
handles, labels = ax1.get_legend_handles_labels()
fig.legend(handles, labels, title="Cohort", loc="center left", bbox_to_anchor=(1.02, 0.5))

plt.tight_layout()
plt.savefig("/output/3.6_3_Cohort_Profiling_Ethnicity.png", dpi=300)
plt.show()

[SCREENSHOT: JupyterLab output showing side-by-side bar plots for ethnicity distribution by cohort]

Statistical test — Fisher's exact test for ethnicity:

contingency = pd.crosstab(combined_df['Cohort'], combined_df['ethnicity'])
print(contingency)

odds_ratio, p_value = fisher_exact(contingency)
print("Odds Ratio:", odds_ratio)
print("P-value:", p_value)

Section 3.7 — Medical Conditions

This section analyses the clinical conditions recorded for patients in each cohort, first by comparing the total number of unique conditions per patient, then by identifying and comparing the top 15 conditions across cohorts.

Section 3.7.1 — Unique Conditions per Patient

Step 1 — Count and Plot Unique Conditions

# Subset condition_occurrence to cohort patients only
cohort_person_ids = combined_df["person_id"].unique()

condition_subset = condition_occ[
    condition_occ["person_id"].isin(cohort_person_ids)
]

# Count unique conditions per patient
conditions_per_patient = (
    condition_subset
    .groupby("person_id")["condition"]
    .nunique()
    .reset_index(name="Unique_Condition_Count")
)

# Attach cohort label
conditions_per_patient = conditions_per_patient.merge(
    combined_df[["person_id", "Cohort"]],
    on="person_id",
    how="left"
)

# Sort by condition count within each cohort
conditions_per_patient_sorted = (
    conditions_per_patient
    .sort_values(["Cohort", "Unique_Condition_Count"])
)

# Plot unique conditions per patient, separated by cohort
fig, axes = plt.subplots(1, 2, figsize=(14, 6), sharey=True)

# Cohort A
cohort_a_data = conditions_per_patient_sorted[
    conditions_per_patient_sorted["Cohort"] == "Metoprolol only"
]
sns.barplot(
    data=cohort_a_data, x="person_id", y="Unique_Condition_Count",
    ax=axes[0], color="steelblue", order=cohort_a_data["person_id"]
)
axes[0].set_title("Cohort A: Metoprolol only", fontweight="bold")
axes[0].set_xlabel("Person ID", fontweight="bold")
axes[0].set_ylabel("Unique Condition Count", fontweight="bold")
axes[0].tick_params(axis='x', rotation=90)
for container in axes[0].containers:
    axes[0].bar_label(container, fontweight="bold")

# Cohort B
cohort_b_data = conditions_per_patient_sorted[
    conditions_per_patient_sorted["Cohort"] == "Metoprolol + Amiodarone"
]
sns.barplot(
    data=cohort_b_data, x="person_id", y="Unique_Condition_Count",
    ax=axes[1], color="#d62728", order=cohort_b_data["person_id"]
)
axes[1].set_title("Cohort B: Metoprolol + Amiodarone", fontweight="bold")
axes[1].set_xlabel("Person ID", fontweight="bold")
axes[1].tick_params(axis='x', rotation=90)
for container in axes[1].containers:
    axes[1].bar_label(container, fontweight="bold")

plt.suptitle("Unique Conditions per Patient", fontweight="bold")
plt.tight_layout()
plt.savefig("/output/3.7_1_Unique_conditions.png")
plt.show()

[SCREENSHOT: JupyterLab output showing side-by-side bar plots of unique condition counts per patient for each cohort]

Step 2 — Statistical Comparison (Mann-Whitney U Test)

sns.set_style("white")
plt.rcParams["axes.facecolor"] = "white"
plt.rcParams["figure.facecolor"] = "white"

# Separate cohort groups
a = conditions_per_patient[
    conditions_per_patient["Cohort"] == "Metoprolol only"
]["Unique_Condition_Count"]

b = conditions_per_patient[
    conditions_per_patient["Cohort"] == "Metoprolol + Amiodarone"
]["Unique_Condition_Count"]

# Mann-Whitney U test
stat, p = mannwhitneyu(a, b)

median_a = a.median()
median_b = b.median()

# Boxplot with individual data points overlaid
plt.figure(figsize=(8, 6))

sns.boxplot(
    data=conditions_per_patient,
    x="Cohort", y="Unique_Condition_Count",
    hue="Cohort", palette=["#1f77b4", "#d62728"],
    boxprops=dict(alpha=0.6), legend=False
)

sns.stripplot(
    data=conditions_per_patient,
    x="Cohort", y="Unique_Condition_Count",
    color="black", size=7, jitter=0.15, zorder=10
)

# Annotate with p-value and medians
plt.text(
    0.5, conditions_per_patient["Unique_Condition_Count"].max() + 0.6,
    f"Mann–Whitney p = {p:.3f}", ha="center", fontsize=12, fontweight="bold"
)
plt.text(0, median_a + 65, f"Median = {median_a}", ha='center', fontweight='bold')
plt.text(1, median_b + 59.5, f"Median = {median_b}", ha='center', fontweight='bold')

plt.title("Unique Conditions per Patient by Cohort", fontweight="bold")
plt.ylabel("Unique Condition Count")
plt.xlabel("")
sns.despine()
plt.tight_layout()
plt.savefig("/output/3.7_1_Number_of_Unique_conditions.png", dpi=300)
plt.show()

[SCREENSHOT: JupyterLab output showing the boxplot with individual data points and Mann-Whitney p-value annotation]

Section 3.7.2 — Top 15 Conditions Comparison

Checkpoint 2 — Build Condition Counts by Cohort

# Filter condition_occurrence to cohort patients
cohort_person_ids = combined_df["person_id"].unique()

condition_occ_subset = condition_occ[
    condition_occ["person_id"].isin(cohort_person_ids)
]

# Attach cohort labels
condition_with_cohort = condition_occ_subset.merge(
    combined_df[["person_id", "Cohort"]],
    on="person_id",
    how="left"
)

# Count occurrences of each condition per cohort
condition_counts = (
    condition_with_cohort
    .groupby(["condition", "Cohort"], observed=True)
    .size()
    .reset_index(name="Count")
)

Step 1 — Venn Diagram of Top 15 Conditions

# Top 15 conditions for each cohort
top15_A_named = (
    condition_counts[condition_counts["Cohort"] == "Metoprolol only"]
    .sort_values("Count", ascending=False)
    .head(15)
)

top15_B_named = (
    condition_counts[condition_counts["Cohort"] == "Metoprolol + Amiodarone"]
    .sort_values("Count", ascending=False)
    .head(15)
)

# Venn diagram of top 15 condition overlap
set_A = set(top15_A_named["condition"])
set_B = set(top15_B_named["condition"])

plt.figure(figsize=(6, 6))
venn2(
    [set_A, set_B],
    set_labels=("Metoprolol only", "Metoprolol + Amiodarone"),
    set_colors=("#1f77b4", "#d62728")
)
plt.title("Overlap of Top 15 Conditions Between Cohorts", fontweight="bold")
plt.savefig("/output/3.7_2_Overlap_Medical_Conditions.png", dpi=300)
plt.show()

[SCREENSHOT: JupyterLab output showing the Venn diagram of top 15 condition overlap between cohorts]

Step 2 — Three-Panel Breakdown of Conditions

This block categorises the top 15 conditions from each cohort into three groups — unique to Cohort A, shared by both, and unique to Cohort B — and displays them as a three-panel chart.

# Identify overlapping and unique condition sets
overlap_ids  = set_A & set_B
unique_A_ids = set_A - set_B
unique_B_ids = set_B - set_A

overlap_conditions  = top15_A_named[top15_A_named["condition"].isin(overlap_ids)]
unique_A_conditions = top15_A_named[top15_A_named["condition"].isin(unique_A_ids)]
unique_B_conditions = top15_B_named[top15_B_named["condition"].isin(unique_B_ids)]

# Sort condition names alphabetically for consistent display
df_A       = pd.DataFrame({"condition": sorted(unique_A_conditions["condition"].tolist())})
df_overlap = pd.DataFrame({"condition": sorted(overlap_conditions["condition"].tolist())})
df_B       = pd.DataFrame({"condition": sorted(unique_B_conditions["condition"].tolist())})

# Three-panel visualisation
fig, axes = plt.subplots(1, 3, figsize=(18, 8), sharey=False)

# Left: Metoprolol only
sns.barplot(data=df_A, y="condition", x=[1]*len(df_A), ax=axes[0], color="steelblue")
axes[0].set_title("Unique to Metoprolol only", fontweight="bold")
axes[0].set_ylabel("Condition", fontweight="bold")
axes[0].set_xlabel("")
axes[0].set_xticks([])

# Middle: Overlap
sns.barplot(data=df_overlap, y="condition", x=[1]*len(df_overlap), ax=axes[1], color="gray")
axes[1].set_title("Overlap", fontweight="bold")
axes[1].set_ylabel("")
axes[1].set_xlabel("")
axes[1].set_xticks([])

# Right: Metoprolol + Amiodarone
sns.barplot(data=df_B, y="condition", x=[1]*len(df_B), ax=axes[2], color="orange")
axes[2].set_title("Unique to Metoprolol + Amiodarone", fontweight="bold")
axes[2].set_ylabel("")
axes[2].set_xlabel("")
axes[2].set_xticks([])

plt.tight_layout()
plt.savefig("/output/3.7_2_Top15_Conditions.png", dpi=300)
plt.show()

[SCREENSHOT: JupyterLab output showing the three-panel bar chart of conditions unique to each cohort and shared between them]

Section 3.8 — Drug Exposure

This section identifies and visualises the top 15 concomitant medications administered to patients in each cohort, based on the number of unique patients receiving each drug.

# Combine patient IDs from both cohorts
cohort_ids = pd.concat([cohort_a["person_id"], cohort_b["person_id"]]).unique()

# Filter drug_exposure to cohort patients
drug_subset = drug_exp[drug_exp["person_id"].isin(cohort_ids)]

# Attach cohort label
drug_subset = drug_subset.merge(
    combined_df[["person_id", "Cohort"]],
    on="person_id",
    how="left"
)

# Count unique patients receiving each drug, per cohort
drug_counts = (
    drug_subset
    .groupby(["drug", "Cohort"], observed=True)["person_id"]
    .nunique()
    .reset_index(name="Patient_Count")
)

# Identify top 15 drugs by total patient count across both cohorts
top_drugs = (
    drug_counts
    .groupby("drug")["Patient_Count"]
    .sum()
    .sort_values(ascending=False)
    .head(15)
    .index
)

drug_plot = drug_counts[drug_counts["drug"].isin(top_drugs)]

# Plot top concomitant medications by cohort
plt.figure(figsize=(16, 6))

ax = sns.barplot(
    data=drug_plot,
    x="drug", y="Patient_Count",
    hue="Cohort",
    palette={"Metoprolol only": "#1f77b4", "Metoprolol + Amiodarone": "#d62728"}
)

plt.xticks(rotation=60, ha="right")
plt.xlabel("Drug Administered")
plt.ylabel("Number of Patients")
plt.title("Top Concomitant Medications by Cohort\n")

for container in ax.containers:
    ax.bar_label(container)

ax.legend(title="Cohort", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.savefig("/output/3.8_Drug_Exposure.png")
plt.show()

[SCREENSHOT: JupyterLab output showing the grouped bar chart of top 15 concomitant medications by cohort]

Section 3.9 — Length of Stay (LOS)

This section calculates the maximum inpatient length of stay per patient (in days) and compares distributions between cohorts using a boxplot and the Mann-Whitney U test.

# Filter visit_occurrence to cohort patients
cohort_ids = pd.concat([cohort_a["person_id"], cohort_b["person_id"]]).unique()

visits = visit_occurence[visit_occurence["person_id"].isin(cohort_ids)]

# Attach cohort label
visits = visits.merge(
    combined_df[["person_id", "Cohort"]],
    on="person_id",
    how="left"
)

# Calculate LOS in days for each visit
visits["LOS_days"] = (
    pd.to_datetime(visits["visit_end_date"]) -
    pd.to_datetime(visits["visit_start_date"])
).dt.days

# Take the maximum LOS per patient (to capture the longest admission)
los_patient = (
    visits
    .groupby(["person_id", "Cohort"], observed=True)["LOS_days"]
    .max()
    .reset_index()
)

# Boxplot comparing LOS between cohorts
plt.figure(figsize=(7, 5))
sns.boxplot(
    data=los_patient,
    x="Cohort", y="LOS_days",
    hue="Cohort",
    palette={"Metoprolol only": "#1f77b4", "Metoprolol + Amiodarone": "#d62728"}
)
plt.ylabel("Length of Stay (days)")
plt.title("Length of Stay Comparison Between Cohorts")
plt.savefig("/output/3.9_Length_of_Stay.png", dpi=300)
plt.show()

[SCREENSHOT: JupyterLab output showing the LOS boxplot comparing the two cohorts]

Median LOS per cohort:

los_patient.groupby("Cohort", observed=True)["LOS_days"].median()

Statistical test — Mann-Whitney U test for LOS:

cohort_a_los = los_patient[los_patient["Cohort"] == "Metoprolol only"]["LOS_days"]
cohort_b_los = los_patient[los_patient["Cohort"] == "Metoprolol + Amiodarone"]["LOS_days"]

stat, p_value = mannwhitneyu(cohort_a_los, cohort_b_los, alternative='two-sided')
print(f"Mann-Whitney U test statistic: {stat}")
print(f"P-value: {p_value}")

Section 3.10 — Kaplan-Meier Survival Analysis

This section estimates and compares the survival functions of both cohorts using Kaplan-Meier analysis, with time measured in days from each patient's first drug exposure. The log-rank test is used to assess whether any observed difference in survival is statistically significant.

Step 1 — Build the Survival DataFrame and Plot the KM Curve

# Anchor time 0 to each patient's first drug exposure date
drug_start = (
    drug_subset
    .sort_values("drug_exposure_start_date")
    .groupby("person_id")
    .first()
    .reset_index()[["person_id", "drug_exposure_start_date"]]
)

# Merge death dates
survival_df = drug_start.merge(
    death[["person_id", "death_date"]],
    on="person_id", how="left"
)

# Attach cohort labels
survival_df = survival_df.merge(
    combined_df[["person_id", "Cohort"]],
    on="person_id", how="left"
)

# Parse dates
survival_df["drug_exposure_start_date"] = pd.to_datetime(
    survival_df["drug_exposure_start_date"]
)
survival_df["death_date"] = pd.to_datetime(survival_df["death_date"])

# Event indicator: 1 = death recorded, 0 = censored
survival_df["event"] = survival_df["death_date"].notna().astype(int)

# For censored patients, use observation period end date as the end date
survival_df = survival_df.merge(
    obs_period[["person_id", "observation_period_end_date"]],
    on="person_id", how="left"
)

survival_df["end_date"] = survival_df["death_date"].fillna(
    survival_df["observation_period_end_date"]
)

# Calculate follow-up duration in days
survival_df["time_days"] = (
    survival_df["end_date"] - survival_df["drug_exposure_start_date"]
).dt.days

# Fit and plot KM curves for each cohort
kmf = KaplanMeierFitter()

plt.figure(figsize=(7, 5))

colors = {
    "Metoprolol only": "#1f77b4",
    "Metoprolol + Amiodarone": "#d62728"
}

for cohort, df in survival_df.groupby("Cohort", observed=True):
    kmf.fit(
        durations=df["time_days"],
        event_observed=df["event"],
        label=cohort
    )
    kmf.plot_survival_function(
        ci_show=False, color=colors[cohort], linewidth=2
    )

plt.title("Kaplan-Meier Survival Curve by Treatment Cohort")
plt.xlabel("Days since drug exposure")
plt.ylabel("Survival probability")
plt.legend(title="Cohort")
plt.grid(alpha=0.3)
plt.savefig("/output/3.10_KM_Survival_Plot.png", dpi=300)
plt.show()

[SCREENSHOT: JupyterLab output showing the Kaplan-Meier survival curves for both cohorts]

Step 2 — Log-rank Test

The log-rank test assesses whether the difference in survival functions between the two cohorts is statistically significant.

group1 = survival_df[survival_df["Cohort"] == "Metoprolol only"]
group2 = survival_df[survival_df["Cohort"] == "Metoprolol + Amiodarone"]

results = logrank_test(
    group1["time_days"],
    group2["time_days"],
    event_observed_A=group1["event"],
    event_observed_B=group2["event"]
)

print("Log-rank test statistic:", results.test_statistic)
print("P-value:", results.p_value)

[SCREENSHOT: JupyterLab output showing the printed log-rank test statistic and p-value]

Note: All plots generated in this analysis are automatically saved to the /output/ directory on your workstation. These files are accessible under the Results tab in the TRE workstation details pane and can be downloaded via the standard data download workflow.

What's Next

Download your plots and results — See Downloading Data Results for the file download workflow from the Results tab.
Run a bioinformatics pipeline — See Dataset Analysis using Pipelines.