Lung Cancer Progression and Outcomes
This case study presents an exploratory, end-to-end oncology analysis pipeline focused on modelling disease progression and outcomes in lung cancer patients using real-world clinical data.
It combines structured EHR data (laboratory values, medications, demographics) with AI-extracted clinical signals from free-text notes to derive progression-free survival (PFS) metrics and prototype survival models.
The emphasis is methodological: demonstrating how heterogeneous clinical data sources can be integrated into reproducible oncology analytics.
Note: Because the oncology population in MIMIC-IV is small and heterogeneous, this case study is not intended to support definitive clinical inference. It serves as a methodological demonstration of clinical NLP integration and survival-analysis workflows.
1 · Build the lung cancer cohort
Objective: Identify patients with a primary lung cancer diagnosis and define a reproducible start point for follow-up.
Method:
- Selected admissions with ICD-10 codes C34* or ICD-9 codes 162* (malignant neoplasm of the lung)
- Defined the first qualifying hospital encounter as the index date
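A minimal sketch of the selection logic above, assuming diagnosis rows carry a patient identifier, an admission date, an ICD version, and an ICD code stored without dots. The row layout, the sample data, and the function names are illustrative assumptions, not the pipeline's actual SQL.

```python
from datetime import date

# Hypothetical diagnosis rows: (subject_id, admission date, ICD version, ICD code).
diagnoses = [
    (1, date(2150, 3, 1), 10, "C3490"),
    (1, date(2150, 1, 5), 9, "1629"),
    (2, date(2151, 6, 9), 10, "I10"),
    (3, date(2149, 8, 2), 10, "C341"),
]

def is_lung_cancer(icd_version: int, icd_code: str) -> bool:
    """True for ICD-10 C34* or ICD-9 162* codes."""
    code = icd_code.replace(".", "").upper()
    if icd_version == 10:
        return code.startswith("C34")
    if icd_version == 9:
        return code.startswith("162")
    return False

def build_cohort(rows):
    """Map each qualifying patient to the date of their earliest qualifying admission."""
    index_dates = {}
    for subject_id, admit_date, version, code in rows:
        if is_lung_cancer(version, code):
            previous = index_dates.get(subject_id)
            if previous is None or admit_date < previous:
                index_dates[subject_id] = admit_date
    return index_dates

cohort = build_cohort(diagnoses)  # one entry per patient: subject_id -> index date
```

Checking the ICD version alongside the code prefix keeps the two coding systems from being matched against each other's patterns.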
Output:
lung_cancer_cohort.csv: one row per patient (4 220 patients)
2 · Extract progression / response events
Objective: Identify the earliest documented evidence of disease progression or treatment response.
Method:
- Applied fine-tuned clinical NLP models to radiology reports and discharge summaries (Kehl et al., 2024)
- Models output per-note probabilities for progression, response, and metastatic involvement (brain, bone, liver, adrenal, lung, lymph nodes, peritoneum)
- Flagged the earliest note with a predicted probability ≥ 0.50
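The flagging rule can be sketched as follows. The tuple layout, the sample predictions, and the choice to check progression before response when both clear the threshold in the same note are assumptions made for illustration; the NLP models themselves are not reproduced here.

```python
from datetime import date

THRESHOLD = 0.50  # probability cut-off stated in the method

# Hypothetical per-note predictions: (subject_id, note date, P(progression), P(response)).
note_preds = [
    (1, date(2150, 2, 1), 0.12, 0.08),
    (1, date(2150, 4, 10), 0.71, 0.05),
    (1, date(2150, 6, 3), 0.90, 0.02),
    (3, date(2149, 9, 15), 0.20, 0.50),
    (4, date(2150, 1, 1), 0.30, 0.10),
]

def earliest_flagged_notes(preds, threshold=THRESHOLD):
    """Return {subject_id: (note_date, label)} for each patient's first note at or above threshold."""
    first = {}
    # Sort by patient, then chronologically, so the first hit per patient is the earliest.
    for subject_id, note_date, p_prog, p_resp in sorted(preds, key=lambda r: (r[0], r[1])):
        if subject_id in first:
            continue
        if p_prog >= threshold:
            first[subject_id] = (note_date, "progression")
        elif p_resp >= threshold:
            first[subject_id] = (note_date, "response")
    return first

events = earliest_flagged_notes(note_preds)
```

Patients with no note at or above the threshold (patient 4 in the sample) are absent from the result and would be treated as censored downstream.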
Outputs:
mimic_lung_radiology_preds.csv
mimic_lung_discharge_preds.csv
3 · Derive progression-free survival (PFS)
Objective: Quantify time from the index date (used as a proxy for diagnosis) to the first documented progression or response.
Method:
- Start: index date
- Event: earliest progression or response signal
- Censoring: date of last available clinical note
- Computed PFS in days per patient
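As a rough sketch of the start / event / censoring rules above: the function name and the sample dates are hypothetical, and the sketch assumes any supplied event date falls on or after the index date.

```python
from datetime import date

def pfs_days(index_date, event_date, last_note_date):
    """Return (duration_in_days, event_observed) for one patient.

    event_date is None when no progression/response signal was flagged;
    the patient is then censored at the date of the last available note.
    """
    if event_date is not None:
        return (event_date - index_date).days, True
    return (last_note_date - index_date).days, False

# Event observed 100 days after the index date.
with_event = pfs_days(date(2150, 1, 1), date(2150, 4, 11), date(2150, 6, 3))
# No event flagged: censored at the last note, 30 days after the index date.
censored = pfs_days(date(2150, 1, 1), None, date(2150, 1, 31))
```

Keeping the event indicator next to the duration matters because the survival methods used later need both values to handle censored patients correctly.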
Output:
lung_events.csv: event dates and PFS durations
4 · Generate baseline features
Objective: Capture baseline patient characteristics relevant for exploratory risk modelling.
Method:
- Demographics: age and sex
- Biomarkers: first LDH measurement within ±7 days of index date
- Treatment exposure: flags for platinum chemotherapy, TKIs, and checkpoint inhibitors (regex-based drug mapping)
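A sketch of how the LDH window and the regex-based drug flags might look. The drug lists below are a small illustrative subset chosen for this example, not the mapping used in the pipeline, and the function names and sample values are hypothetical.

```python
import re
from datetime import date, timedelta

# Illustrative (incomplete) drug-name patterns per treatment class.
DRUG_PATTERNS = {
    "platinum": re.compile(r"\b(cisplatin|carboplatin)\b", re.I),
    "tki": re.compile(r"\b(erlotinib|gefitinib|osimertinib|crizotinib|alectinib)\b", re.I),
    "checkpoint": re.compile(r"\b(pembrolizumab|nivolumab|atezolizumab|durvalumab)\b", re.I),
}

def drug_flags(drug_names):
    """Return one boolean flag per treatment class for a patient's medication strings."""
    return {
        drug_class: any(pattern.search(name) for name in drug_names)
        for drug_class, pattern in DRUG_PATTERNS.items()
    }

def baseline_ldh(labs, index_date, window_days=7):
    """Earliest LDH value measured within ±window_days of the index date, else None.

    labs is a list of (measurement_date, value) tuples.
    """
    window = timedelta(days=window_days)
    in_window = [(d, v) for d, v in labs if abs(d - index_date) <= window]
    return min(in_window)[1] if in_window else None

flags = drug_flags(["Carboplatin 450 mg IV", "ondansetron 8 mg"])
ldh = baseline_ldh(
    [(date(2149, 12, 20), 400.0), (date(2150, 1, 6), 250.0), (date(2150, 1, 12), 210.0)],
    index_date=date(2150, 1, 8),
)
```

Word boundaries in the patterns keep a drug name from matching inside a longer, unrelated token, and the case-insensitive flag absorbs inconsistent capitalisation in medication strings.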
Output:
lung_baseline.csv: baseline features per patient
5 · Merge & analyse
Objective: Assemble a unified analytic dataset and perform exploratory survival analysis.
Method:
- Merged cohort, event, and baseline feature tables
- Generated Kaplan–Meier curves (overall and stratified by treatment class)
- Fit a Cox proportional-hazards model using age, sex, LDH, and treatment flags
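To make the Kaplan–Meier step concrete, here is a dependency-free sketch of the product-limit estimator on toy data. It is not the pipeline's code: the write-up does not name the survival library used, and in practice a package such as lifelines (`KaplanMeierFitter`, `CoxPHFitter`) would be a more likely choice. The Cox model is not reproduced here.

```python
def kaplan_meier(durations, observed):
    """Product-limit estimate: [(time, survival_probability)] at each distinct event time."""
    pairs = sorted(zip(durations, observed))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        events = removed = 0
        # Group all patients whose follow-up ends at time t.
        while i < len(pairs) and pairs[i][0] == t:
            events += 1 if pairs[i][1] else 0
            removed += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set after time t.
        at_risk -= removed
    return curve

# Toy data: PFS in days and whether the event was observed (False = censored).
curve = kaplan_meier([5, 8, 8, 12, 20], [True, True, False, True, False])
```

On the toy data the estimate steps down at days 5, 8, and 12 to roughly 0.8, 0.6, and 0.3. Censored patients reduce the risk set without producing a step. Running the function separately on each treatment group would give the stratified curves.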
Output:
lung_pfs.csv: final analytic table (3 204 patients with follow-up)
Pipeline Overview
SQL cohort ─┐
            ├──► 1 Export lung_radiology.csv ──► 2 NLP inference (DFCI-imaging) ─┐
            ├──► 1 Export lung_discharge.csv ──► 2 NLP inference (DFCI-medonc) ──┤
            │                                                                    │
            └──► 3 Baseline extract (age/sex/LDH) ───────────────────────────────┤
                                                                                 ▼
                                                  4 Derive events ──► lung_events.csv
                                                  5 Drug-class flags ─► lung_drug_flags.csv
                                                  6 Merge ─► lung_pfs.csv
                                                  7 KM + Cox plots ─► notebook / report
Ongoing Extensions (Exploratory)
6 · Toward automated cancer staging
Goal:
Approximate disease stage using site-specific metastasis predictions extracted from radiology notes.
Approach:
- Aggregated metastasis probabilities (brain, bone, liver, adrenal, lung, lymph nodes, peritoneum) into a composite “extent of disease” score
- Mapped these signals to simplified staging categories (e.g. localised vs metastatic)
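One possible form of the aggregation, shown only as a sketch: it assumes the composite score is the count of sites whose probability reaches a threshold, and that any involved site maps to "metastatic". The write-up does not specify either rule, so both should be read as placeholders.

```python
SITES = ("brain", "bone", "liver", "adrenal", "lung", "lymph_nodes", "peritoneum")

def extent_of_disease(site_probs, threshold=0.50):
    """Return (score, category) from per-site metastasis probabilities.

    score    = number of sites with probability at or above threshold (assumed rule)
    category = "metastatic" if any site is involved, else "localised" (assumed rule)
    """
    involved = [site for site in SITES if site_probs.get(site, 0.0) >= threshold]
    category = "metastatic" if involved else "localised"
    return len(involved), category

multi_site = extent_of_disease({"brain": 0.8, "bone": 0.6, "liver": 0.1})
no_site = extent_of_disease({"liver": 0.2})
```

Other aggregations, such as the maximum or a weighted sum of the probabilities, would fit the same interface.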
Rationale:
Bridges raw NLP-derived outputs with clinically interpretable staging proxies.
7 · Predicting early progression
Goal:
Identify patients likely to experience disease progression within 90–180 days of the index date.
Approach:
- Features: demographics, baseline LDH, treatment flags, and aggregated NLP outputs
- Models: XGBoost and regularised logistic regression
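Before either model can be fit, each patient needs a binary target. The sketch below shows one way to derive it from the PFS duration and event indicator. The 180-day default and the handling of censored patients are assumptions, not details given in the write-up.

```python
def early_progression_label(duration_days, progressed, horizon_days=180):
    """Binary target for early-progression models.

    1    = progression observed within the horizon
    0    = followed to the horizon or beyond without progression in that window
    None = censored before the horizon, so the outcome is unknown
    """
    if progressed and duration_days <= horizon_days:
        return 1
    if duration_days >= horizon_days:
        return 0
    return None

labels = [
    early_progression_label(90, True),    # progressed early
    early_progression_label(200, True),   # progressed, but after the horizon
    early_progression_label(250, False),  # censored after the horizon
    early_progression_label(100, False),  # censored before the horizon
]
```

Patients labelled `None` cannot be assigned to either class without further assumptions. Excluding them is the simplest option, but it can bias the training set if early censoring is related to outcome.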
Rationale:
Explores feasibility of early risk stratification and supports hypothesis generation for clinical trial matching.
8 · Exploratory patient embeddings
Goal:
Represent patients in a latent clinical space to uncover similarity patterns across disease trajectories.
Approach:
- Embedded laboratory values, treatment exposure, and NLP-derived clinical signals into a low-dimensional representation
- Visualised patient clusters associated with progression dynamics and therapeutic exposure
Diagram
               ┌─────────────────────────────┐
               │    Core Pipeline Output     │
               │       (lung_pfs.csv)        │
               └─────────────┬───────────────┘
                             │
        ┌────────────────────┼───────────────────────┐
        │                    │                       │
        ▼                    ▼                       ▼
┌───────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│ Automated     │ │ Early Progression   │ │ Patient Embeddings  │
│ Staging       │ │ Prediction          │ │ (Clinical Similarity│
│ (metastasis → │ │ (XGBoost / Logistic │ │ & Phenotyping)      │
│ stage)        │ │ Regression)         │ └─────────────────────┘
└───────────────┘ └─────────────────────┘