Lung Cancer Progression and Outcomes
Work in progress: analysis outputs, the notebook, and data files are still being added.
This case study explores progression and outcomes in lung cancer patients.
It combines structured data (labs, drugs, demographics) with AI-extracted events from free-text notes to build progression-free survival (PFS) metrics and survival models:
1 · Build the lung cancer cohort
Purpose: Identify all patients with a primary lung cancer diagnosis
How:
- Selected admissions with ICD‑10 codes C34* or ICD‑9 codes 162* (malignant neoplasm of the lung)
- Took the first hospital encounter meeting this definition as the “start point” for follow‑up (index date)
Output:
lung_cancer_cohort.csv: one row per patient (4,220 patients)
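The selection logic above can be sketched in a few lines. This is a minimal illustration rather than the project's actual SQL: the record layout and field names (subject_id, admit_date, icd_codes) are assumptions made for the example.

```python
from datetime import date

# Hypothetical admission records; field names are illustrative assumptions.
admissions = [
    {"subject_id": 1, "admit_date": date(2015, 3, 1), "icd_codes": ["I10", "C3490"]},
    {"subject_id": 1, "admit_date": date(2014, 6, 9), "icd_codes": ["1629"]},
    {"subject_id": 2, "admit_date": date(2016, 1, 5), "icd_codes": ["J189"]},
    {"subject_id": 3, "admit_date": date(2017, 8, 20), "icd_codes": ["C34.11"]},
]

def is_lung_cancer(code: str) -> bool:
    """True for ICD-10 C34* or ICD-9 162* (dots removed before matching)."""
    normalized = code.replace(".", "").upper()
    return normalized.startswith(("C34", "162"))

def build_cohort(rows):
    """Return {subject_id: index_date}, using the earliest qualifying admission."""
    index_dates = {}
    for row in rows:
        if any(is_lung_cancer(c) for c in row["icd_codes"]):
            sid = row["subject_id"]
            if sid not in index_dates or row["admit_date"] < index_dates[sid]:
                index_dates[sid] = row["admit_date"]
    return index_dates

cohort = build_cohort(admissions)
```

Patient 1 has two qualifying admissions, so the earlier one becomes the index date; patient 2 has no lung cancer code and is excluded.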
2 · Extract progression / response events
Purpose: Detect when imaging or clinical notes first report disease progression or treatment response.
How:
- Applied fine‑tuned clinical NLP models to radiology reports and discharge summaries (Kehl et al., 2024)
- Models produce per‑note probabilities for progression, response, and metastases at seven sites (brain, bone, liver, adrenal, lung, lymph nodes, peritoneum).
- Flagged the earliest note where progression/response probability ≥ 0.50
Outputs:
mimic_lung_radiology_preds.csv
mimic_lung_discharge_preds.csv
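As a rough sketch of the flagging step, the snippet below finds each patient's earliest note whose probability meets the 0.50 threshold. The record layout and field names are assumptions for illustration; the real prediction files may be organized differently.

```python
from datetime import date

THRESHOLD = 0.50  # cut-off stated in the text

# Hypothetical per-note predictions; field names are assumptions.
note_preds = [
    {"subject_id": 1, "note_date": date(2015, 1, 10), "progression": 0.12, "response": 0.30},
    {"subject_id": 1, "note_date": date(2015, 4, 2), "progression": 0.81, "response": 0.05},
    {"subject_id": 1, "note_date": date(2015, 7, 15), "progression": 0.93, "response": 0.02},
    {"subject_id": 2, "note_date": date(2016, 2, 1), "progression": 0.20, "response": 0.10},
]

def earliest_flag(preds, label):
    """Map subject_id -> date of the first note with preds[label] >= THRESHOLD."""
    first = {}
    for p in preds:
        if p[label] >= THRESHOLD:
            sid = p["subject_id"]
            if sid not in first or p["note_date"] < first[sid]:
                first[sid] = p["note_date"]
    return first

first_progression = earliest_flag(note_preds, "progression")
first_response = earliest_flag(note_preds, "response")
```

Patients with no note at or above the threshold simply do not appear in the result, which is what later allows them to be treated as censored.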
3 · Derive progression‑free survival (PFS)
Purpose: Quantify how long patients remain progression‑free after their index date
How:
- Start = index date
- Event = earliest progression or response note
- Censor = date of the last available note (if no event observed)
- Calculated PFS in days for each patient
Output:
lung_events.csv: includes event dates and PFS durations
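The start, event, and censor rules translate into a small function. This is a sketch under the assumption that each patient has an index date, an optional event date, and a last-note date; the function name and signature are hypothetical.

```python
from datetime import date

def derive_pfs(index_date, event_date, last_note_date):
    """Return (pfs_days, event_observed) following the rules listed above."""
    if event_date is not None:
        # Event observed: PFS runs from the index date to the earliest event note.
        return (event_date - index_date).days, 1
    # No event observed: censor at the date of the last available note.
    return (last_note_date - index_date).days, 0

with_event = derive_pfs(date(2015, 1, 1), date(2015, 4, 11), date(2016, 1, 1))
censored = derive_pfs(date(2015, 1, 1), None, date(2015, 12, 31))
```

Returning the event indicator alongside the duration keeps the output directly usable by survival routines, which need both values.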
4 · Generate baseline features
Purpose: Add contextual patient characteristics for risk modelling
How:
- Demographics: age and sex (from the patients table)
- Biomarkers: first LDH (lactate dehydrogenase) within ±7 days of index date (tumor burden marker)
- Treatment flags: checked prescriptions for platinum chemo, TKIs, and checkpoint inhibitors (regex‑based drug mapping)
Output:
lung_baseline.csv: one row per patient with these features
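The LDH window and the regex-based drug mapping might look roughly like this. The drug lists are short illustrative examples of each class, not the project's actual mapping, and "first LDH" is interpreted here as the chronologically earliest value inside the window; both are assumptions.

```python
import re
from datetime import date, timedelta

# Illustrative patterns only; the project's real drug mapping is not shown in this README.
DRUG_CLASSES = {
    "platinum": re.compile(r"\b(cisplatin|carboplatin)\b", re.IGNORECASE),
    "tki": re.compile(r"\b(erlotinib|gefitinib|osimertinib)\b", re.IGNORECASE),
    "checkpoint": re.compile(r"\b(pembrolizumab|nivolumab|atezolizumab)\b", re.IGNORECASE),
}

def drug_flags(prescriptions):
    """Return {class_name: 0/1} for a patient's list of drug-name strings."""
    return {
        name: int(any(pattern.search(rx) for rx in prescriptions))
        for name, pattern in DRUG_CLASSES.items()
    }

def baseline_ldh(labs, index_date, window_days=7):
    """Earliest LDH value within +/- window_days of the index date, else None.

    labs is a list of (lab_date, value) tuples.
    """
    window = timedelta(days=window_days)
    in_window = [
        (lab_date, value) for lab_date, value in labs
        if abs(lab_date - index_date) <= window
    ]
    return min(in_window)[1] if in_window else None

flags = drug_flags(["Carboplatin 450 mg IV", "ondansetron 8 mg"])
ldh = baseline_ldh(
    [(date(2015, 2, 20), 300.0), (date(2015, 2, 27), 410.0), (date(2015, 3, 3), 380.0)],
    index_date=date(2015, 3, 1),
)
```

In the example, the 20 February result falls outside the window, so the 27 February value is the one kept.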
5 · Merge & analyze
Purpose: Build a single analytic dataset and perform survival modelling
How:
- Merged the cohort, events, and baseline features
- Produced Kaplan–Meier survival curves (overall & by treatment class)
- Fit a Cox proportional‑hazards model using age, sex, LDH, and drug flags
Output:
lung_pfs.csv: final analytic table (3,204 patients with follow-up)
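To make the Kaplan–Meier step concrete, here is a dependency-free sketch of the product-limit estimator on toy data. It is meant only to illustrate the calculation; the analysis itself would more plausibly rely on a survival library, and the Cox model is not reproduced here.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient (e.g. PFS in days)
    events:    1 if the event was observed, 0 if censored
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        observed = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        leaving = sum(1 for d in durations if d == t)
        if observed:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        # Everyone with duration t (event or censored) leaves the risk set.
        at_risk -= leaving
    return curve

# Toy data: five patients, two of them censored.
curve = kaplan_meier([30, 60, 60, 90, 120], [1, 1, 0, 0, 1])
```

Censored patients never lower the curve themselves, but they shrink the risk set, so later events produce larger drops.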
Pipeline Overview
SQL cohort ─┐
            ├──► 1 Export lung_radiology.csv ──► 2 NLP inference (DFCI-imaging) ─┐
            ├──► 1 Export lung_discharge.csv ──► 2 NLP inference (DFCI-medonc) ──┤
            │                                                                    │
            └──► 3 Baseline extract (age/sex/LDH) ───────────────────────────────┤
                                                                                 ▼
                                                            4 Derive events ──► lung_events.csv
                                                            5 Drug-class flags ─► lung_drug_flags.csv
                                                            6 Merge ─► lung_pfs.csv
                                                            7 KM + Cox plots ─► notebook / report
Ongoing
6 · Toward automated cancer staging
Goal:
Use the metastasis predictions (brain, bone, liver, etc.) from radiology notes to approximate cancer staging
How:
- Combined site-specific metastasis probabilities into a composite “extent of disease” score
- Mapped these findings to simplified staging categories (e.g., localized vs metastatic)
Why:
Creates a bridge between raw NLP outputs and clinically meaningful staging estimates
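One possible way to form such a composite score is a noisy-OR over the seven site probabilities, followed by a single cut-off. Both the combination rule and the 0.5 cut-off are assumptions chosen for illustration, not the method used in this project.

```python
import math

SITES = ["brain", "bone", "liver", "adrenal", "lung", "lymph_nodes", "peritoneum"]

def extent_of_disease(site_probs):
    """Noisy-OR: probability that at least one site is involved,
    treating the site predictions as independent (an assumption)."""
    return 1.0 - math.prod(1.0 - site_probs.get(site, 0.0) for site in SITES)

def simplified_stage(site_probs, cutoff=0.5):
    """Map the composite score to a coarse category (cutoff is hypothetical)."""
    return "metastatic" if extent_of_disease(site_probs) >= cutoff else "localized"

stage_a = simplified_stage({"brain": 0.9})
stage_b = simplified_stage({"bone": 0.1, "liver": 0.1})
```

A noisy-OR rises when several sites each show moderate probability, whereas taking the maximum would react only to the single strongest site; which behavior is preferable is a clinical judgment.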
7 · Predicting early progression
Goal:
Train ML models to predict which patients will progress within 90–180 days of their index date
How:
- Features: demographics, LDH, drug flags, and aggregated Kehl model outputs (max/mean metastasis & progression probabilities from early notes)
- Models: XGBoost and regularized logistic regression
Why:
Tests whether progression risk can be predicted proactively, which could also help match patients to appropriate clinical trials
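Before any model is fitted, each patient needs a binary target and note-level probabilities collapsed into features. The sketch below shows one way to do both. The handling of patients censored before the horizon (returned as None so they can be excluded) is an assumption, as are the function names.

```python
from statistics import mean

def early_progression_label(pfs_days, event_observed, horizon_days=90):
    """1 = progressed within the horizon,
    0 = followed past the horizon without progressing by then,
    None = censored before the horizon, so the outcome is unknown."""
    if event_observed and pfs_days <= horizon_days:
        return 1
    if pfs_days > horizon_days:
        return 0
    return None

def aggregate_note_probs(probs):
    """Collapse per-note probabilities from early notes into max/mean features."""
    if not probs:
        return {"max": None, "mean": None}
    return {"max": max(probs), "mean": mean(probs)}

labels = [
    early_progression_label(45, 1),   # progressed on day 45
    early_progression_label(200, 1),  # progressed, but after the horizon
    early_progression_label(50, 0),   # censored on day 50
    early_progression_label(120, 0),  # censored after the horizon
]
features = aggregate_note_probs([0.2, 0.6, 0.4])
```

Dropping the None cases avoids labelling patients as non-progressors when they were simply lost to follow-up early.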
8 · Exploratory patient embeddings
Goal:
Represent patients in a latent space based on their clinical profiles to uncover hidden similarity patterns
How:
- Used dimensionality reduction / embedding techniques to encode labs, treatment flags, and derived probabilities from notes into a low‑dimensional patient representation
- Visualized embeddings to explore clusters of patients with similar progression patterns or treatment exposures
Diagram
               ┌─────────────────────────────┐
               │    Core Pipeline Output     │
               │       (lung_pfs.csv)        │
               └─────────────┬───────────────┘
                             │
          ┌──────────────────┼──────────────────┐
          │                  │                  │
          ▼                  ▼                  ▼
┌───────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│ Automated     │ │ Early Progression   │ │ Patient Embeddings  │
│ Staging       │ │ Prediction          │ │ (Clinical Similarity│
│ (metastasis → │ │ (XGBoost / Logistic │ │ & Phenotyping)      │
│ stage)        │ │ Regression)         │ └─────────────────────┘
└───────────────┘ └─────────────────────┘