Lung Cancer Progression and Outcomes


Work in progress: the analysis outputs, notebook, and data files are still being prepared.


This case study explores progression and outcomes in lung cancer patients.

It combines structured data (labs, drugs, demographics) with AI-extracted events from free-text notes to build progression-free survival (PFS) metrics and survival models:

1 · Build the lung cancer cohort

Purpose: Identify all patients with a primary lung cancer diagnosis

How:

  • Selected admissions with ICD‑10 codes C34* or ICD‑9 codes 162* (malignant neoplasm of the lung)
  • Took the first hospital encounter meeting this definition as the “start point” for follow‑up (index date)

Output:

  • lung_cancer_cohort.csv — one row per patient (4 220 patients)
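
The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual query: the field names (`patient_id`, `icd_code`, `icd_version`, `admit_date`) and the toy rows are assumptions made for the example.

```python
from datetime import date

def is_lung_cancer_code(code: str, version: int) -> bool:
    """True for ICD-10 C34* or ICD-9 162* codes (dots and case are ignored)."""
    normalized = code.replace(".", "").strip().upper()
    if version == 10:
        return normalized.startswith("C34")
    if version == 9:
        return normalized.startswith("162")
    return False

def build_cohort(diagnoses):
    """Map each patient to the earliest qualifying admission date (the index date)."""
    index_dates = {}
    for row in diagnoses:
        if not is_lung_cancer_code(row["icd_code"], row["icd_version"]):
            continue
        pid = row["patient_id"]
        if pid not in index_dates or row["admit_date"] < index_dates[pid]:
            index_dates[pid] = row["admit_date"]
    return index_dates

# Toy rows; the field names are hypothetical, not the real schema.
diagnoses = [
    {"patient_id": 1, "icd_code": "C34.90", "icd_version": 10, "admit_date": date(2020, 3, 1)},
    {"patient_id": 1, "icd_code": "C3490", "icd_version": 10, "admit_date": date(2019, 11, 5)},
    {"patient_id": 2, "icd_code": "1629", "icd_version": 9, "admit_date": date(2012, 6, 20)},
    {"patient_id": 3, "icd_code": "I10", "icd_version": 10, "admit_date": date(2021, 1, 2)},
]
cohort = build_cohort(diagnoses)
```

Checking the ICD version alongside the prefix keeps an ICD-9 prefix from being matched against ICD-10 codes, and the reverse.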

2 · Extract progression / response events

Purpose: Detect when imaging or clinical notes first report disease progression or treatment response.

How:

  • Applied fine‑tuned clinical NLP models to radiology reports and discharge summaries (Kehl et al., 2024)
  • Models produce per‑note probabilities for progression, response, and metastases at seven sites (brain, bone, liver, adrenal, lung, lymph nodes, peritoneum).
  • Flagged the earliest note where progression/response probability ≥ 0.50

Outputs:

  • mimic_lung_radiology_preds.csv
  • mimic_lung_discharge_preds.csv
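
The flagging step might look like the sketch below. It assumes each prediction row carries a `note_date` plus one probability per label; those field names are illustrative and are not taken from the prediction files.

```python
from datetime import date

THRESHOLD = 0.50  # cut-off stated above

def earliest_flagged_note(notes, label, threshold=THRESHOLD):
    """Date of the earliest note whose `label` probability is >= threshold, else None."""
    flagged = [n["note_date"] for n in notes if n[label] >= threshold]
    return min(flagged) if flagged else None

# Toy predictions for one patient (hypothetical values).
notes = [
    {"note_date": date(2020, 1, 10), "progression": 0.12, "response": 0.08},
    {"note_date": date(2020, 4, 2), "progression": 0.50, "response": 0.05},
    {"note_date": date(2020, 2, 15), "progression": 0.81, "response": 0.03},
]
first_progression = earliest_flagged_note(notes, "progression")
first_response = earliest_flagged_note(notes, "response")
```

Taking the minimum date, rather than the first row, means the result does not depend on how the notes happen to be ordered in the file.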

3 · Derive progression‑free survival (PFS)

Purpose: Quantify how long patients remain progression‑free after their index date

How:

  • Start = index date
  • Event = earliest progression or response note
  • Censor = date of the last available note (if no event observed)
  • Calculated PFS in days for each patient

Output:

  • lung_events.csv — includes event dates and PFS durations
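
As a rough sketch, the start, event, and censoring rules above reduce to one small function. The function name and the choice to ignore event notes dated before the index date are assumptions made for this example.

```python
from datetime import date

def derive_pfs(index_date, event_dates, last_note_date):
    """Return (pfs_days, event_observed) for one patient.

    event_dates holds the earliest progression and response dates (None if absent).
    Assumption: events dated before the index date are ignored.
    """
    candidates = [d for d in event_dates if d is not None and d >= index_date]
    if candidates:
        return (min(candidates) - index_date).days, True
    # No event observed: censor at the last available note.
    return (last_note_date - index_date).days, False

with_event = derive_pfs(date(2020, 1, 1), [date(2020, 3, 1), None], date(2021, 1, 1))
censored = derive_pfs(date(2020, 1, 1), [None, None], date(2020, 12, 31))
```

Returning the event flag together with the duration keeps censored patients distinguishable from patients with an event, which the survival models downstream rely on.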

4 · Generate baseline features

Purpose: Add contextual patient characteristics for risk modelling

How:

  • Demographics: age and sex (from patients)
  • Biomarkers: first LDH (lactate dehydrogenase) within ±7 days of index date (tumor burden marker)
  • Treatment flags: checked prescriptions for platinum chemo, TKIs, and checkpoint inhibitors (regex‑based drug mapping)

Output:

  • lung_baseline.csv — one row per patient with these features
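
The LDH window and the regex-based drug mapping could be sketched as follows. The drug lists are illustrative examples of each class, not the mapping used in the pipeline, and the function names are hypothetical.

```python
import re
from datetime import date, timedelta

# Illustrative patterns only; the real mapping may cover more agents.
DRUG_PATTERNS = {
    "platinum": re.compile(r"\b(cisplatin|carboplatin)\b", re.IGNORECASE),
    "tki": re.compile(r"\b(erlotinib|gefitinib|afatinib|osimertinib|crizotinib|alectinib)\b", re.IGNORECASE),
    "checkpoint_inhibitor": re.compile(r"\b(pembrolizumab|nivolumab|atezolizumab|durvalumab)\b", re.IGNORECASE),
}

def first_ldh_near_index(labs, index_date, window_days=7):
    """Earliest LDH value measured within +/- window_days of the index date, else None.

    labs is a list of (measurement_date, value) pairs.
    """
    window = timedelta(days=window_days)
    in_window = [(d, v) for d, v in labs if abs(d - index_date) <= window]
    return min(in_window, key=lambda pair: pair[0])[1] if in_window else None

def drug_flags(prescriptions):
    """One boolean per drug class: True if any prescription string matches."""
    return {name: any(pattern.search(rx) for rx in prescriptions)
            for name, pattern in DRUG_PATTERNS.items()}

labs = [(date(2020, 1, 20), 410.0), (date(2019, 12, 28), 250.0), (date(2020, 1, 3), 300.0)]
ldh = first_ldh_near_index(labs, date(2020, 1, 1))
flags = drug_flags(["Carboplatin 450 mg IV", "Pembrolizumab 200 mg"])
```

Word boundaries in the patterns avoid matching a drug name embedded in a longer token, and the case-insensitive flag absorbs inconsistent capitalization in prescription text.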

5 · Merge & analyze

Purpose: Build a single analytic dataset and perform survival modelling

How:

  • Merged the cohort, events, and baseline features
  • Produced Kaplan–Meier survival curves (overall & by treatment class)
  • Fit a Cox proportional‑hazards model using age, sex, LDH, and drug flags

Output:

  • lung_pfs.csv — final analytic table (3 204 patients with follow‑up)
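
To show what the Kaplan–Meier step computes, here is a dependency-free sketch of the product-limit estimator. It is an illustration only; the source does not say which software produced the curves, and the durations below are made up.

```python
def kaplan_meier(durations, events):
    """Product-limit estimate: [(time, survival_probability)] at each event time.

    durations: follow-up in days; events: True if the event was observed,
    False if the patient was censored at that time.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        n_events = sum(1 for d, e in zip(durations, events) if d == t and e)
        n_leaving = sum(1 for d in durations if d == t)
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        # Everyone with duration t (event or censored) leaves the risk set.
        at_risk -= n_leaving
    return curve

# Made-up PFS durations (days) and event indicators.
durations = [30, 60, 60, 90, 120]
events = [True, True, False, True, False]
curve = kaplan_meier(durations, events)
```

On this toy input the curve steps down to roughly 0.8, 0.6, and 0.3 at days 30, 60, and 90. In practice a survival library would normally be used for both the curves and the Cox model; lifelines, with its `KaplanMeierFitter` and `CoxPHFitter` classes, is one option, though that choice is an assumption here.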

Pipeline Overview

SQL cohort ─┐
            ├──► 1 Export lung_radiology.csv ──► 2 NLP inference (DFCI-imaging) ─┐
            ├──► 1 Export lung_discharge.csv ──► 2 NLP inference (DFCI-medonc) ──┤
            │                                                                    │
            └──► 3 Baseline extract (age/sex/LDH) ───────────────────────────────┤

4 Derive events    ──► lung_events.csv
5 Drug-class flags ──► lung_drug_flags.csv
6 Merge            ──► lung_pfs.csv
7 KM + Cox plots   ──► notebook / report

Ongoing


6 · Toward automated cancer staging

Goal:
Use the metastasis predictions (brain, bone, liver, etc.) from radiology notes to approximate cancer staging

How:

  • Combined site-specific metastasis probabilities into a composite “extent of disease” score
  • Mapped these findings to simplified staging categories (e.g., localized vs metastatic)

Why:
Creates a bridge between raw NLP outputs and clinically meaningful staging estimates
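
One possible way to combine the site probabilities is a noisy-OR, sketched below. The source does not specify the combination rule, so the noisy-OR (which treats sites as independent), the 0.5 cut-off, and the function names are all assumptions for illustration.

```python
import math

SITES = ("brain", "bone", "liver", "adrenal", "lung", "lymph_nodes", "peritoneum")

def extent_of_disease(site_probs):
    """Noisy-OR: probability that at least one site is involved.

    Assumes site predictions are independent, which is a simplification.
    Missing sites are treated as probability 0.
    """
    return 1.0 - math.prod(1.0 - site_probs.get(site, 0.0) for site in SITES)

def simplified_stage(site_probs, cutoff=0.5):
    """Map the composite score to a two-level category (hypothetical cut-off)."""
    return "metastatic" if extent_of_disease(site_probs) >= cutoff else "localized"

stage_a = simplified_stage({"brain": 0.5, "bone": 0.5})
stage_b = simplified_stage({"liver": 0.1})
```

Taking the maximum site probability instead would be a simpler alternative that does not rely on the independence assumption.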

7 · Predicting early progression

Goal:
Train ML models to predict which patients will progress within 90–180 days of their index date

How:

  • Features: demographics, LDH, drug flags, and aggregated Kehl model outputs (max/mean metastasis & progression probabilities from early notes)
  • Models: XGBoost and regularized logistic regression

Why:
Tests whether progression risk can be predicted proactively, which could also help match patients to appropriate clinical trials
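
Before any model is trained, the note-level probabilities have to be aggregated per patient and a binary label derived from PFS. The sketch below shows one way to do both; the 30-day "early notes" window, the 180-day horizon, and the handling of patients censored before the horizon are assumptions, not choices documented in the source.

```python
from datetime import date, timedelta
from statistics import mean

def aggregate_early_notes(notes, index_date, label="progression", early_days=30):
    """Max and mean of one label's probabilities over notes in the early window."""
    cutoff = index_date + timedelta(days=early_days)
    probs = [n[label] for n in notes if index_date <= n["note_date"] <= cutoff]
    if not probs:
        return {f"{label}_max": None, f"{label}_mean": None}
    return {f"{label}_max": max(probs), f"{label}_mean": mean(probs)}

def early_progression_label(pfs_days, event_observed, horizon_days=180):
    """1 if the event occurred within the horizon, 0 if the patient was followed
    past the horizon without it, None if censored before the horizon (unknown)."""
    if event_observed and pfs_days <= horizon_days:
        return 1
    if pfs_days >= horizon_days:
        return 0
    return None

notes = [
    {"note_date": date(2020, 1, 5), "progression": 0.2},
    {"note_date": date(2020, 1, 20), "progression": 0.6},
    {"note_date": date(2020, 3, 1), "progression": 0.9},  # outside the early window
]
features = aggregate_early_notes(notes, date(2020, 1, 1))
labels = [early_progression_label(100, True), early_progression_label(200, True),
          early_progression_label(200, False), early_progression_label(100, False)]
```

Returning None for patients censored before the horizon lets them be dropped from training rather than being silently counted as non-progressors.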

8 · Exploratory patient embeddings

Goal:
Represent patients in a latent space based on their clinical profiles to uncover hidden similarity patterns

How:

  • Used dimensionality reduction / embedding techniques to encode labs, treatment flags, and derived probabilities from notes into a low‑dimensional patient representation
  • Visualized embeddings to explore clusters of patients with similar progression patterns or treatment exposures

Diagram

                   ┌─────────────────────────────┐
                   │    Core Pipeline Output     │
                   │       (lung_pfs.csv)        │
                   └──────────────┬──────────────┘
                                  │
         ┌────────────────────────┼───────────────────────────┐
         │                        │                           │
         ▼                        ▼                           ▼
┌─────────────────┐   ┌───────────────────────┐   ┌───────────────────────┐
│ Automated       │   │ Early Progression     │   │ Patient Embeddings    │
│ Staging         │   │ Prediction            │   │ (Clinical Similarity  │
│ (metastasis →   │   │ (XGBoost / Logistic   │   │ & Phenotyping)        │
│ stage)          │   │ Regression)           │   │                       │
└─────────────────┘   └───────────────────────┘   └───────────────────────┘