Lung Cancer Progression and Outcomes


This case study presents an exploratory, end-to-end oncology analysis pipeline focused on modelling disease progression and outcomes in lung cancer patients using real-world clinical data.

It combines structured EHR data (laboratory values, medications, demographics) with AI-extracted clinical signals from free-text notes to derive progression-free survival (PFS) metrics and prototype survival models.
The emphasis is methodological: demonstrating how heterogeneous clinical data sources can be integrated into reproducible oncology analytics.

Note: Because the oncology population in MIMIC-IV is small and heterogeneous, this case study is not intended to support definitive clinical inference. It serves as a methodological demonstration of clinical NLP integration and survival-analysis workflows.


1 · Build the lung cancer cohort

Objective: Identify patients with a primary lung cancer diagnosis and define a reproducible start point for follow-up.

Method:

  • Selected admissions with ICD-10 codes C34* or ICD-9 codes 162* (malignant neoplasm of the lung)
  • Defined the first qualifying hospital encounter as the index date

Output:

  • lung_cancer_cohort.csv — one row per patient (4 220 patients)
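
The cohort rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the pipeline's actual SQL: the field names (`subject_id`, `admittime`, `icd_code`, `icd_version`) are assumptions modelled on MIMIC-IV-style tables, and the sample rows are invented.

```python
from datetime import datetime

# Hypothetical rows: one per (admission, diagnosis code), already joined
# to the admission time. Field names and values are illustrative only.
rows = [
    {"subject_id": 1, "admittime": "2150-03-02", "icd_code": "C3490", "icd_version": 10},
    {"subject_id": 1, "admittime": "2149-11-20", "icd_code": "1629", "icd_version": 9},
    {"subject_id": 2, "admittime": "2151-01-05", "icd_code": "I10", "icd_version": 10},
    {"subject_id": 3, "admittime": "2152-07-14", "icd_code": "162.3", "icd_version": 9},
]


def is_lung_cancer(code, version):
    """True for ICD-10 C34* or ICD-9 162* (dots removed before matching)."""
    code = code.replace(".", "").upper()
    if version == 10:
        return code.startswith("C34")
    if version == 9:
        return code.startswith("162")
    return False


def build_cohort(rows):
    """Return {subject_id: index_date}, using the earliest qualifying admission."""
    index_dates = {}
    for r in rows:
        if not is_lung_cancer(r["icd_code"], r["icd_version"]):
            continue
        admit = datetime.strptime(r["admittime"], "%Y-%m-%d").date()
        sid = r["subject_id"]
        if sid not in index_dates or admit < index_dates[sid]:
            index_dates[sid] = admit
    return index_dates


cohort = build_cohort(rows)
```

Checking the ICD version alongside the prefix avoids matching an ICD-10 code against the ICD-9 pattern or vice versa.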

2 · Extract progression / response events

Objective: Identify the earliest documented evidence of disease progression or treatment response.

Method:

  • Applied fine-tuned clinical NLP models to radiology reports and discharge summaries
    (Kehl et al., 2024)
  • Models output per-note probabilities for progression, response, and metastatic involvement
    (brain, bone, liver, adrenal, lung, lymph nodes, peritoneum)
  • Flagged the earliest note with a predicted probability ≥ 0.50

Outputs:

  • mimic_lung_radiology_preds.csv
  • mimic_lung_discharge_preds.csv
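
The flagging rule above (earliest note whose probability reaches 0.50) might look like the sketch below. The record layout and field names are assumptions for illustration, not the actual schema of the prediction files, and the sample values are invented.

```python
from datetime import date

THRESHOLD = 0.50  # cut-off stated in the method above

# Hypothetical per-note predictions; field names are assumptions.
notes = [
    {"subject_id": 1, "note_date": date(2150, 5, 1), "progression": 0.12, "response": 0.30},
    {"subject_id": 1, "note_date": date(2150, 8, 9), "progression": 0.81, "response": 0.05},
    {"subject_id": 1, "note_date": date(2150, 6, 3), "progression": 0.55, "response": 0.10},
    {"subject_id": 2, "note_date": date(2151, 2, 2), "progression": 0.20, "response": 0.49},
]


def earliest_flagged(notes, labels=("progression", "response"), threshold=THRESHOLD):
    """Map subject_id -> (note_date, label) for the first note with p >= threshold."""
    first = {}
    # Walk notes in chronological order so the first hit per patient wins.
    for n in sorted(notes, key=lambda n: n["note_date"]):
        for label in labels:
            if n[label] >= threshold and n["subject_id"] not in first:
                first[n["subject_id"]] = (n["note_date"], label)
    return first


events = earliest_flagged(notes)
```

Patients with no note at or above the threshold (subject 2 here) are simply absent from the result and would be treated as censored downstream.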

3 · Derive progression-free survival (PFS)

Objective: Quantify time from the index date to the first documented progression or response signal.

Method:

  • Start: index date
  • Event: earliest progression or response signal
  • Censoring: date of last available clinical note
  • Computed PFS in days per patient

Output:

  • lung_events.csv — event dates and PFS durations
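
The start, event, and censoring rules above reduce to a small per-patient calculation. The function below is a sketch under the assumption that any event date falls on or after the index date; the function name and return shape are illustrative choices, not taken from the pipeline.

```python
from datetime import date


def pfs_record(index_date, event_date, last_note_date):
    """Return (duration_days, event_observed) for one patient.

    event_date is None when no note crossed the threshold; the patient
    is then censored at the date of the last available clinical note.
    Assumes event_date, when present, is on or after index_date.
    """
    if event_date is not None:
        return (event_date - index_date).days, 1
    return (last_note_date - index_date).days, 0


# Example: one patient with an event, one censored patient.
with_event = pfs_record(date(2150, 1, 1), date(2150, 4, 11), date(2151, 1, 1))
censored = pfs_record(date(2150, 1, 1), None, date(2150, 1, 31))
```

Keeping the event indicator next to the duration matters because survival estimators need both values to handle censoring correctly.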

4 · Generate baseline features

Objective: Capture baseline patient characteristics relevant for exploratory risk modelling.

Method:

  • Demographics: age and sex
  • Biomarkers: first LDH measurement within ±7 days of index date
  • Treatment exposure: flags for platinum chemotherapy, TKIs, and checkpoint inhibitors
    (regex-based drug mapping)
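
The LDH rule in the list above (first measurement within ±7 days of the index date) could be sketched as follows. The input format, a list of `(chart_date, value)` pairs, is an assumption made for illustration.

```python
from datetime import date, timedelta


def baseline_ldh(index_date, measurements, window_days=7):
    """Return the earliest LDH value charted within +/- window_days of
    index_date, or None if no measurement falls inside the window.

    measurements: iterable of (chart_date, value) pairs (assumed layout).
    """
    lo = index_date - timedelta(days=window_days)
    hi = index_date + timedelta(days=window_days)
    in_window = [(d, v) for d, v in measurements if lo <= d <= hi]
    if not in_window:
        return None
    # "First" is read here as chronologically earliest within the window.
    return min(in_window, key=lambda dv: dv[0])[1]


ldh = baseline_ldh(
    date(2150, 3, 10),
    [
        (date(2150, 2, 20), 410.0),  # too early
        (date(2150, 3, 12), 255.0),  # in window
        (date(2150, 3, 5), 300.0),   # in window and earliest
        (date(2150, 3, 30), 190.0),  # too late
    ],
)
```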

Output:

  • lung_baseline.csv — baseline features per patient
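
A regex-based drug mapping of the kind mentioned above might look like this. The drug lists are a short, non-exhaustive illustration chosen for this sketch and are not the mapping used in the pipeline; the class keys are likewise hypothetical.

```python
import re

# Illustrative, non-exhaustive patterns per treatment class (assumption).
DRUG_CLASSES = {
    "platinum": re.compile(r"\b(cisplatin|carboplatin)\b", re.IGNORECASE),
    "tki": re.compile(
        r"\b(erlotinib|gefitinib|osimertinib|afatinib|crizotinib|alectinib)\b",
        re.IGNORECASE,
    ),
    "checkpoint_inhibitor": re.compile(
        r"\b(pembrolizumab|nivolumab|atezolizumab|durvalumab)\b", re.IGNORECASE
    ),
}


def drug_flags(drug_names):
    """Return {class: 0/1} given free-text medication names for one patient."""
    text = " ; ".join(drug_names)
    return {cls: int(bool(pat.search(text))) for cls, pat in DRUG_CLASSES.items()}


flags = drug_flags(["Carboplatin 450 mg IV", "pembrolizumab", "Ondansetron"])
```

Word boundaries (`\b`) keep a pattern from matching inside a longer token, and case-insensitive matching absorbs inconsistent capitalisation in prescription text.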

5 · Merge & analyse

Objective: Assemble a unified analytic dataset and perform exploratory survival analysis.

Method:

  • Merged cohort, event, and baseline feature tables
  • Generated Kaplan–Meier curves (overall and stratified by treatment class)
  • Fit a Cox proportional-hazards model using age, sex, LDH, and treatment flags

Output:

  • lung_pfs.csv — final analytic table (3 204 patients with follow-up)
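
To make the Kaplan–Meier step above concrete, here is a small pure-Python product-limit estimator. It is a sketch for illustration only: in practice a dedicated survival library would typically handle both the Kaplan–Meier curves and the Cox model (the Cox fit is not shown), and the toy durations and event indicators below are invented.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time.

    S(t) is the product over event times t_i <= t of (1 - d_i / n_i),
    where d_i is the number of events at t_i and n_i the number at risk.
    events: 1 = event observed, 0 = censored.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients sharing the same time point.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            surv *= 1 - n_events / at_risk
            curve.append((t, surv))
        # Both events and censored patients leave the risk set after t.
        at_risk -= n_leaving
    return curve


curve = kaplan_meier([5, 8, 8, 12], [1, 0, 1, 1])
```

A stratified curve, for example by treatment class, follows by calling the function once per subgroup.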

Pipeline Overview

SQL cohort ─┐
            ├──► 1 Export lung_radiology.csv ──► 2 NLP inference (DFCI-imaging) ─┐
            ├──► 1 Export lung_discharge.csv ──► 2 NLP inference (DFCI-medonc) ──┤
            │                                                                    │
            └──► 3 Baseline extract (age/sex/LDH) ───────────────────────────────┤

4 Derive events ───► lung_events.csv
5 Drug-class flags ► lung_drug_flags.csv
6 Merge ───────────► lung_pfs.csv
7 KM + Cox plots ──► notebook / report

Ongoing Extensions (Exploratory)


6 · Toward automated cancer staging

Goal:
Approximate disease stage using site-specific metastasis predictions extracted from radiology notes.

Approach:

  • Aggregated metastasis probabilities (brain, bone, liver, adrenal, lung, lymph nodes, peritoneum) into a composite “extent of disease” score
  • Mapped these signals to simplified staging categories (e.g. localised vs metastatic)

Rationale:
Bridges raw NLP-derived outputs with clinically interpretable staging proxies.
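
One possible reading of the composite score described above is sketched below. The aggregation rule (count the sites whose probability reaches 0.5, then call any positive count "metastatic") is purely an assumption for illustration; the text does not specify how the probabilities are combined or where the category boundaries lie.

```python
SITES = ("brain", "bone", "liver", "adrenal", "lung", "lymph_nodes", "peritoneum")


def extent_of_disease(site_probs, threshold=0.5):
    """Count sites whose probability meets the threshold (assumed rule).

    site_probs: {site: probability}, e.g. the maximum per-note probability
    for each site across a patient's radiology reports (assumed input).
    """
    return sum(1 for s in SITES if site_probs.get(s, 0.0) >= threshold)


def simplified_stage(score):
    """Map the composite score to a two-level staging proxy (assumed mapping)."""
    return "metastatic" if score >= 1 else "localised"


score = extent_of_disease({"brain": 0.82, "bone": 0.61, "liver": 0.10})
stage = simplified_stage(score)
```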

7 · Predicting early progression

Goal:
Identify patients likely to experience disease progression within 90–180 days of the index date.

Approach:

  • Features: demographics, baseline LDH, treatment flags, and aggregated NLP outputs
  • Models: XGBoost and regularised logistic regression

Rationale:
Explores feasibility of early risk stratification and supports hypothesis generation for clinical trial matching.
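
Before fitting a classifier, the PFS table has to be turned into a binary target. The sketch below shows one way to do that for a fixed horizon. The function name and the choice to drop patients censored before the horizon are assumptions made for illustration, not details stated in the text.

```python
def early_progression_label(duration_days, event_observed, horizon_days=180):
    """Return 1, 0, or None for one patient.

    1    -> event observed on or before the horizon
    0    -> followed event-free through the horizon
    None -> censored before the horizon, so the outcome is unknown
    """
    if event_observed and duration_days <= horizon_days:
        return 1
    if duration_days >= horizon_days:
        return 0
    return None


# (duration_days, event_observed) pairs, invented for illustration.
patients = [(60, 1), (200, 1), (365, 0), (100, 0)]
labels = [early_progression_label(d, e) for d, e in patients]
```

Treating early-censored patients as negatives would bias the target toward "no progression", which is why they are returned as `None` here and left for the caller to exclude.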

8 · Exploratory patient embeddings

Goal:
Represent patients in a latent clinical space to uncover similarity patterns across disease trajectories.

Approach:

  • Embedded laboratory values, treatment exposure, and NLP-derived clinical signals into a low-dimensional representation
  • Visualised patient clusters associated with progression dynamics and therapeutic exposure

Diagram

              ┌─────────────────────────────┐
              │    Core Pipeline Output     │
              │       (lung_pfs.csv)        │
              └──────────────┬──────────────┘
        ┌────────────────────┼───────────────────────┐
        │                    │                       │
        ▼                    ▼                       ▼
┌───────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│ Automated     │ │ Early Progression   │ │ Patient Embeddings  │
│ Staging       │ │ Prediction          │ │ (Clinical Similarity│
│ (metastasis → │ │ (XGBoost / Logistic │ │ & Phenotyping)      │
│ stage)        │ │ Regression)         │ └─────────────────────┘
└───────────────┘ └─────────────────────┘