Oncology relational dataset derived from MIMIC-IV


This project presents a curated, oncology-focused relational dataset derived from the MIMIC-IV v3.1 clinical database, one of the largest publicly available collections of longitudinal, patient-level hospital data from the Beth Israel Deaconess Medical Center.

While MIMIC-IV is not an oncology cohort and contains a limited number of cancer patients relative to dedicated registries, it provides a rich clinical environment for exploring how data can be structured, queried, and integrated at scale.

The dataset isolates and organises clinical signals relevant to cancer patients to support exploratory analysis, visualisation, and methods-focused machine learning workflows. It is designed as a technical foundation for bridging routine clinical data (MIMIC-IV) with external molecular resources (e.g. TCGA).

Workflow


Raw Clinical Tables → Oncology Filtering → Signal Selection → Derived Oncology Cohort → Dataset


  1. Extract core clinical tables
    Imported key MIMIC-IV tables (patients, admissions, diagnoses_icd, icustays, prescriptions, labevents) into a local PostgreSQL environment

  2. Identify oncology-relevant records
    Filtered diagnoses using cancer-related ICD-9 (140–239) and ICD-10 (C00–D49) codes to define the clinical scope of the dataset

  3. Select relevant laboratory and treatment signals
    Retained laboratory variables such as blood counts, liver enzymes, and selected tumour markers, together with medications linked to chemotherapy, immunotherapy, or hormonal treatment

  4. Define a derived oncology cohort
    Built an ICD-based cohort table to enable efficient filtering and reproducible patient-level selection across downstream analyses

  5. Create supporting layers
    Added derived views for treatment timelines and labelled laboratory values to facilitate longitudinal interpretation and exploratory modelling

  6. Optimise for querying and reuse
    Applied indexes and materialised views to support interactive querying, dashboarding, and future analytical extensions
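
The ICD-based filtering and cohort definition in steps 2 and 4 can be sketched in Python as follows. This is a minimal illustration, not the project's actual ETL code: the helper names `is_neoplasm_code` and `build_cohort` are hypothetical, the rows are synthetic, and it assumes codes are stored as in MIMIC-IV's diagnoses_icd table (an `icd_code` string without dots plus an `icd_version` of 9 or 10).

```python
from typing import Iterable, NamedTuple


class Diagnosis(NamedTuple):
    """One row shaped like diagnoses_icd (only the columns used here)."""
    subject_id: int
    hadm_id: int
    icd_code: str
    icd_version: int


def is_neoplasm_code(icd_code: str, icd_version: int) -> bool:
    """Return True if the code falls in ICD-9 140-239 or ICD-10 C00-D49."""
    # Normalise: drop padding and dots, so "162.9" and "1629 " compare alike.
    code = icd_code.strip().upper().replace(".", "")
    prefix = code[:3]
    if len(prefix) < 3:
        return False
    if icd_version == 9:
        # V and E codes have a non-numeric prefix and are excluded here.
        return prefix.isdigit() and 140 <= int(prefix) <= 239
    if icd_version == 10:
        # Three-character ICD-10 categories sort lexicographically, so a
        # string range check also covers categories such as C7A and D3A.
        return "C00" <= prefix <= "D49"
    return False


def build_cohort(rows: Iterable[Diagnosis]) -> list[int]:
    """Return sorted, de-duplicated subject_ids with at least one neoplasm code."""
    return sorted(
        {r.subject_id for r in rows if is_neoplasm_code(r.icd_code, r.icd_version)}
    )


# Synthetic rows for illustration only; not real MIMIC-IV records.
rows = [
    Diagnosis(1, 100, "1629", 9),    # ICD-9 162.9, inside 140-239
    Diagnosis(2, 200, "4019", 9),    # ICD-9 401.9, outside the range
    Diagnosis(3, 300, "C509", 10),   # ICD-10 C50.9, inside C00-D49
    Diagnosis(4, 400, "D500", 10),   # ICD-10 D50.0, just past D49
    Diagnosis(1, 101, "V1005", 9),   # V code, excluded by this rule
]
cohort = build_cohort(rows)
print(cohort)  # [1, 3]
```

Both ranges cover the full neoplasm chapters, so a cohort built this way includes benign and uncertain-behaviour neoplasms as well as malignant ones. A stricter malignancy-only cohort would need narrower ranges.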

Metadata


  • Source: MIMIC-IV v3.1
  • Institution: Beth Israel Deaconess Medical Center (2008–2022)
  • Modules Used: hosp, icu
  • Programming: PostgreSQL, Python (ETL), SQL (views and indexing)
  • Focus: Oncology-relevant EHR subset
  • Primary Cohort Definition: Patients with ICD-9/10 neoplasm codes
  • Compliance: PhysioNet DUA 1.5.0, HIPAA-deidentified, time-shifted

Core Schema Structure


The dataset is organised into three complementary table categories:

| Category | Description | Examples |
| --- | --- | --- |
| Main Tables | Core entities: patients, admissions, ICU stays, cohort flags | oncology_patients, oncology_admissions, oncology_icustays, oncology_cohort |
| Fact Tables | Event-level clinical records: diagnoses, labs, prescriptions, medication administration | oncology_diagnoses, oncology_labs, oncology_prescriptions, oncology_emar_detail |
| Reference Tables | Dictionaries and derived views for enrichment and summarisation | oncology_icd_dict, oncology_labs_with_labels, oncology_emar_detail_with_times, oncology_treatment_windows |

Figure: Oncology-focused MIMIC-IV schema
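
To make the relationship between the three table categories concrete, the sketch below builds a toy version of one main table, one fact table, and one derived view. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL. The column names are assumptions modelled on MIMIC-IV's labevents table, `lab_dict` is a hypothetical dictionary table, and the item IDs and values are synthetic placeholders rather than real MIMIC-IV content.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Main table: one row per cohort patient.
    CREATE TABLE oncology_cohort (subject_id INTEGER PRIMARY KEY);

    -- Fact table: one row per laboratory measurement.
    CREATE TABLE oncology_labs (
        subject_id INTEGER REFERENCES oncology_cohort (subject_id),
        itemid     INTEGER,
        charttime  TEXT,
        valuenum   REAL
    );

    -- Reference table: hypothetical lab dictionary (itemid -> label).
    CREATE TABLE lab_dict (itemid INTEGER PRIMARY KEY, label TEXT);

    -- Derived view: fact rows enriched with readable labels.
    CREATE VIEW oncology_labs_with_labels AS
    SELECT l.subject_id, l.charttime, d.label, l.valuenum
    FROM oncology_labs AS l
    JOIN oncology_cohort AS c ON c.subject_id = l.subject_id
    JOIN lab_dict AS d ON d.itemid = l.itemid;

    -- Index on the columns most queries are likely to filter by.
    CREATE INDEX idx_labs_subject_time ON oncology_labs (subject_id, charttime);
""")

# Synthetic rows for illustration only.
conn.executemany("INSERT INTO oncology_cohort VALUES (?)", [(1,), (3,)])
conn.executemany(
    "INSERT INTO lab_dict VALUES (?, ?)",
    [(9001, "Hemoglobin"), (9002, "ALT")],
)
conn.executemany(
    "INSERT INTO oncology_labs VALUES (?, ?, ?, ?)",
    [
        (1, 9001, "2150-01-02 08:00:00", 11.2),
        (1, 9002, "2150-01-02 08:00:00", 35.0),
        (3, 9001, "2151-06-10 07:30:00", 9.8),
    ],
)

labelled = conn.execute(
    "SELECT subject_id, charttime, label, valuenum "
    "FROM oncology_labs_with_labels ORDER BY subject_id, label"
).fetchall()
for row in labelled:
    print(row)
```

In PostgreSQL the same view could be declared as a materialised view if it is queried often. SQLite has no equivalent, so the sketch uses a plain view.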

Use Cases


  • Track cancer patients longitudinally (laboratory values, medications, ICU exposure)
  • Prototype outcome or response modelling pipelines (methodological scope)
  • Build dashboards and exploratory clinical analytics
  • Practise reproducible clinical data engineering and SQL workflows
  • Integrate clinical trajectories with external genomics resources (e.g. TCGA)
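
As an illustration of the longitudinal use case, the sketch below collapses prescription rows into per-patient, per-drug treatment windows. It is one plausible way to derive a treatment timeline, not necessarily how oncology_treatment_windows is defined. The function name `treatment_windows`, the one-day merge gap, and the input tuple layout (modelled on the `subject_id`, `drug`, `starttime`, and `stoptime` columns of MIMIC-IV's prescriptions table) are all assumptions, and the rows are synthetic.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def treatment_windows(prescriptions, max_gap_days=1):
    """Merge prescription intervals per (subject_id, drug) into windows.

    `prescriptions` is an iterable of (subject_id, drug, starttime, stoptime)
    tuples with datetime values. Intervals that overlap, or are separated by
    at most `max_gap_days`, are merged into a single window.
    """
    grouped = defaultdict(list)
    for subject_id, drug, start, stop in prescriptions:
        grouped[(subject_id, drug)].append((start, stop))

    max_gap = timedelta(days=max_gap_days)
    windows = []
    for (subject_id, drug), intervals in sorted(grouped.items()):
        intervals.sort()
        cur_start, cur_stop = intervals[0]
        for start, stop in intervals[1:]:
            if start - cur_stop <= max_gap:
                # Close enough to the current window: extend it.
                cur_stop = max(cur_stop, stop)
            else:
                # Gap too large: close the current window and start a new one.
                windows.append((subject_id, drug, cur_start, cur_stop))
                cur_start, cur_stop = start, stop
        windows.append((subject_id, drug, cur_start, cur_stop))
    return windows


# Synthetic rows for illustration only; not real MIMIC-IV records.
rx = [
    (1, "cisplatin", datetime(2150, 1, 2), datetime(2150, 1, 4)),
    (1, "cisplatin", datetime(2150, 1, 4), datetime(2150, 1, 6)),   # contiguous
    (1, "cisplatin", datetime(2150, 2, 1), datetime(2150, 2, 3)),   # new cycle
    (3, "tamoxifen", datetime(2151, 6, 10), datetime(2151, 6, 20)),
]
windows = treatment_windows(rx)
for w in windows:
    print(w)
```

Because MIMIC-IV dates are shifted per patient, windows like these are meaningful as intervals within one patient's record but should not be compared across patients as calendar dates.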