
Oncology-focused relational dataset derived from MIMIC


This project provides a curated, oncology-focused relational dataset derived from the MIMIC-IV v3.1 clinical database, one of the largest publicly available medical datasets, which contains detailed patient-level information from the Beth Israel Deaconess Medical Center. It isolates and structures information relevant to cancer patients to support downstream analysis, visualization, and machine learning workflows, and serves as a foundation for linking clinical data (MIMIC-IV) with molecular profiles (e.g., TCGA).

  • Extracted key tables from MIMIC-IV (patients, admissions, diagnoses_icd, icustays, prescriptions, labevents) into a local PostgreSQL database
  • Filtered all data to focus on oncology-relevant fields:
    • Diagnoses using cancer-related ICD-9 (140–239) and ICD-10 (C00–D49) codes
    • Lab values including blood counts, liver enzymes, and tumor markers
    • Medications relevant to chemotherapy, immunotherapy, or hormonal therapy
  • Defined a derived oncology cohort from the ICD code ranges listed above, so later queries can filter on cohort membership directly
  • Built supporting views for treatment timelines and labeled lab test values
  • Optimized query performance with indexes and materialized views
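
The ICD-based cohort filter described in the list above can be sketched in Python, the project's ETL language. This is a minimal sketch rather than the project's actual code: the function name `is_neoplasm_code` is hypothetical, and it assumes codes are stored without decimal points next to an integer version (9 or 10), as in the MIMIC-IV diagnoses_icd table.

```python
def is_neoplasm_code(icd_code: str, icd_version: int) -> bool:
    """Return True if a code falls in ICD-9 140-239 or ICD-10 C00-D49.

    Hypothetical helper for illustration; the range check uses only the
    three-character category at the start of the code.
    """
    code = icd_code.strip().upper().replace(".", "")
    if len(code) < 3:
        return False
    prefix = code[:3]

    if icd_version == 9:
        # ICD-9 V and E codes are non-numeric, so they are excluded here.
        return prefix.isdecimal() and 140 <= int(prefix) <= 239

    if icd_version == 10:
        if prefix[0] == "C":
            # Every C category lies inside C00-D49.
            return True
        # D00-D49: second character must be a digit from 0 to 4.
        return prefix[0] == "D" and prefix[1] in "01234"

    return False


if __name__ == "__main__":
    samples = [("1629", 9), ("4019", 9), ("C3490", 10), ("D500", 10)]
    for code, version in samples:
        print(code, version, is_neoplasm_code(code, version))
```

Checking only the three-character category keeps the rule easy to translate into a SQL predicate on the code prefix. Whether the project's cohort logic also treats history-of-cancer codes (for example ICD-9 V codes) as oncology-relevant is not stated, so this sketch leaves them out.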

Metadata


  • Source: MIMIC-IV v3.1
  • Institution: Beth Israel Deaconess Medical Center (2008–2022)
  • Modules Used: hosp, icu
  • Programming: PostgreSQL, Python (ETL), SQL (views/indexes)
  • Focus: Oncology-relevant EHR subset
  • Derived From: Diagnoses, labs, prescriptions, ICU stays
  • Primary Cohort Definition: Patients with ICD-9/10 neoplasm codes
  • Compliance: PhysioNet DUA 1.5.0, HIPAA-deidentified, time-shifted

Core Schema Structure


The dataset is organized into three broad types of tables:

| Category | Description | Examples |
|---|---|---|
| Main Tables | Core entities: patient demographics, admissions, ICU stays, cohort flag | oncology_patients, oncology_admissions, oncology_icustays, oncology_cohort |
| Fact Tables | Event-level records: diagnoses, labs, prescriptions, EMAR (med admin records) | oncology_diagnoses, oncology_labs, oncology_prescriptions, oncology_emar_detail |
| Reference Tables | Dictionaries or derived views for enrichment or summarization | oncology_icd_dict, oncology_labs_with_labels, oncology_emar_detail_with_times, oncology_treatment_windows |

Oncology-Focused MIMIC-IV Schema
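
To illustrate how a fact table, a dictionary, and a derived view fit together, here is a small self-contained sketch. Several things in it are assumptions: SQLite (via Python's standard library) stands in for PostgreSQL, the `lab_dict` table and the index name are hypothetical, the columns are modeled on MIMIC-IV's labevents and d_labitems, and the rows are made-up illustrative values, not MIMIC data. The real definition of oncology_labs_with_labels may differ.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one fact table, one dictionary, one view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE oncology_labs (
            subject_id INTEGER,
            itemid     INTEGER,
            charttime  TEXT,
            valuenum   REAL
        );
        -- Hypothetical dictionary table mapping lab item ids to labels.
        CREATE TABLE lab_dict (itemid INTEGER PRIMARY KEY, label TEXT);
        -- Index to support per-patient, time-ordered lookups.
        CREATE INDEX idx_labs_subject_time
            ON oncology_labs (subject_id, charttime);
        -- Assumed shape of the labeled-labs view.
        CREATE VIEW oncology_labs_with_labels AS
            SELECT l.subject_id, l.charttime, d.label, l.valuenum
            FROM oncology_labs AS l
            JOIN lab_dict AS d ON d.itemid = l.itemid;
        """
    )
    conn.executemany(
        "INSERT INTO lab_dict VALUES (?, ?)",
        [(1, "Hemoglobin"), (2, "ALT")],
    )
    # Made-up rows for illustration only.
    conn.executemany(
        "INSERT INTO oncology_labs VALUES (?, ?, ?, ?)",
        [
            (100, 1, "2150-01-01 08:00:00", 11.2),
            (100, 2, "2150-01-01 08:00:00", 35.0),
            (100, 1, "2150-01-03 08:00:00", 10.4),
            (200, 1, "2150-02-01 09:00:00", 13.0),
        ],
    )
    return conn


def lab_history(conn: sqlite3.Connection, subject_id: int) -> list:
    """Return one patient's labeled lab values ordered by time, then label."""
    query = (
        "SELECT charttime, label, valuenum "
        "FROM oncology_labs_with_labels "
        "WHERE subject_id = ? ORDER BY charttime, label"
    )
    return conn.execute(query, (subject_id,)).fetchall()


if __name__ == "__main__":
    for row in lab_history(build_demo_db(), 100):
        print(row)
```

In PostgreSQL the same view could be declared as a materialized view and refreshed after each load; SQLite has no materialized views, so the sketch uses a plain one.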

Use Cases


  • Track cancer patients longitudinally (lab values, medications, ICU stays)
  • Model treatment response or outcomes
  • Build dashboards or visualizations for clinical metrics
  • Practice clinical data engineering and SQL workflows
  • Integrate with genomics resources (e.g., TCGA) for multimodal analysis
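
For the longitudinal use case, one plausible way to turn medication administration times into treatment windows is to merge administrations separated by no more than a fixed gap. The sketch below is illustrative only: the function name `treatment_windows` and the 7-day default gap are assumptions, and the source does not say how oncology_treatment_windows is actually derived.

```python
from datetime import datetime, timedelta


def treatment_windows(times, max_gap_days=7):
    """Group administration times into (start, end) windows.

    Consecutive administrations no more than `max_gap_days` apart are
    merged into the same window. The gap threshold is an assumption.
    """
    max_gap = timedelta(days=max_gap_days)
    windows = []
    for t in sorted(times):
        if windows and t - windows[-1][1] <= max_gap:
            # Close enough to the previous administration: extend the window.
            windows[-1] = (windows[-1][0], t)
        else:
            # Gap too large (or first record): start a new window.
            windows.append((t, t))
    return windows


if __name__ == "__main__":
    # Made-up, time-shifted style dates for illustration.
    doses = [
        datetime(2150, 1, 1),
        datetime(2150, 1, 3),
        datetime(2150, 1, 20),
        datetime(2150, 1, 22),
    ]
    for start, end in treatment_windows(doses):
        print(start.date(), "to", end.date())
```

Because MIMIC-IV dates are shifted per patient, windows like these are meaningful within a single patient's timeline but should not be compared across patients by calendar date.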