
Integrating Kinase Bioactivity with Expression Profiles


This project integrates biomedical reference datasets and models them with semantic web technologies (RDF and a triple store).

As a proof of concept, it combines kinase-targeted bioactivity data from ChEMBL with normal-tissue (GTEx) and tumor (TCGA) gene expression profiles. The datasets are harmonized via UniProt ID mappings and exposed through a GraphDB triple store.

Pipeline Overview


Integration Pipeline


Diagram illustrating the modular RDF-based data integration pipeline. Each dataset (ChEMBL, GTEx, TCGA, and UniProt) is processed independently and exported to Turtle (TTL) format, then loaded into GraphDB as named graphs. These datasets are harmonized through ontology-aligned identifiers (e.g., UniProt ↔ Ensembl ↔ UBERON) to enable semantic joins across:

Compound → Kinase → Gene → Expression
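
As a rough illustration of the per-dataset export step, the sketch below writes two small sets of triples as separate Turtle documents that share one identifier. The `http://example.org/kg/` namespace, the predicate names (`targets`, `encodedBy`), and all identifiers are hypothetical placeholders, not the project's actual vocabulary, and loading the output into GraphDB as named graphs is not shown.

```python
# Hypothetical namespace; the project's real vocabulary is not given in the text.
PREFIX = "http://example.org/kg/"


def to_turtle(triples, prefix=PREFIX):
    """Serialize (subject, predicate, object) local names as a Turtle document."""
    lines = [f"@prefix ex: <{prefix}> .", ""]
    for subject, predicate, obj in triples:
        lines.append(f"ex:{subject} ex:{predicate} ex:{obj} .")
    return "\n".join(lines) + "\n"


# Toy triples, one list per source dataset (all identifiers are made up).
chembl_triples = [("CMPD_A", "targets", "UP_1")]
uniprot_triples = [("UP_1", "encodedBy", "GENE_1")]

# Each dataset is exported independently; each document would be loaded
# into its own named graph.
documents = {
    "chembl": to_turtle(chembl_triples),
    "uniprot": to_turtle(uniprot_triples),
}

print(documents["chembl"])
```

Because `UP_1` appears under the same IRI in both documents, a query over the combined graphs can follow it from a compound to a gene without any extra mapping step.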


The pipeline supports cross-dataset querying, allowing questions such as: "Which compounds target kinases that are overexpressed in kidney tumors but not in healthy tissue?"
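
The example question can be read as a join along the Compound → Kinase → Gene → Expression chain. The sketch below walks that chain over in-memory dictionaries as a stand-in for a SPARQL query against the triple store. All identifiers, expression values, and the two thresholds are invented for illustration, and "overexpressed" is reduced to a simple cutoff rather than a statistical test.

```python
# Toy stand-ins for the named graphs; every identifier and value is made up.
compound_targets = [  # ChEMBL-like: (compound, UniProt accession)
    ("CMPD_A", "UP_1"),
    ("CMPD_B", "UP_2"),
    ("CMPD_C", "UP_1"),
]
uniprot_to_gene = {"UP_1": "GENE_1", "UP_2": "GENE_2"}  # UniProt-like mapping

# Expression keyed by (gene, tissue), in arbitrary units.
normal_expr = {("GENE_1", "kidney"): 4.0, ("GENE_2", "kidney"): 30.0}  # GTEx-like
tumor_expr = {("GENE_1", "kidney"): 40.0, ("GENE_2", "kidney"): 32.0}  # TCGA-like


def compounds_for_overexpressed_kinases(tissue, min_tumor=20.0, max_normal=10.0):
    """Return (compound, accession, gene) rows whose target gene is high in
    tumor samples and low in normal samples of the given tissue."""
    hits = []
    for compound, accession in compound_targets:
        gene = uniprot_to_gene.get(accession)
        if gene is None:
            continue  # no identifier mapping, so the join fails here
        tumor = tumor_expr.get((gene, tissue))
        normal = normal_expr.get((gene, tissue))
        if tumor is None or normal is None:
            continue  # expression missing in one of the two datasets
        if tumor >= min_tumor and normal <= max_normal:
            hits.append((compound, accession, gene))
    return sorted(hits)


result = compounds_for_overexpressed_kinases("kidney")
print(result)
```

With this toy data, `CMPD_A` and `CMPD_C` are returned because both target `UP_1`, whose gene is high in tumor and low in normal kidney. `CMPD_B` is dropped because its target is also highly expressed in normal tissue.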

The diagram marks the Human Protein Atlas as a future enhancement that would add protein-level validation of the expression findings.

By combining multiple curated biomedical datasets into a unified semantic framework, this project explores methods that support scalable data reuse, integration, and interpretability in research settings.

Use Cases


This integrated knowledge graph enables semantic exploration across compounds, kinase targets, gene identifiers, and expression profiles in normal and tumor tissues. Example use cases include:

  • Prioritize kinase targets based on tumor-specific overexpression
  • Identify kinases with low expression in healthy tissues to reduce the risk of on-target toxicity in normal tissue
  • Rank compounds by target gene expression in selected cancer types
  • Filter targets by normal vs. tumor expression contrast across multiple tissues (e.g., GTEx vs. TCGA)
  • Explore compound-target coverage across tumor types for combination strategies
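
Several of these use cases reduce to scoring each target by its tumor vs. normal expression contrast. One possible way to do that, shown here as a minimal sketch, is a log2 fold change with a pseudocount plus an optional cap on normal-tissue expression. The scoring formula, the pseudocount of 1.0, the function names, and the toy values are all assumptions made for this example; the project's actual ranking method is not described in the text.

```python
import math


def log2_fold_change(tumor, normal, pseudocount=1.0):
    """Log2 ratio of tumor to normal expression; the pseudocount avoids
    division by zero for genes that are not expressed."""
    return math.log2((tumor + pseudocount) / (normal + pseudocount))


def rank_targets(expression, max_normal=None):
    """Rank genes by tumor vs. normal contrast, highest first.

    expression maps gene -> (normal, tumor). If max_normal is set, genes
    expressed above that level in normal tissue are filtered out.
    """
    ranked = []
    for gene, (normal, tumor) in expression.items():
        if max_normal is not None and normal > max_normal:
            continue
        ranked.append((gene, log2_fold_change(tumor, normal)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Toy values in arbitrary units: gene -> (normal, tumor). Not real data.
toy = {"GENE_1": (3.0, 31.0), "GENE_2": (15.0, 15.0), "GENE_3": (1.0, 7.0)}

ranking = rank_targets(toy)
print(ranking)
```

Passing `max_normal` combines the first two use cases: targets are ranked by tumor-specific overexpression only among those with low expression in healthy tissue.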