Portfolio Details

CDISC-Compliant Clinical Data Pipeline with Dagster

Project Overview

This project demonstrates a fully automated clinical data pipeline built on CDISC standards and orchestrated with Dagster. The pipeline simulates end-to-end clinical trial data processing, from raw data collection through SDTM mapping, ADaM derivation, QC validation, and TLF generation to regulatory-ready exports.

Key Features

Data Simulation: Patient-level data generation covering demographics, adverse events, medications, exposure, and vitals, using the Faker library.
SDTM Mapping: Fully automated SDTM domain mapping (DM, AE, CM, EX, VS) following CDISC CDASH and SDTM guidelines.
ADaM Derivation: Automated derivation of analysis-ready datasets (ADSL, ADAE, ADVS, ADCM, ADEX).
QC Layer: Automated consistency checks between SDTM and ADaM datasets.
TLF-Ready Datasets: Analytical flags and variables ready for table, listing, and figure generation.
Regulatory Exports: Automatic export to XPT (SAS Transport Format) files that are compatible with Pinnacle21 validation.
Full Dagster Orchestration: Asset-based orchestration with IO Managers, resource management and multi-study partitioning.
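The mapping, derivation, and QC features listed above can be sketched with plain pandas. The example below is a simplified illustration, not the project's implementation: the raw column names (subject, term, severity, start_date), the function names, and the treatment-emergent rule (onset on or after the first dose date) are assumptions made for this sketch.

```python
import pandas as pd


def map_ae(raw: pd.DataFrame, studyid: str) -> pd.DataFrame:
    """Map hypothetical raw adverse-event columns to a few SDTM AE variables."""
    ae = pd.DataFrame(
        {
            "STUDYID": studyid,
            "DOMAIN": "AE",
            "USUBJID": studyid + "-" + raw["subject"],
            "AETERM": raw["term"],
            "AESEV": raw["severity"].str.upper(),
            # SDTM stores dates as ISO 8601 text.
            "AESTDTC": pd.to_datetime(raw["start_date"]).dt.strftime("%Y-%m-%d"),
        }
    )
    ae = ae.sort_values(["USUBJID", "AESTDTC"]).reset_index(drop=True)
    # Sequence number restarts at 1 for each subject.
    ae["AESEQ"] = ae.groupby("USUBJID").cumcount() + 1
    return ae


def derive_adae(ae: pd.DataFrame, adsl: pd.DataFrame) -> pd.DataFrame:
    """Add an analysis start date and a simplified treatment-emergent flag."""
    adae = ae.merge(adsl[["USUBJID", "TRTSDT"]], on="USUBJID", how="left")
    adae["ASTDT"] = pd.to_datetime(adae["AESTDTC"])
    # Assumed rule: an event is treatment-emergent if it starts on or after first dose.
    adae["TRTEMFL"] = (adae["ASTDT"] >= adae["TRTSDT"]).map({True: "Y", False: ""})
    return adae


def qc_sdtm_vs_adam(ae: pd.DataFrame, adae: pd.DataFrame) -> list[str]:
    """Return a list of consistency findings; an empty list means all checks passed."""
    findings = []
    if len(ae) != len(adae):
        findings.append(f"Record count differs: AE={len(ae)}, ADAE={len(adae)}")
    sdtm_keys = set(zip(ae["USUBJID"], ae["AESEQ"]))
    adam_keys = set(zip(adae["USUBJID"], adae["AESEQ"]))
    missing = sdtm_keys - adam_keys
    if missing:
        findings.append(f"{len(missing)} AE record(s) missing from ADAE")
    if adae["TRTSDT"].isna().any():
        findings.append("ADAE contains records without a treatment start date")
    return findings


raw_ae = pd.DataFrame(
    {
        "subject": ["001", "001", "002"],
        "term": ["Headache", "Nausea", "Rash"],
        "severity": ["mild", "moderate", "severe"],
        "start_date": ["2025-01-10", "2025-01-03", "2025-01-20"],
    }
)
adsl = pd.DataFrame(
    {
        "USUBJID": ["STUDY01-001", "STUDY01-002"],
        "TRTSDT": pd.to_datetime(["2025-01-05", "2025-01-15"]),
    }
)

ae = map_ae(raw_ae, "STUDY01")
adae = derive_adae(ae, adsl)
findings = qc_sdtm_vs_adam(ae, adae)
```

Returning QC findings as a list rather than raising on the first problem lets a pipeline report every inconsistency in one run, which is usually more practical when reviewing datasets before export.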

Technologies Used

Python for the full data processing pipeline
Pandas / Pyreadstat for data manipulation and export
Dagster for modern data orchestration
CDISC Standards (SDTM / ADaM)

Conclusion

This personal project simulates a real-world clinical data pipeline, as implemented in CROs, pharma companies or academic research centers, automating clinical trial data processing according to regulatory requirements.

Open for freelance opportunities to build or automate clinical data pipelines in CDISC-compliant environments.

Project information

Category: Clinical Data Engineering
Project date: June 2025
Project URL: GitHub Repository