The deferred debt problem
Most early-stage biotech teams treat data infrastructure as a Phase 2 problem. They run Phase 1 in whatever format the CRO delivers, analyze it in R or SAS with a custom script, and put the results in a slide deck. The plan is to clean it up before the Phase 2 package is due.
By the time Phase 2 arrives, the cleaning project has become a rebuilding project. The original CRO is no longer under contract. The R script only runs under a version of R two releases old. The data dictionary lives in a tab of a spreadsheet that nobody has opened since week three. And the analyst who built it is gone.
This is the bottleneck. Not the science. The data.
The question that triggers the crisis
"Can we pool the Phase 1 PK data with the new cohorts for the Phase 2 submission package?" If the answer involves a multi-week data archaeology project, the Phase 1 data infrastructure was not built for what comes next.
What CDISC actually requires and when it matters
The FDA expects CDISC-compliant datasets for NDAs and BLAs. For early-phase programs it is not a hard legal requirement. But it shapes how reviewers interact with your data, and it shapes how you interact with your own data when you need to do anything non-trivial with it.
The three datasets that matter most for early pharma programs are straightforward in concept. ADPC is the analysis dataset for PK concentrations: the input from which your noncompartmental analysis (NCA) outputs are derived. ADPP is the analysis dataset for PK parameters: AUC, Cmax, half-life, and the other derived values. ADSL is the subject-level dataset: demographics, dosing, treatment arms, and flags for analysis populations.
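In simplified form, the shape of these datasets is easy to sketch. The example below is illustrative only, shown in Python with pandas for brevity (the same structure carries over directly to R and the Pharmaverse tooling); the variable names follow ADaM conventions (USUBJID, PARAMCD, AVAL), but the subjects, values, and flags are invented.

```python
import pandas as pd

# ADSL: one row per subject, with analysis-population flags.
adsl = pd.DataFrame({
    "STUDYID": ["ABC-001"] * 2,
    "USUBJID": ["ABC-001-101", "ABC-001-102"],
    "ARM":     ["10 mg", "30 mg"],
    "SAFFL":   ["Y", "Y"],   # safety population flag
    "PKFL":    ["Y", "N"],   # PK analysis population flag
})

# ADPP: one row per subject per PK parameter.
adpp = pd.DataFrame({
    "STUDYID": ["ABC-001"] * 4,
    "USUBJID": ["ABC-001-101", "ABC-001-101",
                "ABC-001-102", "ABC-001-102"],
    "PARAMCD": ["CMAX", "AUCLST", "CMAX", "AUCLST"],
    "AVAL":    [152.0, 1340.0, 401.0, 3920.0],
    "AVALU":   ["ng/mL", "h*ng/mL", "ng/mL", "h*ng/mL"],
})

# Because both datasets share USUBJID, subject-level filters are one merge away:
pk_pop = adpp.merge(
    adsl.loc[adsl["PKFL"] == "Y", ["USUBJID", "ARM"]],
    on="USUBJID",
)
```

The payoff of the shared key is that "restrict this analysis to the PK population" is a one-line join, not a hand-maintained exclusion list.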
If you build these correctly from the start, even in simplified form, you gain three things: reproducibility across analysts, portability across tools and CROs, and a head start on regulatory submissions that most teams pay for in crunch mode just before a deadline.
The three most common Phase 1 data failures
The same patterns appear repeatedly in early-stage clinical data:
Column naming inconsistency across studies. Study 001 has ANALYTE, Study 002 has analyte_name, Study 003 has Compound. Merging them requires manual intervention every time. The fix is a naming convention applied from the first study and enforced at CRO contract time.
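Until the convention is enforced upstream, the pragmatic stopgap is a single explicit rename map applied at load time, rather than ad hoc renames scattered across analysis scripts. A minimal sketch in Python with pandas, using the column names from the example above (the study contents are invented):

```python
import pandas as pd

# One shared map from every known legacy name to the target convention.
RENAME_MAP = {
    "ANALYTE": "PARAM",       # Study 001
    "analyte_name": "PARAM",  # Study 002
    "Compound": "PARAM",      # Study 003
}

def harmonize(df: pd.DataFrame) -> pd.DataFrame:
    """Rename known legacy columns to the target convention."""
    return df.rename(
        columns={k: v for k, v in RENAME_MAP.items() if k in df.columns}
    )

study1 = pd.DataFrame({"USUBJID": ["001-101"], "ANALYTE": ["drug-x"]})
study2 = pd.DataFrame({"USUBJID": ["002-201"], "analyte_name": ["drug-x"]})

pooled = pd.concat([harmonize(study1), harmonize(study2)], ignore_index=True)
```

Keeping the map in one versioned file means the mapping decisions are documented once, not rediscovered at every merge.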
Derived variables computed in the analysis script, not stored in the dataset. This means any change to the script, even a minor one, makes results from six months ago non-reproducible without running the old script version, which nobody can find. Derived variables that will appear in regulatory outputs belong in the dataset, not in the analysis code.
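The fix is to compute the derived value once, store it in the dataset, and have every analysis script read it rather than recompute it. A sketch in pandas; the column names follow ADaM conventions, and the dose-normalization example and file name are invented for illustration:

```python
import pandas as pd

adpp = pd.DataFrame({
    "USUBJID": ["101", "102"],
    "PARAMCD": ["CMAX", "CMAX"],
    "AVAL":    [150.0, 420.0],
    "DOSE":    [10.0, 30.0],
})

# Dose-normalized Cmax, stored as its own column (or its own PARAMCD rows
# in a full ADaM implementation), then written to a versioned artifact.
adpp["CMAXDN"] = adpp["AVAL"] / adpp["DOSE"]
adpp.to_csv("adpp_v1.csv", index=False)  # analysis scripts read this file
```

Six months later, the reported value is whatever is in `adpp_v1.csv`, regardless of what has happened to the script since.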
No audit trail between raw data and reported values. A reviewer asks: where does this Cmax come from? The answer is "the script" and the script is in a folder called final_v3_REALLY_FINAL. An audit-ready pipeline traces every reported value to the dataset row and the code line that produced it.
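A minimal version of that traceability is to carry provenance columns on every derived value: which source dataset, which row, which code revision. The sketch below echoes ADaM traceability variables (SRCDOM, SRCSEQ); the data are invented, and the hardcoded revision string is a stand-in for capturing the real commit hash (e.g. via `git rev-parse HEAD`) at run time:

```python
import pandas as pd

CODE_REV = "a1b2c3d"  # placeholder; capture the real commit hash at run time

# Concentration records, each with a sequence number from the raw data.
adpc = pd.DataFrame({
    "USUBJID": ["101", "101"],
    "PCSEQ":   [1, 2],
    "AVAL":    [0.0, 152.0],
})

# Cmax per subject, keeping the row it came from.
cmax = (adpc.sort_values("AVAL", ascending=False)
            .groupby("USUBJID", as_index=False)
            .first())
cmax["SRCDOM"] = "ADPC"                           # source dataset
cmax = cmax.rename(columns={"PCSEQ": "SRCSEQ"})   # source row
cmax["CODEREV"] = CODE_REV                        # code that produced it
```

When the reviewer asks where a Cmax comes from, the answer is a dataset name, a row number, and a commit, not a folder name.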
What good looks like at 10 people
You do not need a data engineering team to avoid these problems. You need three things applied consistently from the start of the first study.
A consistent naming convention aligned to CDISC from day one. Pick CDISC-aligned variable names for your internal datasets, even before you are required to submit them. The cost of switching later is far higher than the cost of learning the convention now. The Pharmaverse package ecosystem makes this practical for R teams without a SAS background.
A separation between raw data, derived data, and reported values. Raw data is what the CRO delivers. Derived data is what your pipeline produces: flagged populations, computed parameters, normalized values. Reported values are what appears in the report. These should be three distinct, versioned artifacts, not three tabs in the same spreadsheet.
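The separation can be as simple as three directories and a rule that the pipeline only ever writes downstream. A sketch, with invented file names and version tags; in practice the derived step would be the full CDISC-aligned build:

```python
from pathlib import Path
import pandas as pd

root = Path("data")
for d in ("raw", "derived", "reported"):
    (root / d).mkdir(parents=True, exist_ok=True)

# Raw: exactly as the CRO delivered it, never edited in place.
raw = pd.DataFrame({"USUBJID": ["101"], "CONC": [152.0]})
raw.to_csv(root / "raw" / "pk_conc_cro_2024-06-01.csv", index=False)

# Derived: what the pipeline produces (flags, parameters, normalized values).
derived = raw.assign(PKFL="Y")
derived.to_csv(root / "derived" / "adpc_v1.csv", index=False)

# Reported: exactly the values that appear in the report.
reported = derived.loc[derived["PKFL"] == "Y", ["USUBJID", "CONC"]]
reported.to_csv(root / "reported" / "table_14_2_1_v1.csv", index=False)
```

Each artifact is versioned independently, so "which raw file produced this reported table" has a mechanical answer.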
A pipeline that is code, not a spreadsheet. Even a simple R Markdown or Quarto document, tracked in Git, is far more reproducible than a macro-heavy Excel workbook. It runs the same way every time. It tells you exactly what it does. And it can be handed to the next person, or to a CRO, or to a regulatory reviewer, without a three-hour briefing.
The Phase 2 math
A Phase 2 program that hits data problems at the submission package stage typically loses four to eight weeks. For a team burning $500K per month in operational costs, that is $500K to $1M in burn alone, before counting the cost of a delayed readout. All of it was preventable with two months of upfront data work in Phase 1.
More practically: a well-structured Phase 1 dataset takes six to ten weeks to build correctly from scratch, including the pipeline, the documentation, and the QC. Rebuilt from a poorly structured legacy dataset under deadline pressure, with a CRO who was not part of the original work, it takes twice as long and costs three times as much.
Where to start
If your Phase 1 data is already structured in a way that makes you uncomfortable reading this, the practical starting point is not a full rework. It is a data audit: what do you have, what is missing, and what would it take to make the next analysis reproducible by someone who was not in the room when the first one ran.
The audit typically surfaces three or four specific problems, each with a clear fix. That is a two-week project, not a six-month one. And it is a substantially better use of two weeks now than four weeks of fire-fighting nine months from now when the Phase 2 package is due.
Key takeaway
The argument for building CDISC-aligned data infrastructure in Phase 1 is not about regulatory perfectionism. It is about not paying the retrofit tax at the worst possible moment. A clean Phase 1 dataset takes weeks to build. Rebuilding it from legacy data under Phase 2 deadline pressure takes months. The math is straightforward.