The three moments where CDISC errors originate
CDISC errors are usually not mistakes in the CDISC mapping itself. They originate at three upstream decision points:
- Data collection design: a case report form that collects data in a format that does not map cleanly to SDTM. Composite fields ("pain at rest / pain on movement" in a single text field), non-standard terminology that does not align with CDISC CT, and missing timing information are the most common issues.
- Database structure: clinical databases that store related concepts in ways that require complex joins to reconstruct a single SDTM domain. A single domain observation may need data from 5-6 database tables.
- Mapping approach: choosing to map non-standard data collection to standard SDTM structures rather than flagging the deviation, which creates technically valid but scientifically misleading representations.
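The composite-field problem from the first bullet is best handled with an explicit pre-mapping split rather than forcing the combined text into one record. A minimal sketch, assuming the composite field arrives as a single text column (the column names, separators, and subject ID are hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical raw CRF extract: one text field holds two measurements
raw_crf <- data.frame(
  USUBJID   = "SUBJ-001",
  PAIN_TEXT = "pain at rest: 3 / pain on movement: 6"
)

# Split the composite field into one row per concept before SDTM mapping
pain_long <- raw_crf %>%
  separate_rows(PAIN_TEXT, sep = " / ") %>%
  separate(PAIN_TEXT, into = c("TEST", "RESULT"), sep = ": ") %>%
  mutate(TEST = toupper(trimws(TEST)))
```

Each resulting row now carries a single test and a single result, which maps cleanly to one SDTM observation.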
SDTM mistake 1: topic variable contamination
Every SDTM domain has a topic variable: the primary descriptor of what was measured or observed. For PC (pharmacokinetics), it is PCTESTCD. For AE (adverse events), it is AETERM.
Contamination occurs when non-topic information is embedded in the topic variable: "Nausea (Grade 2)" instead of "Nausea" in AETERM (severity belongs in AESEV), or "Plasma glucose fasting" in LBTESTCD (the method belongs in LBMETHOD).
# WRONG: method information in LBTESTCD
data.frame(
  LBTESTCD = "GLUCFAST",  # "fasting" is a method modifier, not part of the test code
  LBORRES  = "5.2"
)

# CORRECT
data.frame(
  LBTESTCD = "GLUC",
  LBMETHOD = "FASTING",
  LBORRES  = "5.2"
)
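Contamination like this can be caught programmatically before mapping review. A sketch of a simple screen, where the list of modifier substrings is an assumption to be extended per study:

```r
library(dplyr)

# Hypothetical lab extract
lb <- data.frame(
  LBTESTCD = c("GLUC", "GLUCFAST", "CHOL"),
  LBORRES  = c("5.2", "4.8", "4.1")
)

# Assumed modifier substrings that should never appear in a test code
modifier_pattern <- "FAST|POST|PRE"

# Flag test codes that embed method or timing modifiers
contaminated <- lb %>%
  filter(grepl(modifier_pattern, LBTESTCD))
```

Any row surfaced here needs the modifier moved to its proper qualifier variable (LBMETHOD, LBTPT, etc.) before the domain is finalized.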
SDTM mistake 2: timing variable chaos
SDTM has a rich timing model (--DTC, --STDTC, --ENDTC, --TPT, --TPTNUM, --TPTREF, --RFTDTC). The most common mistake is inconsistency: using PCTPT as a free-text field with values like "Pre-dose", "30 min", "1 hour" and "1hr" in the same column, or confusing the nominal time (PCTPTNUM) with the actual collection time (PCDTC).
library(dplyr)

# Standardize timing variables before SDTM mapping
pc_clean <- raw_pk %>%
  mutate(
    # Nominal time: always numeric hours from dose
    PCTPTNUM = as.numeric(nominal_time_h),
    # Actual collection datetime: ISO 8601
    PCDTC = format(as.POSIXct(collection_datetime), "%Y-%m-%dT%H:%M"),
    # Text label: controlled term from CDISC CT
    PCTPT = case_when(
      PCTPTNUM == 0   ~ "PREDOSE",
      PCTPTNUM == 0.5 ~ "30 MIN POST-DOSE",
      PCTPTNUM == 1   ~ "1 HR POST-DOSE",
      TRUE            ~ paste0(PCTPTNUM, " HR POST-DOSE")
    )
  )
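The mixed free-text labels described above are easy to detect before mapping: each nominal time should map to exactly one text label. A quick consistency check (column names follow the PC example; the sample data is illustrative):

```r
library(dplyr)

# Hypothetical PC extract with inconsistent free-text timepoint labels
pc <- data.frame(
  PCTPTNUM = c(0.5, 0.5, 1, 1),
  PCTPT    = c("30 min", "30 MIN POST-DOSE", "1 HR POST-DOSE", "1hr")
)

# Each nominal time should have exactly one label across the dataset
inconsistent <- pc %>%
  group_by(PCTPTNUM) %>%
  summarise(n_labels = n_distinct(PCTPT), .groups = "drop") %>%
  filter(n_labels > 1)
```

A non-empty result means the standardization step above has not been applied uniformly.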
ADaM mistake 1: DTYPE misuse
DTYPE is an ADaM variable used to identify derived records: imputed values, LOCF carries, or analysis-specific records that do not appear in SDTM. The common mistake is using DTYPE for records that should have been handled differently: for example, DTYPE="BASELINE" to flag baseline records instead of using the ABLFL flag, or DTYPE="DERIVED" for records that are not derived at all but simply come from a different timepoint.
# CORRECT baseline flagging in ADaM
adlb <- adlb %>%
  group_by(USUBJID, PARAMCD) %>%
  mutate(
    # ABLFL: baseline record flag ("Y" or "")
    ABLFL = if_else(
      VISIT == "BASELINE" & !is.na(AVAL),
      "Y", ""
    ),
    # BASE: baseline value carried to all records
    BASE = AVAL[ABLFL == "Y"][1]
  ) %>%
  ungroup()
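For contrast, a legitimate DTYPE use is an imputed record that genuinely does not exist in SDTM, such as an LOCF carry-forward. A minimal sketch, assuming a missing Week 8 value is imputed from Week 4 (dataset slice and visit names are illustrative):

```r
library(dplyr)

# Hypothetical ADLB slice: Week 8 value missing for one subject
adlb <- data.frame(
  USUBJID = "SUBJ-001", PARAMCD = "GLUC",
  AVISIT  = c("WEEK 4", "WEEK 8"),
  AVAL    = c(5.4, NA)
)

# Build the imputed record: carry Week 4 forward, flag it as derived
locf_row <- adlb %>%
  filter(AVISIT == "WEEK 4") %>%
  mutate(AVISIT = "WEEK 8", DTYPE = "LOCF")

# Observed records keep DTYPE empty; only the imputed record is flagged
adlb_locf <- adlb %>%
  filter(!(AVISIT == "WEEK 8" & is.na(AVAL))) %>%
  mutate(DTYPE = NA_character_) %>%
  bind_rows(locf_row)
```

DTYPE here marks a record the analysis created; it is not repurposed as a general label for records that already exist.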
ADaM mistake 2: population flag inconsistencies
Every clinical trial has multiple analysis populations: intent-to-treat (ITT), per-protocol (PP), safety. These are defined by ADSL flags (ITTFL, PPFL, SAFFL). The mistake occurs when:
- The flag definitions in ADSL do not match the SAP definitions exactly.
- Different ADaM datasets implement slightly different population logic for the same population.
- A subject who should be in the safety population is excluded due to a missing flag rather than an explicit exclusion criterion.
# ADSL population flags should be derived once and joined everywhere
adsl <- adsl %>%
  mutate(
    ITTFL = if_else(!is.na(TRTSDT), "Y", "N"),
    SAFFL = if_else(ITTFL == "Y" & at_least_one_dose, "Y", "N"),
    PPFL  = if_else(SAFFL == "Y" & no_major_protocol_dev, "Y", "N")
  )

# Every analysis dataset joins from ADSL - never recomputes
adlb <- adlb %>%
  left_join(
    adsl %>% select(USUBJID, ITTFL, SAFFL, PPFL),
    by = "USUBJID"
  )
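Even with the join pattern, drift can creep in if an older dataset was built before an ADSL update. A sketch of a cross-dataset consistency check (dataset and flag names follow the examples above; the sample data is illustrative):

```r
library(dplyr)

# Hypothetical flag values: ADSL is the source of truth
adsl_flags <- data.frame(USUBJID = c("S1", "S2"), SAFFL = c("Y", "N"))
adae       <- data.frame(USUBJID = c("S1", "S2"), SAFFL = c("Y", "Y"))  # S2 drifted

# Any subject whose flag differs between datasets is a finding
mismatches <- adae %>%
  inner_join(adsl_flags, by = "USUBJID", suffix = c(".adae", ".adsl")) %>%
  filter(SAFFL.adae != SAFFL.adsl)
```

Running a check like this across every analysis dataset before lock turns a silent population inconsistency into an explicit finding.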
Automated validation with Pinnacle 21
Pinnacle 21 Community validates CDISC datasets against FDA and PMDA conformance rules. Run it before any internal review; it catches structural errors that manual review misses.
The output is a report of findings at three levels: errors (will cause submission rejection), warnings (should be addressed), and notices (informational). Target: zero errors, minimal warnings with documented rationale for each.
The define.xml that reviewers actually read
The define.xml is the metadata file that describes every dataset, variable, and code list in the submission. FDA reviewers use it to navigate the data packages. A poorly written define.xml (missing variable labels, missing code list definitions, broken links to analysis datasets) slows review and triggers information requests.
Key requirements: every variable must have a label (not just a name), every code list must be defined with all used values, and all derived variables in ADaM must have their derivation documented in the comments field.
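The label requirement can be screened mechanically if the define.xml metadata is available in tabular form (for example, exported from a specification sheet). A sketch, assuming a simple variable-level metadata table whose column names are hypothetical:

```r
library(dplyr)

# Hypothetical variable-level metadata extract
var_meta <- data.frame(
  Dataset  = c("ADSL", "ADSL", "ADLB"),
  Variable = c("USUBJID", "ITTFL", "BASE"),
  Label    = c("Unique Subject Identifier", "", "Baseline Value"),
  stringsAsFactors = FALSE
)

# Any variable with a blank or missing label is a define.xml finding
missing_labels <- var_meta %>%
  filter(is.na(Label) | trimws(Label) == "")
```

The same pattern extends to code lists: compare the distinct values used in the data against the values defined in the metadata and flag the difference.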
A practical QC checklist
Before submitting a CDISC data package:
- Zero Pinnacle 21 errors
- All timing variables consistent within and across domains
- Topic variables free of method or severity modifiers
- ADSL population flags match SAP definitions exactly
- All ADaM datasets join population flags from ADSL, not recomputed
- DTYPE used only for derived records, not as a general labeling mechanism
- Define.xml variable labels match ADaM specification sheet
- All code list values used in data are defined in define.xml
Key takeaway
CDISC errors cluster around a small number of recurring patterns. The most damaging ones originate in data collection design decisions, not in the CDISC mapping itself. The fix is to involve CDISC expertise early in the study design, not just at the mapping stage.