CDISC in practice: common mistakes and how to avoid them

TL;DR

Most CDISC compliance issues are not caused by missing knowledge of the standard. They are caused by decisions made before anyone opens a CDISC implementation guide: in the data collection design, the database structure, and the mapping approach. These are the mistakes that sink submissions.

The three moments where CDISC errors originate

CDISC errors are usually not caused by misreading the standard. They originate at three decision points, most of them upstream of the mapping work:

  • Data collection design: a case report form that collects data in a format that does not map cleanly to SDTM. Composite fields ("pain at rest / pain on movement" in a single text field), non-standard terminology that does not align with CDISC CT, and missing timing information are the most common issues.
  • Database structure: clinical databases that store related concepts in ways that require complex joins to reconstruct a single SDTM domain. A single domain observation may need data from 5-6 database tables.
  • Mapping approach: choosing to map non-standard data collection to standard SDTM structures rather than flagging the deviation, which creates technically valid but scientifically misleading representations.
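As a minimal sketch of the composite-field problem, the base R snippet below splits a free-text "rest / movement" pain field into one record per concept. The raw column names, the use of a QS-style layout, and the test codes PAINREST and PAINMOVE are hypothetical and only serve the illustration.

```r
# Hypothetical raw CRF extract: one free-text field holding two pain scores
raw_pain <- data.frame(
  SUBJID   = c("001", "002", "003"),
  pain_txt = c("3 / 6", "2/5", "4"),   # "rest / movement"; subject 003 is incomplete
  stringsAsFactors = FALSE
)

# Split on "/" and trim whitespace; return NA when a part is missing
parts    <- strsplit(raw_pain$pain_txt, "/", fixed = TRUE)
get_part <- function(x, i) if (length(x) >= i) trimws(x[i]) else NA_character_

# One row per subject and concept (test codes are made up for this example)
pain_long <- data.frame(
  SUBJID   = rep(raw_pain$SUBJID, each = 2),
  QSTESTCD = rep(c("PAINREST", "PAINMOVE"), times = nrow(raw_pain)),
  QSORRES  = as.vector(rbind(
    vapply(parts, get_part, character(1), i = 1),
    vapply(parts, get_part, character(1), i = 2)
  )),
  stringsAsFactors = FALSE
)

# Rows that could not be split call for a data query, not a guess
needs_query <- pain_long[is.na(pain_long$QSORRES), ]
```

Collecting the two scores in separate CRF fields makes this step unnecessary, which is the cheaper fix.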

SDTM mistake 1: topic variable contamination

Every SDTM domain has a topic variable: the primary descriptor of what was measured or observed. For PC (pharmacokinetics), it is PCTESTCD. For AE (adverse events), it is AETERM.

Contamination occurs when non-topic information is embedded in the topic variable: "Nausea (Grade 2)" instead of "Nausea" in AETERM (severity belongs in AESEV), or a fasting qualifier folded into the lab test code, "GLUCFAST" instead of "GLUC" in LBTESTCD (fasting status belongs in LBFAST; LBMETHOD is reserved for the analytical method).

# WRONG: fasting status embedded in LBTESTCD
data.frame(
  LBTESTCD = "GLUCFAST",   # "fasting" is a qualifier, not part of the test code
  LBORRES  = "5.2"
)

# CORRECT
data.frame(
  LBTESTCD = "GLUC",
  LBFAST   = "Y",          # fasting status has its own variable
  LBORRES  = "5.2"
)
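A simple screen can surface contaminated topic values before mapping. The sketch below assumes a small AE extract and a hypothetical regular expression for grade and severity wording. It only flags records for review: AETERM carries the reported term, so the fix usually belongs in a data query or in the CRF design, not in a silent rewrite.

```r
# Hypothetical AE extract
ae <- data.frame(
  USUBJID = c("S-001", "S-002", "S-003"),
  AETERM  = c("Nausea (Grade 2)", "Headache", "Severe dizziness"),
  stringsAsFactors = FALSE
)

# Illustrative pattern for grade or severity wording; extend it to fit your data
modifier_pattern <- "\\bgrade\\s*[0-9]\\b|\\b(mild|moderate|severe)\\b"

# TRUE when the reported term contains a severity or grade modifier
ae$TOPIC_FLAG <- grepl(modifier_pattern, ae$AETERM, ignore.case = TRUE, perl = TRUE)

# Records to send for review
contaminated <- ae[ae$TOPIC_FLAG, c("USUBJID", "AETERM")]
```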

SDTM mistake 2: timing variable chaos

SDTM has a rich timing model (--DTC, --STDTC, --ENDTC, --TPT, --TPTNUM, --TPTREF, --RFTDTC). The most common mistake is inconsistency: using PCTPT as a free-text field with values like "Pre-dose", "30 min", "1 hour" and "1hr" in the same column, or confusing the nominal time (PCTPTNUM) with the actual collection time (PCDTC).

library(dplyr)

# Standardize timing variables before SDTM mapping
# (raw_pk, nominal_time_h and collection_datetime come from the raw extract)
pc_clean <- raw_pk %>%
  mutate(
    # Planned time point as a number (here: nominal hours from dose)
    PCTPTNUM = as.numeric(nominal_time_h),
    # Actual collection datetime: ISO 8601, parsed in a fixed time zone
    PCDTC    = format(as.POSIXct(collection_datetime, tz = "UTC"), "%Y-%m-%dT%H:%M"),
    # Text label: exactly one sponsor-defined spelling per planned time point
    PCTPT    = case_when(
      is.na(PCTPTNUM)   ~ NA_character_,   # avoid labels like "NA HR POST-DOSE"
      PCTPTNUM == 0     ~ "PREDOSE",
      PCTPTNUM == 0.5   ~ "30 MIN POST-DOSE",
      PCTPTNUM == 1     ~ "1 HR POST-DOSE",
      TRUE              ~ paste0(PCTPTNUM, " HR POST-DOSE")
    )
  )
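Once the labels are standardized, a quick consistency check can confirm that every numeric time point carries exactly one text label. This is a base R sketch on made-up data; the deliberately inconsistent "1hr" row shows what the check catches.

```r
# Hypothetical PC records, including one non-standard label
pc <- data.frame(
  PCTPTNUM = c(0, 0.5, 1, 1, 0),
  PCTPT    = c("PREDOSE", "30 MIN POST-DOSE", "1 HR POST-DOSE", "1hr", "PREDOSE"),
  stringsAsFactors = FALSE
)

# Distinct (number, label) pairs, then count labels per numeric time point
pairs          <- unique(pc[, c("PCTPTNUM", "PCTPT")])
labels_per_num <- tapply(pairs$PCTPT, pairs$PCTPTNUM, length)

# Time points that map to more than one label
inconsistent <- names(labels_per_num)[labels_per_num > 1]
```

Here `inconsistent` holds the time point 1, which has two spellings in the data.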

ADaM mistake 1: DTYPE misuse

DTYPE is an ADaM variable that identifies derived records: imputed values, LOCF carry-forwards, or analysis-specific records that do not appear in SDTM. The common mistake is to use it as a general-purpose label, for example setting DTYPE="BASELINE" to flag baseline records instead of using the ABLFL flag, or setting DTYPE="DERIVED" on records that are not derived at all but simply come from a different timepoint.

# CORRECT baseline flagging in ADaM
adlb <- adlb %>%
  group_by(USUBJID, PARAMCD) %>%
  mutate(
    # ABLFL: baseline record flag ("Y" or ""); %in% keeps a missing VISIT from producing NA
    ABLFL = if_else(
      VISIT %in% "BASELINE" & !is.na(AVAL),
      "Y", ""
    ),
    # BASE: baseline value carried to all records (NA when the group has no baseline)
    BASE  = AVAL[which(ABLFL == "Y")][1]
  ) %>%
  ungroup()
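For contrast, here is a legitimate use of DTYPE: a record created by last observation carried forward. The sketch uses base R, a single subject and parameter, and a hypothetical visit schedule (`planned_visits`); real data would need the same logic applied per USUBJID and PARAMCD.

```r
# Observed records for one subject and parameter; visit 3 was not collected
adlb <- data.frame(
  USUBJID = "S-001",
  PARAMCD = "GLUC",
  AVISITN = c(1, 2),
  AVAL    = c(5.2, 5.6),
  DTYPE   = "",
  stringsAsFactors = FALSE
)
planned_visits <- c(1, 2, 3)   # hypothetical schedule

# Build one new record per missing visit from the latest earlier observation
missing_visits <- setdiff(planned_visits, adlb$AVISITN)
locf_rows <- lapply(missing_visits, function(v) {
  prior <- adlb[adlb$AVISITN < v & !is.na(adlb$AVAL), ]
  if (nrow(prior) == 0) return(NULL)   # nothing to carry forward
  last <- prior[which.max(prior$AVISITN), ]
  last$AVISITN <- v
  last$DTYPE   <- "LOCF"               # derived record: this is what DTYPE is for
  last
})

# Observed rows keep an empty DTYPE; only the added row is marked
adlb_locf <- rbind(adlb, do.call(rbind, locf_rows))
adlb_locf <- adlb_locf[order(adlb_locf$AVISITN), ]
```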

ADaM mistake 2: population flag inconsistencies

Every clinical trial has multiple analysis populations: intent-to-treat (ITT), per-protocol (PP), safety. These are defined by ADSL flags (ITTFL, PPFL, SAFFL). The mistake occurs when:

  • The flag definitions in ADSL do not match the SAP definitions exactly.
  • Different ADaM datasets implement slightly different population logic for the same population.
  • A subject who should be in the safety population is excluded due to a missing flag rather than an explicit exclusion criterion.
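
# ADSL population flags should be derived once and joined everywhere.
# Illustrative rules only: the SAP definitions govern. ITT is shown here as
# "all randomized subjects" (RANDDT present); adapt it to your SAP.
adsl <- adsl %>%
  mutate(
    ITTFL  = if_else(!is.na(RANDDT), "Y", "N"),
    SAFFL  = if_else(ITTFL == "Y" & at_least_one_dose, "Y", "N"),
    PPFL   = if_else(SAFFL == "Y" & no_major_protocol_dev, "Y", "N")
  )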

# Every analysis dataset joins from ADSL - never recomputes
adlb <- adlb %>%
  left_join(
    adsl %>% select(USUBJID, ITTFL, SAFFL, PPFL),
    by = "USUBJID"
  )
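After the join, it is worth checking that no analysis record ended up with a missing flag, since that is how a subject gets dropped silently. The following base R sketch uses made-up data in which one subject is absent from ADSL; `merge()` with `all.x = TRUE` plays the role of the left join.

```r
# Hypothetical ADSL and ADLB; S-003 has lab data but no ADSL row
adsl <- data.frame(
  USUBJID = c("S-001", "S-002"),
  SAFFL   = c("Y", "N"),
  stringsAsFactors = FALSE
)
adlb <- data.frame(
  USUBJID = c("S-001", "S-002", "S-003"),
  AVAL    = c(5.2, 6.1, 4.8),
  stringsAsFactors = FALSE
)

# Left join: keep every ADLB record, flags come from ADSL only
adlb_flagged <- merge(adlb, adsl, by = "USUBJID", all.x = TRUE)

# Subjects whose flag is missing after the join
orphans <- adlb_flagged$USUBJID[is.na(adlb_flagged$SAFFL)]

# Flag values outside the expected "Y"/"N" set
bad_values <- setdiff(unique(adsl$SAFFL), c("Y", "N"))
```

Any entry in `orphans` or `bad_values` should be resolved in ADSL itself, so that every downstream dataset inherits the correction.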

Automated validation with Pinnacle 21

Pinnacle 21 Community validates CDISC datasets against FDA and PMDA conformance rules. Run it before any internal review: it catches structural errors that manual review misses.

The output is a report of findings at three levels: errors (serious conformance failures that put the submission at risk), warnings (should be addressed), and notices (informational). Target: zero errors, and as few warnings as possible, each with a documented rationale.

The define.xml that reviewers actually read

The define.xml is the metadata file that describes every dataset, variable, and code list in the submission. FDA reviewers use it to navigate the data package. A poorly written define.xml (missing variable labels, missing code list definitions, broken links to analysis datasets) slows review and triggers information requests.

Key requirements: every variable must have a label (not just a name), every code list must be defined with all used values, and all derived variables in ADaM must have their derivation documented in the comments field.
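Two of these requirements are easy to check mechanically once the metadata is available in tabular form. The sketch below assumes the define.xml content has already been extracted into plain R objects; `var_meta`, `codelist_aesev`, and the sample data are hypothetical, and no XML parsing is shown.

```r
# Hypothetical variable metadata extracted from define.xml
var_meta <- data.frame(
  variable = c("USUBJID", "AESEV", "AETERM"),
  label    = c("Unique Subject Identifier", "Severity/Intensity", ""),
  stringsAsFactors = FALSE
)

# Hypothetical code list for AESEV as declared in define.xml
codelist_aesev <- c("MILD", "MODERATE", "SEVERE")

# Values actually present in the data (note the mixed-case entry)
ae <- data.frame(AESEV = c("MILD", "SEVERE", "Severe"), stringsAsFactors = FALSE)

# Variables with a missing or blank label
unlabeled <- var_meta$variable[is.na(var_meta$label) | trimws(var_meta$label) == ""]

# Data values that the code list does not define
undefined_values <- setdiff(unique(ae$AESEV), codelist_aesev)
```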

A practical QC checklist

Before submitting a CDISC data package:

  • Zero Pinnacle 21 errors
  • All timing variables consistent within and across domains
  • Topic variables free of method or severity modifiers
  • ADSL population flags match SAP definitions exactly
  • All ADaM datasets join population flags from ADSL, not recomputed
  • DTYPE used only for derived records, not as a general labeling mechanism
  • define.xml variable labels match the ADaM specification sheet
  • All code list values used in data are defined in define.xml

Key takeaway

CDISC errors cluster around a small number of recurring patterns. The most damaging ones originate in data collection design decisions, not in the CDISC mapping itself. The fix is to involve CDISC expertise early in the study design, not just at the mapping stage.

Aslane Mortreau

Freelance Data & AI specialist working with pharmaceutical, biotech, and cosmetic R&D teams. Statistical modeling, analytical pipelines, and custom applications.
