
The hidden cost of manual statistical reporting in drug development

TL;DR

Every time an analyst copies a p-value from SAS output into a Word table, a small error becomes possible. Across a drug development program, these small errors compound into a significant liability. Automated reporting eliminates the category entirely.

The copy-paste risk in clinical reporting

The standard workflow in most pharma companies and CROs looks like this: a statistician runs SAS or R code, scans the output, and manually copies numbers into a Word or PowerPoint template. A medical writer reviews the numbers. A QC analyst independently extracts the same numbers and compares. Everyone signs off.

This workflow feels rigorous because it involves multiple people. But it has a fundamental flaw: the connection between the analysis and the document is human. Humans make transcription errors. They copy the wrong cell. They format 0.0023 as 0.023. They paste last week's table into this week's report.

Quantifying the real cost

A reasonable estimate for a mid-size Phase II trial: the statistical reporting package (tables, listings, figures) takes 3-4 weeks to produce manually. Of that time, roughly 40% is spent on QC, specifically on finding and correcting transcription errors. And that is one study, one reporting cycle. Most programs have 8-12 major reporting cycles before submission.

The cost compounds further when an amendment changes the analysis plan. Every table that touches the amended endpoint must be regenerated manually, re-QCed, and re-approved. In a manual workflow, this takes days. In an automated workflow, it takes minutes.

Where errors actually occur

Data integrity findings from FDA inspection reports cluster around a few recurring patterns:

  • Rounding inconsistencies: the same value reported as 12.3% in the text and 12.34% in the table.
  • Version mismatches: a table from an earlier dataset version persists in the final document.
  • Denominator errors: a percentage calculated against the wrong population (ITT vs safety set).
  • Transposition errors: treatment and control columns swapped in a manually built table.

None of these errors require intent. They are structural consequences of disconnecting the analysis from the document.
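Rounding inconsistencies in particular can be engineered away: route every percentage through a single formatting helper so the text, the tables, and the footnotes can never disagree. A minimal sketch in R (the helper name is ours, not a standard function):

```r
# One formatter used everywhere a percentage appears.
# If the precision policy changes, it changes in exactly one place.
fmt_pct <- function(x, digits = 1) {
  sprintf(paste0("%.", digits, "f%%"), x)
}

fmt_pct(12.3412)   # "12.3%" in the text, the table, and the footnote
```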

The automated reporting alternative

In an automated reporting pipeline, the document is the analysis. There is no copy-paste step because the numbers flow directly from the analysis dataset to the formatted output via code. When the data changes, you re-render. When the analysis plan changes, you update the code and re-render. The document is always consistent with the analysis.

Implementation with R and Quarto

The core technology stack is simple: R for statistical computation, gt or flextable for table formatting, and Quarto for document generation. The key principle is that every number in the document must be computed, not typed.

```yaml
---
title: "Clinical Study Report - Section 11.4"
params:
  dataset: "adsl_v2_locked.sas7bdat"
---
```

```{r}
library(haven)
library(dplyr)
library(gt)

adsl <- read_sas(params$dataset)

# Demographic summary - numbers never typed, always computed
demog_table <- adsl %>%
  group_by(TRT01P) %>%
  summarise(
    n        = n(),
    age_mean = round(mean(AGE), 1),
    age_sd   = round(sd(AGE), 1),
    female_n = sum(SEX == "F"),
    female_pct = round(mean(SEX == "F") * 100, 1)
  )

gt(demog_table) %>%
  tab_header(title = "Table 11.4.1 - Demographic Characteristics")
```
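The same principle extends to the narrative text around the tables. Quarto inline R expressions pull values from the computed objects, so a sentence can never drift out of sync with the table it describes. A sketch continuing from `demog_table` above:

```markdown
The analysis population comprised `r sum(demog_table$n)` subjects.
Mean age was `r round(mean(adsl$AGE), 1)` years across treatment arms.
```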

When the locked dataset changes (late enrollment, data cleaning), you change one line (the dataset path) and re-render. Every table in the document updates automatically.
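Because the dataset path is a Quarto parameter, the re-render does not even require editing the document: parameters can be overridden from the command line with `-P`. Something like this (both filenames hypothetical):

```shell
quarto render csr-section-11-4.qmd -P dataset:adsl_v3_locked.sas7bdat
```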

The regulatory argument for automation

FDA guidance on data integrity (2018) explicitly states that systems should prevent data alterations that are not documented. A manual copy-paste workflow violates this principle by design: the transfer step is undocumented and uncontrolled.

An automated pipeline with version-controlled code and a locked analysis dataset satisfies this requirement structurally. The code is the documentation of what was computed, and Git provides the audit trail for every change.
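In practice the audit trail is just standard Git history. For example (tag and file names illustrative):

```shell
# Pin the exact code state behind a delivered reporting package
git tag csr-v2-delivery

# Later, show every change to the document source, with author and date
git log --oneline -- csr-section-11-4.qmd
```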

Getting started

The fastest path to automated reporting is not to rebuild everything at once. Pick the one table you regenerate most often after data amendments, typically the demographic summary or the primary endpoint table, and automate that first. Prove the concept, then expand.

The investment is typically 2-3 days to set up the infrastructure and template for the first table. After that, each additional table takes hours, not days.


Key takeaway

Manual statistical reporting is not a people problem. It is a systems problem. The copy-paste step is the defect, and automation removes it entirely. The ROI is measurable in QC hours saved per reporting cycle, and the regulatory risk reduction is real.


Aslane Mortreau

Freelance Data & AI specialist working with pharmaceutical, biotech, and cosmetic R&D teams. Statistical modeling, analytical pipelines, and custom applications.
