AI in drug discovery: what the hype gets wrong

What AlphaFold actually proved

AlphaFold 2 is a genuine scientific achievement. It solved the protein structure prediction problem (given a sequence of amino acids, predict the three-dimensional folded structure) at near-experimental accuracy for most proteins. The problem had resisted decades of effort by thousands of researchers, and a deep learning system solved it.

What AlphaFold proved is that some well-defined biological prediction problems can be solved at scale by deep learning. Protein structure prediction is such a problem because it has a clear input (sequence), a clear output (3D coordinates), and decades of experimental structures to train on. The loss function is unambiguous. The benchmark is unambiguous.
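To make "unambiguous benchmark" concrete, here is a minimal sketch of a GDT_TS-style score, the kind of metric used to assess predicted structures in the CASP experiments. It is a simplification under stated assumptions: both structures are taken to be already superimposed, with one coordinate per residue, whereas the real metric also searches over superpositions. The function name and the toy coordinates are illustrative, not taken from any benchmark.

```python
import math


def gdt_ts_presuperimposed(predicted, experimental, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS-style score in [0, 100].

    Assumption: the two structures are already superimposed and each list
    holds one (x, y, z) coordinate per residue, in the same residue order.
    """
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Per-residue distance between predicted and experimental positions (in angstroms).
    distances = [math.dist(p, e) for p, e in zip(predicted, experimental)]
    # Fraction of residues within each distance cutoff, averaged over the cutoffs.
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy example (made-up coordinates): four residues, displaced by
# 0.5, 1.5, 3.0 and 10.0 angstroms from their experimental positions.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 10.0, 0.0)]

score = gdt_ts_presuperimposed(predicted, experimental)
print(score)  # 56.25
```

The point is less the formula than the fact that one exists: given a predicted and an experimental structure, anyone can compute the same number.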

The well-defined problem problem

Most of drug discovery does not look like this. Consider target identification: given a disease phenotype, which proteins should you try to modulate? This is not a prediction problem with a clear output. It is a causal inference problem with incomplete mechanistic knowledge, confounded observational data, and no ground truth: you only discover whether you were right years later, in humans.

Or consider ADMET prediction (absorption, distribution, metabolism, excretion, toxicity), and toxicity in particular: predicting from structure alone whether a molecule will be toxic in humans. Models can be trained on in vitro data. But in vitro toxicity does not reliably predict in vivo toxicity, which in turn does not reliably predict human toxicity. The prediction problem is technically tractable. The scientific problem is not.

Where AI genuinely adds value in drug discovery

There are domains where AI applications have delivered real, reproducible value:

Molecular property prediction: predicting physicochemical properties (solubility, logP, TPSA) from molecular structure. Well-defined inputs, measurable outputs, large training datasets. Graph neural networks (GNNs) and transformer-based molecular models work well here.
Hit identification and virtual screening: ranking large compound libraries against a known target structure. Faster and cheaper than wet-lab screening for known target classes.
Synthesis planning: predicting feasible synthetic routes for target molecules. Commercially deployed in Synthia (formerly Chematica, now owned by Merck KGaA) and other tools.
Image-based phenotypic screening: analyzing high-content microscopy data to identify compounds that produce desired cell morphology changes. Computer vision on well-controlled experimental data.
Biomarker discovery: identifying molecular signatures associated with disease subtypes or treatment response in omics data. Primarily a pattern recognition task.
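What "measurable outputs" means in practice can be illustrated for virtual screening with the enrichment factor: how much more often actives turn up in the top-ranked slice of a library than in the library as a whole. The sketch below assumes that higher scores mean "more likely active" and that experimental activity labels are available. The function name, scores, and labels are hypothetical.

```python
def enrichment_factor(scores, is_active, top_fraction=0.01):
    """Enrichment factor of a ranked screen.

    scores: model scores, higher meaning "predicted more likely active" (assumption).
    is_active: experimental activity labels (booleans), same order as scores.
    """
    if len(scores) != len(is_active) or not scores:
        raise ValueError("need two non-empty lists of equal length")
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    total_actives = sum(is_active)
    if total_actives == 0:
        raise ValueError("enrichment is undefined without any actives")

    # Size of the top-ranked slice, at least one compound.
    n_top = max(1, round(len(scores) * top_fraction))
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    actives_in_top = sum(active for _, active in ranked[:n_top])

    hit_rate_top = actives_in_top / n_top
    hit_rate_overall = total_actives / len(scores)
    return hit_rate_top / hit_rate_overall


# Toy library of 10 compounds with 2 actives (made-up numbers).
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
toy_labels = [True, False, False, False, True, False, False, False, False, False]

ef = enrichment_factor(toy_scores, toy_labels, top_fraction=0.2)
print(round(ef, 2))  # 2.5: the top 20% is 2.5 times richer in actives than the full library
```

An enrichment factor of 1 means the ranking is no better than picking compounds at random.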

Where AI is being oversold

The claims that are being stretched beyond what the evidence supports:

De novo drug design: generative models that design novel drug candidates from scratch have produced many molecules that look good on paper. Experimental hit rates when synthesized and tested are not obviously better than traditional methods. The models are better at generating drug-like molecules than at generating drugs.
End-to-end pipeline automation: the idea that AI can take you from target to clinical candidate with minimal human intervention. The bottleneck has never been molecule generation: it has been mechanistic understanding, experimental validation, and ADMET. AI has not solved these.
Reducing clinical failure rates: roughly 90% of drug candidates that enter clinical trials fail. AI-designed candidates are now entering trials, and we will know in 5 to 10 years whether failure rates have changed. The confidence currently expressed in press releases is not supported by clinical outcomes data.

The data quality problem nobody talks about

Every ML application in drug discovery is constrained by the quality of its training data. Bioactivity data in public databases (ChEMBL, BindingDB) contains systematic errors: assay variability, inconsistent units, duplicate measurements with contradictory values, reporting bias (positive results are overrepresented).
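A minimal curation sketch shows what handling two of these issues can look like: converting IC50 values reported in different units to a common pIC50 scale, then flagging compound-target pairs whose repeated measurements disagree. The record layout, the unit table, and the 1 log unit disagreement threshold are assumptions chosen for illustration, not the schema or policy of ChEMBL or BindingDB.

```python
import math
from collections import defaultdict

# Conversion factors to nanomolar for the unit spellings used in this toy example.
TO_NANOMOLAR = {"nM": 1.0, "uM": 1e3, "mM": 1e6, "M": 1e9}


def to_pic50(value, unit):
    """Convert an IC50 to pIC50 = -log10(IC50 in mol/L)."""
    nanomolar = value * TO_NANOMOLAR[unit]
    return 9.0 - math.log10(nanomolar)


def flag_contradictory(records, max_spread=1.0):
    """Split (compound, target) pairs into consistent and contradictory groups.

    records: iterable of (compound_id, target_id, ic50_value, unit).
    max_spread: largest tolerated pIC50 range within a pair (assumed threshold).
    Returns (consistent, contradictory): mean pIC50 per consistent pair,
    and the raw pIC50 values for each contradictory pair.
    """
    groups = defaultdict(list)
    for compound, target, value, unit in records:
        groups[(compound, target)].append(to_pic50(value, unit))

    consistent, contradictory = {}, {}
    for key, values in groups.items():
        if max(values) - min(values) > max_spread:
            contradictory[key] = values
        else:
            consistent[key] = sum(values) / len(values)
    return consistent, contradictory


# Made-up records: cpd-1 is the same measurement in two units,
# cpd-2 has two reports that differ by a factor of 1000.
toy_records = [
    ("cpd-1", "kinase-A", 50.0, "nM"),
    ("cpd-1", "kinase-A", 0.05, "uM"),
    ("cpd-2", "kinase-A", 10.0, "nM"),
    ("cpd-2", "kinase-A", 10.0, "uM"),
]

consistent, contradictory = flag_contradictory(toy_records)
print(sorted(consistent))      # pairs kept, with averaged pIC50
print(sorted(contradictory))   # pairs that need a human to look at the source assays
```

Flagging rather than silently averaging is a deliberate choice here: a 1000-fold disagreement usually points to a unit error or a different assay, and which one it is cannot be decided from the numbers alone.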

Models trained on this data learn patterns that partially reflect biology and partially reflect assay artifacts. The validation benchmarks used in academic papers are often drawn from the same contaminated data as the training set. A model that achieves 90% accuracy on such a benchmark is not necessarily 90% accurate on prospective predictions.
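One common way to get an evaluation closer to prospective use is a time-based split: train on what was known before a cutoff date, test only on compounds first reported afterwards. The sketch below is a simplified version under assumptions: records are hypothetical (compound_id, year, label) tuples, and a compound counts as "seen" if it appears anywhere in the training period. It reduces train/test leakage but does nothing about assay artifacts shared by both sides.

```python
def temporal_split(records, cutoff_year):
    """Split records into a training set and a more prospective-like test set.

    records: list of (compound_id, year, label) tuples (hypothetical layout).
    Training set: every record reported before cutoff_year.
    Test set: later records whose compound never appears in the training set.
    """
    train = [r for r in records if r[1] < cutoff_year]
    seen_compounds = {r[0] for r in train}
    test = [
        r for r in records
        if r[1] >= cutoff_year and r[0] not in seen_compounds
    ]
    return train, test


# Made-up records: cpd-1 is re-measured in 2019, after the cutoff.
toy_history = [
    ("cpd-1", 2015, 1),
    ("cpd-2", 2016, 0),
    ("cpd-1", 2019, 1),  # dropped from the test set: already seen in training
    ("cpd-3", 2020, 0),
]

train_set, test_set = temporal_split(toy_history, cutoff_year=2018)
print(len(train_set), len(test_set))  # 2 1
```

With a random split, the 2019 re-measurement of cpd-1 could land in the test set and reward the model for memorizing a compound it had already seen.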

What this means for AI drug discovery teams

The useful framing is not "can AI discover drugs?" but "which specific tasks in drug discovery are well-defined prediction problems with high-quality training data and unambiguous benchmarks?" Those tasks, and there are genuine ones, are where AI tools should be applied.

For everything else, AI provides useful signals that need to be interpreted by scientists with mechanistic domain knowledge. The value is in augmentation, not replacement. The scientist's job does not disappear. It shifts toward better experimental design, better data curation, and better interpretation of model outputs.

A more honest framing

AI in drug discovery is a set of powerful pattern-recognition tools being applied to a domain where most of the hard problems are not pattern-recognition problems. The tools are genuinely useful for the subset of problems they fit. The error is claiming they fit more of the problem than they do.

The honest version of the AI drug discovery story is this: we can now do certain computational tasks faster and more cheaply than before. That reduces some bottlenecks. It does not change the fundamental difficulty of predicting clinical efficacy and safety in humans.

Key takeaway

AI adds real value in drug discovery for well-defined prediction tasks with good training data. It does not change the fundamental challenge: predicting which molecules will be safe and effective in humans is a scientific problem, not a data problem. Until we have better mechanistic models, better data, and outcomes from AI-designed drugs in clinical trials, the hype is running ahead of the evidence.

Aslane Mortreau

Freelance Data & AI specialist working with pharmaceutical, biotech, and cosmetic R&D teams. Statistical modeling, analytical pipelines, and custom applications.

