Million-Variant Benchmark Accelerates AI-Driven Protein Engineering Discovery

arXiv 3 Jun 2026 2 min read

AI Insight

Researchers have introduced TadA-Bench, a benchmark of one million protein variants drawn from 31 rounds of TadA directed evolution. It tests whether AI systems can predict which variants are worth testing in future laboratory experiments. The benchmark preserves the chronological order of the campaign: models see only data from earlier rounds and must rank variants that appear only in later rounds. Models perform well on randomly split data but are much weaker at ranking future experimental candidates. The dataset provides aligned DNA, RNA, and protein views and uses a graph-based pipeline, Seq2Graph, to reconcile noisy enrichment measurements into consistent activity labels across rounds.
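To make the difference between the two evaluation settings concrete, here is a minimal sketch of a chronological split next to a random split. It is not the benchmark's released code: the `Variant` record, its field names, and both helper functions are hypothetical and only illustrate the idea of training on earlier rounds and testing on variants that first appear later.

```python
import random
from typing import List, NamedTuple, Tuple


class Variant(NamedTuple):
    # Hypothetical record layout, assumed for illustration only.
    sequence: str    # protein sequence of the variant
    round_id: int    # experimental round in which it was measured
    activity: float  # unified activity label


def chronological_split(
    variants: List[Variant], cutoff_round: int
) -> Tuple[List[Variant], List[Variant]]:
    """Train on rounds <= cutoff_round; test on variants that appear only later."""
    train = [v for v in variants if v.round_id <= cutoff_round]
    seen = {v.sequence for v in train}
    # Drop later-round variants whose sequence was already measured earlier.
    test = [
        v for v in variants
        if v.round_id > cutoff_round and v.sequence not in seen
    ]
    return train, test


def random_split(
    variants: List[Variant], test_fraction: float, seed: int = 0
) -> Tuple[List[Variant], List[Variant]]:
    """Shuffle all rounds together, so test variants can resemble training ones."""
    rng = random.Random(seed)
    shuffled = list(variants)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy data (made up): six measurements across three rounds.
toy = [
    Variant("AAA", 1, 0.10),
    Variant("AAC", 1, 0.20),
    Variant("AAC", 2, 0.30),
    Variant("ACC", 2, 0.50),
    Variant("CCC", 3, 0.90),
    Variant("AAA", 3, 0.15),
]
train, test = chronological_split(toy, cutoff_round=2)
```

In the toy example only `CCC` lands in the test set, because `AAA` was already measured in round 1. A random split offers no such guarantee, which is one plausible reason scores on random splits look stronger than scores on future rounds.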

Why it matters

This benchmark addresses a critical gap in developing autonomous AI systems for protein engineering that can actively guide experimental design rather than just analyze existing data. It provides a reproducible framework for evaluating whether AI models can genuinely accelerate scientific discovery by prioritizing the most promising candidates for costly wet-lab experiments.

Confidence

6/10 · Peer-reviewed · Biology

arXiv:2606.02624v1
Abstract: AI for scientific discovery is entering an agentic era, where protein-engineering systems are expected to prioritize future wet-lab experiments rather than merely fit static measurements. We introduce TadA-Bench, a million-variant wet-lab replay benchmark from 31 TadA directed-evolution rounds for future-round discovery toward agentic protein engineering. TadA-Bench preserves the campaign chronology and defines a fixed-data replay task: given earlier experimental rounds, models rank variants that appear only in later rounds. It provides aligned DNA, RNA, and protein views, and uses Seq2Graph, a graph-based label-unification pipeline, to reconcile noisy enrichment measurements into consistent cross-round activity labels. Random-split controls show strong interpolation, but future-round ranking and finite-budget candidate selection are much weaker. Controlled analyses suggest that evolutionary coverage is more informative than local data density, positioning TadA-Bench as a reproducible wet-lab replay substrate for future-round discovery toward agentic protein engineering; the data and code are released on Hugging Face and GitHub.
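The abstract mentions finite-budget candidate selection without defining a metric here. One common way to score it is top-k recall: given a fixed number of wet-lab slots, what fraction of the truly best variants does the model pick? The sketch below is an assumption for illustration, not the paper's definition; the function name `top_k_recall` and the `top_fraction` parameter are hypothetical.

```python
from typing import Sequence


def top_k_recall(
    scores: Sequence[float],
    labels: Sequence[float],
    budget: int,
    top_fraction: float = 0.1,
) -> float:
    """Fraction of the truly top variants recovered within a fixed test budget.

    scores: model predictions, higher means more promising.
    labels: measured activities for the same variants.
    budget: number of variants the lab can afford to test.
    top_fraction: share of variants counted as "truly top" (assumed threshold).
    """
    n = len(scores)
    n_top = max(1, int(n * top_fraction))
    order_by_label = sorted(range(n), key=lambda i: labels[i], reverse=True)
    true_top = set(order_by_label[:n_top])
    # The model spends its budget on its highest-scoring candidates.
    chosen = sorted(range(n), key=lambda i: scores[i], reverse=True)[:budget]
    hits = sum(1 for i in chosen if i in true_top)
    return hits / n_top


# Toy example (made-up numbers): ten variants, three lab slots.
labels = [0.9, 0.1, 0.8, 0.2, 0.3, 0.7, 0.05, 0.4, 0.6, 0.5]
perfect = top_k_recall(labels, labels, budget=3, top_fraction=0.3)
inverted = top_k_recall([-x for x in labels], labels, budget=3, top_fraction=0.3)
```

A model can rank well on average and still score poorly here, because only the candidates at the very top of its ranking are ever tested.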

Source: TadA-Bench: A Million-Variant Benchmark for Future-Round Discovery Toward Agentic Protein Engineering