
VibeProteinBench: An Evaluation Benchmark for Language-interfaced Vibe Protein Design

AI Insight

VibeProteinBench is a newly proposed evaluation benchmark designed to assess the ability of large language models to perform generalist protein design tasks through natural language. The benchmark is structured around three sequential stages that mirror a computational protein design workflow: recognition, engineering, and generation. Evaluation of multiple general-purpose and domain-specialized language models reveals that no current model performs consistently well across all three stages, indicating that open-ended, language-driven protein design remains an unsolved challenge.


A standardized benchmark for language-interfaced protein design could accelerate the development of more capable AI tools for drug discovery, enzyme engineering, and synthetic biology by providing a unified and rigorous framework for measuring progress.


arXiv:2605.10978v2
Abstract: Protein design aims to compose amino-acid sequences that fold into stable three-dimensional structures while satisfying targeted functional properties. The field is increasingly shifting toward vibe protein design, where a single model is expected to generate novel sequences, engineer existing proteins, and reason about protein characteristics through flexible natural-language constraints. Large language models (LLMs) have emerged as a leading paradigm in this space. However, some existing evaluation benchmarks cover only a partial aspect of protein design, while others restrict design objectives to structured input schemas; the field lacks an integrated framework that evaluates the broad spectrum of protein design competence under open-ended intents. To this end, we present the Vibe Protein design Benchmark (VibeProteinBench), a language-interfaced benchmark that probes generalist capabilities through three complementary stages mirroring a computational protein design workflow: recognition, engineering, and generation. Each stage is grounded in expert-curated mechanistic rationales and multi-faceted in silico validation to computationally verify whether model outputs are biologically plausible. Evaluations across diverse general-purpose and domain-specialized LLMs reveal that no model achieves strong performance across all three stages, suggesting that generalist protein design remains a substantial open challenge for current LLMs.
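
One way to read the claim that no model is strong across all three stages is as a statement about each model's weakest stage. The sketch below is a minimal illustration of that reading, not the benchmark's actual scoring procedure: the `StageScores` container, the assumption that each stage yields a single score in [0, 1], and the 0.7 threshold are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

# The three stage names come from the abstract; everything else here is hypothetical.
STAGES = ("recognition", "engineering", "generation")


@dataclass(frozen=True)
class StageScores:
    """Hypothetical container: one aggregate score in [0, 1] per stage."""
    model: str
    scores: dict


def weakest_stage(entry):
    """Return (stage, score) for the stage on which the model scores lowest."""
    missing = [s for s in STAGES if s not in entry.scores]
    if missing:
        raise ValueError(f"{entry.model} has no score for: {missing}")
    stage = min(STAGES, key=lambda s: entry.scores[s])
    return stage, entry.scores[stage]


def strong_across_all(entry, threshold=0.7):
    """A model counts as a generalist only if its weakest stage clears the threshold.

    The default threshold is an arbitrary placeholder, not a value from the paper.
    """
    _, score = weakest_stage(entry)
    return score >= threshold


if __name__ == "__main__":
    # Made-up numbers for illustration only; these are not reported results.
    toy = StageScores("toy-model", {"recognition": 0.9, "engineering": 0.4, "generation": 0.8})
    print(weakest_stage(toy), strong_across_all(toy))
```

Using the minimum over stages rather than the mean keeps a model that excels at recognition but fails at engineering from appearing to be a capable generalist.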

Source: VibeProteinBench: An Evaluation Benchmark for Language-interfaced Vibe Protein Design