AI Insight
This benchmark study evaluated whether larger AI models (large language models and pretrained molecular sequence models) outperform smaller, specialized models in drug discovery tasks involving molecular property, toxicity, and biological activity prediction. Across 26 endpoints and 156 cross-validation comparisons using random, scaffold-based, and structure-separated splits, classical machine learning models such as Random Forest with ECFP4 fingerprints dominated overall, winning 116 of the comparisons, while LLM-based approaches won only three. The study concludes that predictive performance in molecular property prediction depends on the alignment between model type, task, and validation scenario rather than on model scale alone, though larger models show relative value for SAR interpretation in low-data settings.
Why it matters
These findings challenge the prevailing assumption that scaling up AI models automatically improves drug discovery workflows, suggesting that resource-intensive large models may be unnecessary for routine molecular property prediction where compact, specialized models remain highly competitive. This has direct implications for computational resource allocation and model selection strategies in pharmaceutical research and cheminformatics pipelines.
arXiv:2604.26498v2 Announce Type: replace-cross
Abstract: The rapid growth of molecular foundation models and large language models has encouraged a scale-centred view of AI in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and graph neural networks (GNNs) trained for individual tasks. We test this assumption across 26 endpoints for molecular properties, toxicity, safety liabilities and biological activity, grouped into ADME, toxicity and bioactivity classes. The benchmark contains 78 endpoint and split entries spanning random, Murcko scaffold and structure-separated 5-fold CV. Ordered from easiest to hardest, these splits approximate retrospective evaluation on a closed library, scaffold expansion in hit-to-lead, and library expansion on novel chemotypes. Each entry includes ML, GNN, pretrained molecular sequence and LLM-based SAR families. Across 156 fold-mean comparisons, classical ML such as RF(ECFP4) and ExtraTrees(RDKit) win 116, GNNs such as GIN and Ligandformer win 25, pretrained sequence models such as MoLFormer and ChemBERTa2 win 12, and LLM-based SAR baselines win three. ML dominates random-split interpolation but loses part of this advantage under harder splits; GNN and sequence models also decline but gain relative ground, whereas LLM-based SAR is weaker in absolute terms yet less sensitive to the split axis. Paired bootstrap analyses support family-level trends more strongly than individual model rankings. SAR knowledge derived from training folds improves many GPT5.5-SAR and Opus4.7-SAR metrics but does not make rule-based reasoning a universal substitute for supervised predictors. Compact specialized models remain highly effective for molecular property and activity prediction. Larger models add value for SAR interpretation and reasoning in low-data settings, but predictive performance depends on the fit among model, task and validation scenario, not on scale alone.
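The abstract mentions paired bootstrap analyses without describing the procedure. The sketch below shows one common form of a paired bootstrap on per-fold scores, using only the Python standard library. The function name `paired_bootstrap`, its parameters, and the example scores are all hypothetical illustrations; nothing here is taken from the paper's code or results, and the authors' actual procedure may differ.

```python
import random
from statistics import mean


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Paired bootstrap over matched scores (hypothetical helper).

    Resamples the per-pair differences (A minus B) with replacement and
    returns (observed mean difference, fraction of resamples in which the
    resampled mean difference is strictly positive).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have equal length")
    # Pairing matters: each difference compares both models on the same fold.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)
    wins = 0
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if mean(sample) > 0:
            wins += 1
    return mean(diffs), wins / n_resamples


# Illustrative per-fold scores for two models on the same five folds.
# These numbers are made up for the example and are not from the paper.
model_a = [0.81, 0.79, 0.84, 0.80, 0.83]
model_b = [0.78, 0.80, 0.81, 0.77, 0.82]
mean_diff, p_a_better = paired_bootstrap(model_a, model_b)
print(f"mean difference = {mean_diff:.3f}, P(A > B) = {p_a_better:.2f}")
```

With only five folds per entry, a resampled comparison between two individual models is noisy, which is consistent with the abstract's remark that family-level trends are better supported than individual model rankings.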