StateXDiff: Cell State-Contextualized Multimodal Diffusion for Single-Cell Perturbation Prediction

arXiv 18 May 2026 2 min read

AI Insight

StateXDiff is a computational framework that predicts how individual cells respond to drug treatments by combining gene expression data (transcriptomics) with inferred protein-level features in a diffusion-based generative model. It works in two stages: it first learns a multimodal representation of cell state, then uses a conditional diffusion model to generate the perturbation-specific changes. The system introduces two key components: a Virtual Multimodal Cell State that enriches RNA-based cellular representations with protein context, and a Mechanism-aware Drug-Gene Template that integrates multi-source biological knowledge to better represent drug actions. According to the abstract, StateXDiff improves generalization in three out-of-distribution settings: unseen cell lines, unseen drugs, and drug combinations.
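To make the three out-of-distribution settings concrete, here is a minimal sketch of how such train/test splits could be built. It is an illustration only, not the authors' evaluation protocol: the `Sample` record, the function names, and the drug names (`drugA`, `drugB`) are hypothetical, and the cell line labels are placeholders.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    """Hypothetical record: one perturbed cell profile (expression data omitted)."""
    cell_line: str
    drugs: tuple  # one drug name, or several for a combination


def split_unseen_cell_lines(samples, held_out):
    """Test set holds every sample from the held-out cell lines."""
    train = [s for s in samples if s.cell_line not in held_out]
    test = [s for s in samples if s.cell_line in held_out]
    return train, test


def split_unseen_drugs(samples, held_out):
    """Test set holds every sample that involves a held-out drug."""
    train = [s for s in samples if not set(s.drugs) & set(held_out)]
    test = [s for s in samples if set(s.drugs) & set(held_out)]
    return train, test


def split_combinations(samples):
    """Train on single-drug samples, test on multi-drug combinations."""
    train = [s for s in samples if len(s.drugs) == 1]
    test = [s for s in samples if len(s.drugs) > 1]
    return train, test


# Placeholder data for illustration only; not taken from the paper.
samples = [
    Sample("A549", ("drugA",)),
    Sample("A549", ("drugB",)),
    Sample("K562", ("drugA",)),
    Sample("K562", ("drugA", "drugB")),
]

cell_train, cell_test = split_unseen_cell_lines(samples, {"K562"})
drug_train, drug_test = split_unseen_drugs(samples, {"drugB"})
combo_train, combo_test = split_combinations(samples)
```

In each case the test set contains a condition the model never saw during training, which is what distinguishes these settings from a random split of cells.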

Why it matters

Accurate prediction of single-cell drug responses could accelerate drug discovery and reduce costly experimental screening by enabling virtual testing of drug effects across diverse cell types. This is particularly relevant for personalized medicine, where understanding cell-state-specific drug responses is critical.

Confidence

5/10 Peer-reviewed Biology

arXiv:2605.16104v1
Abstract: Predicting drug-induced cellular state changes at single-cell resolution remains a central challenge in virtual cell modeling, particularly under out-of-distribution (OOD) conditions. Current approaches predominantly rely on RNA-based assays, which often fail to adequately capture the diverse cellular states underlying drug responses. Moreover, conditional distribution shifts and low signal-to-noise ratios frequently cause models to learn spurious correlations rather than genuine state transitions. To address these limitations, we introduce StateXDiff, a cell State-contextualized multimodal (X) Diffusion framework for predicting single-cell responses to drug perturbations. The framework operates sequentially: first, it learns a disentangled, multimodal representation of cellular state by integrating transcriptomic profiles with inferred protein features; second, it employs a conditional diffusion model to generate perturbation-specific changes. Our approach introduces a Virtual Multimodal Cell State, which augments RNA-based representations with protein-level context, and a Mechanism-aware Drug-Gene Template, which consolidates multi-source biological knowledge for accurate drug representation. Generation is driven by a latent-space diffusion Transformer, regularized through quality-aware triplet constraints, including positive drug-protein pairs or protein-drug mismatched pairs, and explicit protein-reliability weighting. Extensive evaluation demonstrates that StateXDiff consistently enhances generalization performance across three challenging settings: unseen cell lines, unseen drugs, and combinatorial perturbations.
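The abstract mentions triplet constraints and protein-reliability weighting but does not give their exact form. As a rough sketch under stated assumptions, the code below shows a standard triplet margin loss in which each triplet is scaled by a reliability weight. The choice of anchor, positive, and negative (for example a drug embedding, a matched protein embedding, and a mismatched one), the Euclidean distance, the margin, and the function names are all assumptions made for illustration, not the paper's formulation.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def weighted_triplet_loss(anchors, positives, negatives, reliability, margin=1.0):
    """Reliability-weighted triplet margin loss (illustrative, assumed form).

    For each triplet i the hinge term is
        max(0, d(anchor_i, positive_i) - d(anchor_i, negative_i) + margin)
    and the result is the weighted average of these terms, so triplets
    with a low reliability weight contribute less to the loss.
    """
    total = 0.0
    weight_sum = 0.0
    for a, p, n, w in zip(anchors, positives, negatives, reliability):
        hinge = max(0.0, euclidean(a, p) - euclidean(a, n) + margin)
        total += w * hinge
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else 0.0


# Two toy triplets in 2-D (made-up numbers, not real embeddings).
anchors = [[0.0, 0.0], [0.0, 0.0]]
positives = [[0.0, 0.0], [3.0, 4.0]]   # second positive is far from its anchor
negatives = [[3.0, 4.0], [0.0, 0.0]]   # second negative coincides with its anchor
reliability = [1.0, 0.5]               # second triplet is trusted less

loss = weighted_triplet_loss(anchors, positives, negatives, reliability)
```

The first triplet already satisfies the margin and contributes nothing. The second violates it, but its lower weight halves its contribution, which is the intuition behind down-weighting triplets built from less reliable inferred protein features.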

Source: StateXDiff: Cell State-Contextualized Multimodal Diffusion for Single-Cell Perturbation Prediction
