SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting

AI Insight

SpikeProphecy introduces a large-scale benchmarking framework for evaluating autoregressive, causal models that predict the simultaneous spiking activity of neural populations recorded via electrophysiology. Rather than relying on a single Pearson correlation coefficient, the authors propose a decomposed metric system that separately quantifies temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing structure that aggregate scores obscure. Applied across 105 Neuropixels sessions encompassing approximately 89,800 neurons and seven diverse model architectures, the benchmark uncovers a consistent brain-region predictability hierarchy, a biophysically grounded evaluation floor tied to sub-Poisson spike regularity, and a negative result showing that KL-divergence-based distillation is ineffective for transferring knowledge from artificial to spiking neural networks in this domain.
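As background for the "sub-Poisson" terminology, here is a minimal sketch of the Fano factor (variance-to-mean ratio of spike counts), a standard regularity measure that equals 1 for a Poisson process and falls below 1 for more regular spiking. It is an illustration, not the benchmark's own procedure: the function name `fano_factor` is hypothetical, and the binomial counts are only a convenient synthetic stand-in for a regular spiker.

```python
import numpy as np

def fano_factor(counts):
    """Per-neuron variance-to-mean ratio of spike counts.

    counts: array of shape (time_bins, neurons).
    Returns NaN for neurons that never fire.
    """
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(mean > 0, var / mean, np.nan)

rng = np.random.default_rng(0)

# Poisson counts: variance equals the mean, so the Fano factor is near 1.
poisson_counts = rng.poisson(5.0, size=(2000, 1))

# Synthetic sub-Poisson counts (an assumption for illustration):
# Binomial(n=10, p=0.5) has mean 5 and variance 2.5.
regular_counts = rng.binomial(10, 0.5, size=(2000, 1))

fano_poisson = fano_factor(poisson_counts)[0]
fano_regular = fano_factor(regular_counts)[0]
print(f"Poisson Fano factor: {fano_poisson:.2f}")
print(f"Sub-Poisson Fano factor: {fano_regular:.2f}")
```

Both synthetic neurons have the same mean count, so the difference in Fano factor reflects count regularity alone.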


Standardized, decomposed evaluation metrics for neural population forecasting could accelerate progress in brain-computer interfaces and computational neuroscience by enabling more meaningful model comparisons and exposing failure modes that single-number summaries conceal. The public benchmark may also serve as a community resource for developing more biologically faithful neural models.


arXiv:2605.12992v1 Announce Type: new
Abstract: Neural population models, which predict the joint firing of many simultaneously recorded neurons forward in time, are typically evaluated by a single aggregate Pearson correlation $r$ between predicted and actual spike counts, a number that masks critical structure. We argue that how we evaluate spike forecasting matters as much as what we build, and introduce SpikeProphecy, the first large-scale benchmark for causal, autoregressive spike-count forecasting on real electrophysiology recordings. Our core contribution is a population metric decomposition that separates aggregate performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment. The decomposition surfaces aspects of the underlying data that an aggregate scalar collapses together. We apply the protocol to 105 Neuropixels sessions (Steinmetz 2019 + IBL Repeated Site; ~89,800 neurons) with seven architecture baselines spanning four structural families: four SSMs (three diagonal and one non-diagonal), a Transformer, an LSTM, and a spiking network. The decomposition surfaces a brain-region predictability ranking that reproduces across all seven baselines and survives ANCOVA correction for firing-statistics constraints (region $\Delta R^2 = 0.018$ above the firing-statistics covariates). It also exposes a sub-Poisson evaluation floor where rigorous metrics combine with genuine biophysical constraints on regular spike trains, and yields a negative result on KL-on-output-rates distillation for ANN-to-SNN transfer in this Poisson count domain.
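
To illustrate why a single aggregate $r$ can mask structure, the sketch below computes one plausible three-way decomposition on synthetic data. The abstract does not give the benchmark's exact definitions, so everything here is an assumption: "temporal" is taken as the mean per-neuron Pearson correlation across time, "spatial" as the mean per-time-bin Pearson correlation across neurons, and "magnitude-invariant alignment" as cosine similarity. The names `decompose` and `_pearson_rows` are hypothetical.

```python
import numpy as np

def _pearson_rows(a, b):
    """Pearson r between matching rows of a and b (NaN if a row is constant)."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    denom = np.sqrt((a * a).sum(axis=1) * (b * b).sum(axis=1))
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom > 0, (a * b).sum(axis=1) / denom, np.nan)

def decompose(pred, true):
    """Assumed decomposition; pred and true have shape (time_bins, neurons)."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return {
        # One Pearson r over all (time, neuron) entries pooled together.
        "aggregate_r": np.corrcoef(pred.ravel(), true.ravel())[0, 1],
        # Per neuron, correlation across time, then averaged over neurons.
        "temporal_r": np.nanmean(_pearson_rows(pred.T, true.T)),
        # Per time bin, correlation across neurons, then averaged over bins.
        "spatial_r": np.nanmean(_pearson_rows(pred, true)),
        # Cosine similarity is unchanged by rescaling the prediction.
        "cosine": (pred * true).sum()
        / (np.linalg.norm(pred) * np.linalg.norm(true)),
    }

# Synthetic population: 40 neurons with different mean rates, 500 time bins.
rng = np.random.default_rng(0)
rates = np.linspace(1.0, 20.0, 40)
true = rng.poisson(rates, size=(500, 40))

# A trivial "model" that outputs each neuron's mean count plus tiny noise.
pred = true.mean(axis=0) + rng.normal(0.0, 0.01, size=true.shape)

scores = decompose(pred, true)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this toy setup the trivial predictor earns a high aggregate and spatial score simply because neurons differ in mean rate, while its temporal score stays near zero, which is the kind of failure a pooled correlation hides.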

Source: SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting