AI Insight
SpikeProphecy introduces a large-scale benchmarking framework for evaluating autoregressive, causal models that predict the simultaneous spiking activity of neural populations recorded via electrophysiology. Rather than relying on a single Pearson correlation coefficient, the authors propose a decomposed metric system that separately quantifies temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing structure that aggregate scores obscure. Applied across 105 Neuropixels sessions encompassing approximately 89,800 neurons and seven diverse model architectures, the benchmark uncovers a consistent brain-region predictability hierarchy, a biophysically grounded evaluation floor tied to sub-Poisson spike regularity, and a negative result showing that KL-divergence-based distillation is ineffective for transferring knowledge from artificial to spiking neural networks in this domain.
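To make the idea of a decomposed score concrete, here is a minimal NumPy sketch. The summary does not give the paper's formulas, so the three definitions are assumptions chosen for illustration: temporal fidelity is taken as per-neuron Pearson correlation across time, spatial pattern accuracy as per-time-bin Pearson correlation across neurons, and magnitude-invariant alignment as per-time-bin cosine similarity. The function names `decompose` and `_pearson_rows` are hypothetical.

```python
import numpy as np

def _pearson_rows(a, b):
    """Pearson r between matching rows of a and b (NaN where a row is constant)."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    denom = np.sqrt((a * a).sum(axis=1) * (b * b).sum(axis=1))
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(denom > 0, (a * b).sum(axis=1) / denom, np.nan)

def decompose(pred, true):
    """Illustrative decomposition; pred and true have shape (time_bins, neurons).

    ASSUMPTION: these definitions are stand-ins, not the paper's metrics.
    """
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)

    # Temporal: for each neuron, correlate its predicted and actual time course.
    temporal = np.nanmean(_pearson_rows(pred.T, true.T))

    # Spatial: for each time bin, correlate the predicted and actual population pattern.
    spatial = np.nanmean(_pearson_rows(pred, true))

    # Magnitude-invariant: cosine similarity of population vectors per time bin.
    norms = np.linalg.norm(pred, axis=1) * np.linalg.norm(true, axis=1)
    with np.errstate(invalid="ignore", divide="ignore"):
        cosine = np.nanmean(np.where(norms > 0, (pred * true).sum(axis=1) / norms, np.nan))

    # The single aggregate number that the decomposition is meant to unpack.
    aggregate = np.corrcoef(pred.ravel(), true.ravel())[0, 1]
    return {"aggregate": aggregate, "temporal": temporal,
            "spatial": spatial, "cosine": cosine}

# Toy case: every neuron's time course is right, but the per-neuron gains are reversed.
time_profile = np.array([1.0, 2.0, 4.0, 3.0])
true = np.outer(time_profile, [1.0, 2.0, 3.0, 4.0])
pred = np.outer(time_profile, [4.0, 3.0, 2.0, 1.0])
print(decompose(pred, true))
```

In this toy case the temporal score is approximately 1 while the spatial score is approximately -1, the kind of disagreement a single pooled correlation would blend into one number.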
Why it matters
Standardized, decomposed evaluation metrics for neural population forecasting could accelerate progress in brain-computer interfaces and computational neuroscience by enabling more meaningful model comparisons and exposing failure modes that single-number summaries conceal. The public benchmark may also serve as a community resource for developing more biologically faithful neural models.
arXiv:2605.12992v1 Announce Type: new
Abstract: Neural population models, which predict the joint firing of many simultaneously recorded neurons forward in time, are typically evaluated by a single aggregate Pearson correlation $r$ between predicted and actual spike counts, a number that masks critical structure. We argue that how we evaluate spike forecasting matters as much as what we build, and introduce SpikeProphecy, the first large-scale benchmark for causal, autoregressive spike-count forecasting on real electrophysiology recordings. Our core contribution is a population metric decomposition that separates aggregate performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment. The decomposition surfaces aspects of the underlying data that an aggregate scalar collapses together. We apply the protocol to 105 Neuropixels sessions (Steinmetz 2019 + IBL Repeated Site; ~89,800 neurons) with seven architecture baselines spanning four structural families: four SSMs (three diagonal and one non-diagonal), a Transformer, an LSTM, and a spiking network. The decomposition surfaces a brain-region predictability ranking that reproduces across all seven baselines and survives ANCOVA correction for firing-statistics constraints (region $\Delta R^2 = 0.018$ above the firing-statistics covariates). It also exposes a sub-Poisson evaluation floor where rigorous metrics combine with genuine biophysical constraints on regular spike trains, and yields a negative result on KL-on-output-rates distillation for ANN-to-SNN transfer in this Poisson count domain.
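The term "sub-Poisson" can be illustrated with the Fano factor (variance of the spike count divided by its mean), a standard diagnostic that equals 1 for a Poisson process and falls below 1 for more regular spike trains. The sketch below is only an illustration of that concept: the abstract does not say how the evaluation floor is computed, and the function name `fano_factor` and the simulated data are assumptions.

```python
import numpy as np

def fano_factor(counts):
    """Per-neuron Fano factor for spike counts of shape (time_bins, neurons).

    Values near 1 are Poisson-like; values below 1 indicate sub-Poisson
    (more regular) firing. Neurons with zero mean count return NaN.
    """
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)  # unbiased sample variance
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(mean > 0, var / mean, np.nan)

# Simulated data (ASSUMPTION: purely illustrative, not from the benchmark).
rng = np.random.default_rng(0)
poisson_counts = rng.poisson(lam=5.0, size=(20000, 1))  # irregular, Poisson-like neuron
regular_counts = np.tile([2, 3], 50)[:, None]           # highly regular neuron

print("Poisson-like:", fano_factor(poisson_counts))
print("Regular:     ", fano_factor(regular_counts))
```

The simulated Poisson neuron gives a Fano factor close to 1, while the alternating 2, 3 count sequence gives a value well below 1, which is what sub-Poisson means here.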
Source: SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting