OPRD: On-Policy Representation Distillation

arXiv AI, 5 Jun 2026

AI Insight

This paper introduces On-Policy Representation Distillation (OPRD), a method for training smaller AI language models by aligning their internal hidden states with those of larger teacher models, rather than only matching output token probabilities. According to the authors' theoretical analysis, the approach eliminates the sampling variance inherent in output-space on-policy distillation and provides richer structural information by supervising multiple layers simultaneously. Experiments show OPRD closes the performance gap between student and teacher models on the AIME 2024/2025 and AIMO mathematical reasoning benchmarks while training 1.44x faster and using 54% less memory than the top-k on-policy distillation baseline.
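To make the sampling-variance point concrete, here is a minimal sketch of a Monte Carlo KL estimate in output space. It is not taken from the paper: the toy vocabulary size, the random logits, and the helper names `softmax`, `exact_kl`, and `sampled_kl` are all assumptions chosen for illustration. It only shows that a KL estimate built from sampled tokens fluctuates around the exact value.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def exact_kl(p, q):
    """Exact KL(p || q), summing over the whole vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sampled_kl(p, q, num_samples, rng):
    """Monte Carlo estimate of KL(p || q): average log(p/q) over tokens drawn from p."""
    tokens = rng.choices(range(len(p)), weights=p, k=num_samples)
    return sum(math.log(p[t] / q[t]) for t in tokens) / num_samples

rng = random.Random(0)
vocab_size = 1000  # toy size (assumption); real vocabularies are far larger
student = softmax([rng.gauss(0, 1) for _ in range(vocab_size)])
teacher = softmax([rng.gauss(0, 1) for _ in range(vocab_size)])

kl = exact_kl(student, teacher)
# One sampled token per estimate, as in an on-policy rollout (assumption).
estimates = [sampled_kl(student, teacher, 1, rng) for _ in range(2000)]
mean = sum(estimates) / len(estimates)
var = sum((e - mean) ** 2 for e in estimates) / len(estimates)

print(f"exact KL: {kl:.3f}")
print(f"mean of sampled estimates: {mean:.3f}, variance: {var:.3f}")
```

The estimates average out close to the exact KL, but each individual estimate is noisy. A loss computed directly on hidden states involves no token sampling, so it does not carry this source of noise.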

Why it matters

This technique could enable more efficient deployment of large language models by creating smaller, faster versions that maintain comparable performance. The memory and speed improvements make advanced AI capabilities more accessible for resource-constrained applications while preserving reasoning abilities on complex mathematical tasks.

Confidence

6/10 Peer-reviewed AI & Computational Science

arXiv:2606.06021v1 Announce Type: cross
Abstract: On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen’s ~150k tokens) persists throughout training, and (2) it treats the teacher as a black-box, discarding all intermediate hidden states after the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, while output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.
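The idea of aligning student and teacher hidden states across selected layers can be sketched as follows. This is a hedged illustration, not the authors' implementation: the abstract does not state the loss function, the layer selection, or how differing hidden widths are handled, so the cosine distance, the linear projector, the layer pairing, and every function name below are assumptions.

```python
import math

def project(hidden, weight):
    """Map a student hidden vector to the teacher's width.
    `weight` is a (teacher_dim x student_dim) matrix; this projector is an assumption."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def cosine_distance(a, b):
    """1 - cosine similarity; 0 when the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-12)

def representation_loss(student_states, teacher_states, projectors, layer_pairs):
    """Average distance between projected student states and teacher states.

    student_states[l] and teacher_states[l] are lists of hidden vectors,
    one per token position of the same rollout, for layer index l.
    layer_pairs lists (student_layer, teacher_layer) tuples to align.
    """
    total, count = 0.0, 0
    for s_layer, t_layer in layer_pairs:
        weight = projectors[s_layer]
        for h_s, h_t in zip(student_states[s_layer], teacher_states[t_layer]):
            total += cosine_distance(project(h_s, weight), h_t)
            count += 1
    return total / count

# Toy rollout with two token positions and hidden width 2 (all values made up).
identity = [[1.0, 0.0], [0.0, 1.0]]
teacher_states = {4: [[1.0, 2.0], [0.5, -1.0]]}
student_states = {2: [[2.0, 4.0], [-0.5, 1.0]]}  # first aligned, second opposite
loss = representation_loss(student_states, teacher_states, {2: identity}, [(2, 4)])
print(loss)
```

The loss is a deterministic function of the hidden states on a given rollout: no vocabulary-sized distribution is built and no token is sampled to compute it, which matches the abstract's description of bypassing the LM head.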

Source: OPRD: On-Policy Representation Distillation
