WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation

arXiv Machine Learning | 16 Jun 2026 | 2 min read

AI Insight

WavSLM is a speech language model that represents speech as a single stream of tokens. It quantizes and distills self-supervised WavLM representations into one codebook and trains with an autoregressive next-chunk prediction objective. Unlike existing approaches that rely on text supervision, hierarchical token streams, or hybrid architectures, WavSLM models the semantic and acoustic content of speech jointly in that one stream. The authors report competitive performance on speech generation and consistency benchmarks while using fewer parameters and less training data than comparable systems, and the model supports streaming inference.

Why it matters

This work suggests that speech modeling can follow the same single-stream autoregressive recipe that succeeded in text-based large language models, without the text supervision or multi-stream architectures that most speech language models depend on. The smaller parameter count and training-data requirement, together with streaming inference, could make speech generation more accessible and support low-latency applications such as interactive voice assistants.

Confidence

6/10 | Peer-reviewed | AI & Computational Science

arXiv:2603.05299v2
Abstract: Large language models show that simple autoregressive training can yield scalable and coherent generation, but extending this paradigm to speech remains challenging due to the entanglement of semantic and acoustic information. Most existing speech language models rely on text supervision, hierarchical token streams, or complex hybrid architectures, departing from the single-stream generative pretraining paradigm that has proven effective in text. In this work, we introduce WavSLM, a speech language model trained by quantizing and distilling self-supervised WavLM representations into a single codebook and optimizing an autoregressive next-chunk prediction objective. WavSLM jointly models semantic and acoustic information within a single token stream without text supervision or text pretraining. Despite its simplicity, it achieves competitive performance on consistency benchmarks and speech generation while using fewer parameters and less training data, and it supports streaming inference.
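
To make the two ideas in the abstract concrete, here is a toy sketch of (1) mapping continuous feature frames to indices in a single codebook and (2) building context/target pairs for next-chunk prediction. Everything in it is an illustrative assumption rather than the paper's method: the nearest-neighbor quantizer, the 2-D "features", the 3-entry codebook, the chunk size of 2, and the helper names `quantize` and `next_chunk_pairs` are all hypothetical, and the distillation step and the model itself are omitted.

```python
def quantize(frames, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    Nearest-neighbor lookup is an assumed stand-in for the quantizer;
    the abstract does not say how the codebook is learned or applied.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def next_chunk_pairs(tokens, chunk_size):
    """Split one token stream into fixed-size chunks and pair each context
    (all chunks seen so far) with the chunk that follows it.

    Trailing tokens that do not fill a whole chunk are dropped.
    """
    chunks = [
        tokens[i:i + chunk_size]
        for i in range(0, len(tokens) - chunk_size + 1, chunk_size)
    ]
    pairs = []
    for i in range(1, len(chunks)):
        context = [t for chunk in chunks[:i] for t in chunk]
        pairs.append((context, chunks[i]))
    return pairs


# Toy stand-ins for WavLM features and a learned codebook (both made up).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (-0.8, 1.9),
          (1.1, 0.8), (0.0, 0.2), (-1.2, 2.1)]

tokens = quantize(frames, codebook)           # one token per frame
pairs = next_chunk_pairs(tokens, chunk_size=2)

for context, target in pairs:
    print(context, "->", target)
```

In this sketch an autoregressive model would be trained to predict each target chunk from its context. Because the context only ever grows to the right, generation can proceed chunk by chunk, which is one plausible reading of how a single-stream model lends itself to streaming inference.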

Source: WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation