AI & Computational Science

AI System Learns to Explain Its Own Decisions in Plain Language

AI Insight

This paper introduces eXTC (eXplainable Text Classifier), an approach to text classification that combines natural-language rule learning, knowledge distillation, and reinforcement learning in three progressive stages. The system first learns a human-readable "Standard Operating Procedure" (SOP) through Structured Prompt Optimization, then distills SOP-grounded reasoning from a large teacher LLM into a compact language model, and finally extends that model's reasoning beyond the initial SOP through reinforcement learning. The authors report that eXTC outperforms existing methods across diverse benchmarks while providing both local explanations for individual predictions and global interpretability through its learned rulebook.


This work addresses a key limitation of current LLM-based text classifiers: they rarely expose the reasoning behind a prediction. eXTC aims to provide that transparency without sacrificing performance, which could support more trustworthy deployment of language models in high-stakes domains such as healthcare, legal analysis, and content moderation, where understanding model decisions is essential for accountability and regulatory compliance.


arXiv:2605.29076v2
Abstract: LLMs have advanced text classification, yet existing paradigms face a trade-off: supervised (label only) fine-tuning is scalable but offers limited reasoning on complex text and lacks broader model transparency, while discrete prompt optimization offers human-readable instructions but struggles with performance and scalability. We introduce eXTC (eXplainable Text Classifier) with three progressive stages: (1) learning a Standard Operating Procedure (SOP, or rulebook) in natural language via a new Structured Prompt Optimization algorithm; (2) SOP-grounded reasoning distillation from a large teacher LLM into a compact LM; and (3) expanding reasoning capabilities beyond the initial SOP via reinforcement learning. This design enables eXTC to provide (i) fast inference via a compact LM, with (ii) inference-time local reasoning traces, alongside a global, modular explanation of its learned domain rules, while (iii) significantly outperforming existing paradigms across diverse benchmarks in both classification performance and explanation quality, with stage-by-stage gains.
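
To make the distinction between global and local explanations concrete, here is a toy Python sketch. It is not the paper's method: in eXTC a language model reads and reasons over the SOP, whereas this sketch swaps in simple keyword matching as a stand-in. The `Rule` class, the `classify` and `global_explanation` functions, and the example rules are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    """One SOP entry: a plain-language rule plus a toy keyword trigger.

    The keyword trigger is an assumption made for this sketch; the paper's
    system relies on a language model to apply rules, not keyword matching.
    """
    text: str
    keywords: Tuple[str, ...]
    label: str


def classify(document: str, sop: List[Rule], default: str = "other") -> Tuple[str, List[str]]:
    """Return (label, trace). The trace is the local explanation for this document."""
    lowered = document.lower()
    # Collect every rule whose keywords appear in the document (substring match).
    fired = [rule for rule in sop if any(k in lowered for k in rule.keywords)]
    if not fired:
        return default, ["No SOP rule applied; fell back to the default label."]
    trace = [f"Rule matched: {rule.text} -> {rule.label}" for rule in fired]
    # The first matching rule in SOP order decides the label.
    return fired[0].label, trace


def global_explanation(sop: List[Rule]) -> str:
    """Render the whole rulebook as numbered text: the global explanation."""
    return "\n".join(f"{i}. {rule.text}" for i, rule in enumerate(sop, 1))


# Hypothetical two-rule SOP for a support-ticket classifier.
SOP = [
    Rule("If the text requests a refund or disputes a charge, label it 'billing'.",
         ("refund", "charge"), "billing"),
    Rule("If the text reports a crash or an error, label it 'bug'.",
         ("crash", "error"), "bug"),
]

label, trace = classify("The app crashed after the update.", SOP)
print(label)
print("\n".join(trace))
print(global_explanation(SOP))
```

The rulebook printed by `global_explanation` is the same for every input, while the trace returned by `classify` is specific to one document. That mirrors, in a very reduced form, the global versus local interpretability the abstract describes.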

Source: Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text