AI Insight
This research addresses a key limitation of Multimodal Large Language Models (MLLMs): they often miss fine-grained visual details in a full image even though they can recognize the same details in a cropped region. The authors introduce Vision-OPD, a self-distillation training method in which the model, conditioned on the full image, learns from its own more accurate predictions on evidence-centered crops. The method needs no external teacher model, ground-truth labels, or reward verifier, and no tool use at inference time. On fine-grained visual understanding benchmarks, Vision-OPD models perform competitively with or better than much larger models.
Why it matters
This approach could improve the reliability of AI systems that must extract precise information from images, such as medical diagnosis tools, quality control systems, or visual question-answering applications. It is practical to deploy because the gains come from training alone: the model answers from the full image without zooming tools or an external model at inference time.
arXiv:2605.18740v4
Abstract: Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty focusing on relevant evidence rather than from insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model’s own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes token-level divergence between the teacher and student next-token distributions along these rollouts. This enables the model to internalize the benefit of visual zooming without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use. Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models achieve competitive or superior performance against much larger open-source, closed-source, and “Thinking-with-Images” agentic models. The code is available at https://github.com/VisionOPD/Vision-OPD
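The token-level objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes the divergence is the reverse KL, KL(student || teacher), averaged over the tokens of one student rollout; the abstract only says "token-level divergence" and does not name the direction. The function names and toy logits are hypothetical, and in real training the teacher distribution would presumably be held fixed while gradients flow through the student.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_divergence(student_logits, teacher_logits):
    """Reverse KL, KL(student || teacher), for one next-token distribution.

    Assumption: the abstract does not say which divergence is used.
    """
    log_p = log_softmax(student_logits)  # full-image-conditioned student
    log_q = log_softmax(teacher_logits)  # crop-conditioned teacher
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def rollout_distillation_loss(student_steps, teacher_steps):
    """Mean per-token divergence along one student-generated rollout.

    Both arguments hold one logit vector per generated token. The teacher
    logits are computed on the same token prefix that the student sampled,
    but with the evidence-centered crop as visual input (hypothetical setup).
    """
    if not student_steps or len(student_steps) != len(teacher_steps):
        raise ValueError("rollouts must be non-empty and of equal length")
    per_token = [
        token_divergence(s, t) for s, t in zip(student_steps, teacher_steps)
    ]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    # Toy rollout: 2 tokens, vocabulary of 3. The student is uncertain at
    # the first token, while the crop-conditioned teacher is confident.
    student = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
    teacher = [[3.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
    print(rollout_distillation_loss(student, teacher))
```

The loss is zero when the student's next-token distributions already match the teacher's at every position of the rollout, and positive otherwise, so minimizing it pulls the full-image policy toward the crop-conditioned one on the sequences the student actually generates.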
Source: Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation