VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
- URL: http://arxiv.org/abs/2510.23497v2
- Date: Tue, 28 Oct 2025 11:09:37 GMT
- Title: VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
- Authors: Walid Bousselham, Hilde Kuehne, Cordelia Schmid
- Abstract summary: VOLD is a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. We show that VOLD significantly outperforms the baseline model and improves over the state of the art.
- Score: 67.98620973023709
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training vision-language models (VLMs) for complex reasoning remains a challenging task, in part due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leverage them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for an effective transfer during the online training phase in this scenario, and that without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD significantly outperforms the baseline model and improves over the state of the art. Our ablation shows the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.
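The abstract does not spell out the training objective, so the following is only a minimal sketch of the general idea it describes: the student VLM samples its own reasoning traces, a GRPO-style group-relative advantage is computed from per-rollout rewards, and a divergence term toward the text-only teacher is evaluated on those same student-sampled tokens so the teacher guides the traces. The reverse-KL choice, the mixing weight beta, the simplified policy surrogate (no clipping or reference-model KL), and the toy tensors standing in for model outputs are all illustrative assumptions, not details from the paper.

# Minimal sketch (not the authors' implementation) of combining a GRPO-style
# objective with on-policy distillation from a text-only teacher.
import torch
import torch.nn.functional as F


def grpo_with_onpolicy_distillation(student_logits, teacher_logits,
                                    sampled_ids, rewards, beta=0.5):
    """student_logits, teacher_logits: (G, T, V) vocabulary logits for G
    student rollouts of length T; sampled_ids: (G, T) tokens the student
    actually generated; rewards: (G,) scalar reward per rollout.
    beta is an assumed mixing weight between the two terms."""
    log_p_student = F.log_softmax(student_logits, dim=-1)   # (G, T, V)
    log_p_teacher = F.log_softmax(teacher_logits, dim=-1)   # (G, T, V)

    # Log-prob of the tokens the student actually sampled (on-policy traces).
    token_logp = log_p_student.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)  # (G, T)

    # GRPO-style group-relative advantage: normalize rewards within the group
    # of rollouts sampled for the same prompt.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # (G,)
    policy_loss = -(adv.unsqueeze(-1) * token_logp).mean()

    # On-policy distillation term: KL(student || teacher) evaluated at the
    # positions of the student's own rollouts, so the text-only teacher
    # guides the student's reasoning traces.
    kl = (log_p_student.exp() * (log_p_student - log_p_teacher)).sum(-1)  # (G, T)
    distill_loss = kl.mean()

    return policy_loss + beta * distill_loss


# Toy usage with random tensors standing in for model outputs.
G, T, V = 4, 16, 32   # rollouts per prompt, trace length, vocabulary size
student_logits = torch.randn(G, T, V, requires_grad=True)
teacher_logits = torch.randn(G, T, V)
sampled_ids = torch.randint(0, V, (G, T))
rewards = torch.rand(G)
loss = grpo_with_onpolicy_distillation(student_logits, teacher_logits, sampled_ids, rewards)
loss.backward()

In the setting the abstract describes, the teacher would be the text-only reasoning LLM scoring the student's traces on their text tokens, and the cold-start SFT alignment mentioned above would precede this online phase so that the teacher and student distributions are close enough for the distillation term to provide meaningful guidance.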
Related papers
- OVD: On-policy Verbal Distillation [47.727229201069555]
On-policy Verbal Distillation (OVD) is a memory-efficient framework that replaces token-level probability matching with trajectory matching. OVD dramatically reduces memory consumption while enabling on-policy distillation from teacher models with verbal feedback.
arXiv Detail & Related papers (2026-01-29T16:48:14Z) - Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models [44.041109669153506]
On-Policy Self-Distillation (OPSD) is a framework where a single model acts as both teacher and student by conditioning on different contexts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks.
arXiv Detail & Related papers (2026-01-26T17:56:50Z) - Online In-Context Distillation for Low-Resource Vision Language Models [16.3054668860198]
Small vision-language models (VLMs) are efficient but typically require costly fine-tuning to close the performance gap with larger models in the deployment domain. We propose an online In-Context Distillation (ICD) method, in which a small VLM collaborates with a stronger teacher model at inference time. Our method significantly boosts the performance of small models (up to 33%) using scarce teacher annotations.
arXiv Detail & Related papers (2025-10-20T21:35:17Z) - Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning [124.48672228625821]
We introduce Vlaser - a Vision-Language-Action Model with synergistic embodied reasoning capability. Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks. Our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.
arXiv Detail & Related papers (2025-10-13T05:51:22Z) - Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving [61.992824291296444]
Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs). This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework.
arXiv Detail & Related papers (2025-05-23T08:18:00Z) - From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning [82.50157695987558]
Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy. We propose an online reinforcement learning (RL)-based alignment framework that can quickly adapt LLMs into effective tutors.
arXiv Detail & Related papers (2025-05-21T15:00:07Z) - VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z) - Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification [35.277880733198586]
Vision-Language Models (VLMs) are trained on large amounts of image-text pairs, resulting in remarkable generalization across several data distributions.
We propose Vision-Language to Vision - Align, Distill, Predict (VL2V-ADiP), which first aligns the vision and language modalities of the teacher model with the vision modality of a pre-trained student model.
This maximally retains the pre-trained features of the student, while also incorporating the rich representations of the VLM image encoder and the superior generalization of the text embeddings.
arXiv Detail & Related papers (2023-10-12T11:59:54Z) - Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z) - Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding [6.798129852396113]
We introduce a simple and effective method to improve compositional reasoning in Vision-Language Models (VLMs).
Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework.
When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines.
arXiv Detail & Related papers (2023-06-15T03:26:28Z)