VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
- URL: http://arxiv.org/abs/2510.23497v2
- Date: Tue, 28 Oct 2025 11:09:37 GMT
- Title: VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
- Authors: Walid Bousselham, Hilde Kuehne, Cordelia Schmid
- Abstract summary: VOLD is a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. We show that VOLD significantly outperforms the baseline model and improves over the state of the art.
- Score: 67.98620973023709
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training vision-language models (VLMs) for complex reasoning remains a challenging task, in part due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leverage them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for an effective transfer during the online training phase in this scenario, and that without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD significantly outperforms the baseline model and improves over the state of the art. Our ablation shows the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.
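The abstract does not spell out the training objective, so the following is only a minimal sketch of the general idea it describes: the student VLM samples its own reasoning traces, a GRPO-style group-relative advantage is computed from per-rollout rewards, and a divergence term toward the text-only teacher is evaluated on those same student-sampled tokens so the teacher guides the traces. The reverse-KL choice, the mixing weight beta, the simplified policy surrogate (no clipping or reference-model KL), and the toy tensors standing in for model outputs are all illustrative assumptions, not details from the paper.

# Minimal sketch (not the authors' implementation) of combining a GRPO-style
# objective with on-policy distillation from a text-only teacher.
import torch
import torch.nn.functional as F


def grpo_with_onpolicy_distillation(student_logits, teacher_logits,
                                    sampled_ids, rewards, beta=0.5):
    """student_logits, teacher_logits: (G, T, V) vocabulary logits for G
    student rollouts of length T; sampled_ids: (G, T) tokens the student
    actually generated; rewards: (G,) scalar reward per rollout.
    beta is an assumed mixing weight between the two terms."""
    log_p_student = F.log_softmax(student_logits, dim=-1)   # (G, T, V)
    log_p_teacher = F.log_softmax(teacher_logits, dim=-1)   # (G, T, V)

    # Log-prob of the tokens the student actually sampled (on-policy traces).
    token_logp = log_p_student.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)  # (G, T)

    # GRPO-style group-relative advantage: normalize rewards within the group
    # of rollouts sampled for the same prompt.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # (G,)
    policy_loss = -(adv.unsqueeze(-1) * token_logp).mean()

    # On-policy distillation term: KL(student || teacher) evaluated at the
    # positions of the student's own rollouts, so the text-only teacher
    # guides the student's reasoning traces.
    kl = (log_p_student.exp() * (log_p_student - log_p_teacher)).sum(-1)  # (G, T)
    distill_loss = kl.mean()

    return policy_loss + beta * distill_loss


# Toy usage with random tensors standing in for model outputs.
G, T, V = 4, 16, 32   # rollouts per prompt, trace length, vocabulary size
student_logits = torch.randn(G, T, V, requires_grad=True)
teacher_logits = torch.randn(G, T, V)
sampled_ids = torch.randint(0, V, (G, T))
rewards = torch.rand(G)
loss = grpo_with_onpolicy_distillation(student_logits, teacher_logits, sampled_ids, rewards)
loss.backward()

In the setting the abstract describes, the teacher would be the text-only reasoning LLM scoring the student's traces on their text tokens, and the cold-start SFT alignment mentioned above would precede this online phase so that the teacher and student distributions are close enough for the distillation term to provide meaningful guidance.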
Related papers
- OVD: On-policy Verbal Distillation [47.727229201069555]
On-policy Verbal Distillation (OVD) is a memory-efficient framework that replaces token-level probability matching with trajectory matching. OVD dramatically reduces memory consumption while enabling on-policy distillation from teacher models with verbal feedback.
arXiv Detail & Related papers (2026-01-29T16:48:14Z) - Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models [44.041109669153506]
On-Policy Self-Distillation (OPSD) is a framework where a single model acts as both teacher and student by conditioning on different contexts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks.
arXiv Detail & Related papers (2026-01-26T17:56:50Z) - Online In-Context Distillation for Low-Resource Vision Language Models [16.3054668860198]
Small vision-language models (VLMs) are efficient but typically require costly fine-tuning to close the performance gap with larger models in the deployment domain. We propose an online In-Context Distillation (ICD) method, in which a small VLM collaborates with a stronger teacher model at inference time. Our method significantly boosts the performance of small models (up to 33%) using scarce teacher annotations.
arXiv Detail & Related papers (2025-10-20T21:35:17Z) - Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning [124.48672228625821]
We introduce Vlaser - a Vision-Language-Action Model with synergistic embodied reasoning capability. Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks. Our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.
arXiv Detail & Related papers (2025-10-13T05:51:22Z) - Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving [61.992824291296444]
Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs). This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework.
arXiv Detail & Related papers (2025-05-23T08:18:00Z) - From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning [82.50157695987558]
Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy. We propose an online reinforcement learning (RL)-based alignment framework that can quickly adapt LLMs into effective tutors.
arXiv Detail & Related papers (2025-05-21T15:00:07Z) - VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z) - Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification [35.277880733198586]
Vision-Language Models (VLMs) are trained on large amounts of image-text pairs, resulting in remarkable generalization across several data distributions.
We propose Vision-Language to Vision - Align, Distill, Predict (VL2V-ADiP), which first aligns the vision and language modalities of the teacher model with the vision modality of a pre-trained student model.
This maximally retains the pre-trained features of the student, while also incorporating the rich representations of the VLM image encoder and the superior generalization of the text embeddings.
arXiv Detail & Related papers (2023-10-12T11:59:54Z) - Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z) - Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding [6.798129852396113]
We introduce a simple and effective method to improve compositional reasoning in Vision-Language Models (VLMs).
Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework.
When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines.
arXiv Detail & Related papers (2023-06-15T03:26:28Z)