Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale
- URL: http://arxiv.org/abs/2511.05705v1
- Date: Fri, 07 Nov 2025 20:50:54 GMT
- Title: Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale
- Authors: David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu, Prithviraj Ammanabrolu, Hyunwoo Kim, Yuan-Hong Liao, Yejin Choi,
- Abstract summary: We introduce a new reasoning data generation framework spanning diverse skills and levels of complexity with over 1M high-quality synthetic vision-centric questions. We show that finetuning Qwen2.5-VL-7B on our data outperforms all open-data baselines across all evaluated vision-centric benchmarks.
- Score: 70.23466957404891
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in multimodal reasoning has been driven largely by undisclosed datasets and proprietary data synthesis recipes, leaving open questions about how to systematically build large-scale, vision-centric reasoning datasets, particularly for tasks that go beyond visual math. In this work, we introduce a new reasoning data generation framework spanning diverse skills and levels of complexity with over 1M high-quality synthetic vision-centric questions. The dataset also includes preference data and instruction prompts supporting both offline and online RL. Our synthesis framework proceeds in two stages: (1) scale; and (2) complexity. Reasoning traces are then synthesized through a two-stage process that leverages VLMs and reasoning LLMs, producing CoT traces for VLMs that capture the richness and diverse cognitive behaviors found in frontier reasoning models. Remarkably, we show that finetuning Qwen2.5-VL-7B on our data outperforms all open-data baselines across all evaluated vision-centric benchmarks, and even surpasses strong closed-data models such as MiMo-VL-7B-RL on V* Bench, CV-Bench and MMStar-V. Perhaps most surprising, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro) and audio reasoning (MMAU), demonstrating its effectiveness. Similarly, despite not containing videos or embodied visual data, we observe notable gains when evaluating on a single-evidence embodied QA benchmark (NiEH). Finally, we use our data to analyze the entire VLM post-training pipeline. Our empirical analysis highlights that (i) SFT on high-quality data with non-linear reasoning traces is essential for effective online RL, (ii) staged offline RL matches online RL's performance while reducing compute demands, and (iii) careful SFT on high quality data can substantially improve out-of-domain, cross-modality transfer.
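The two-stage trace synthesis described in the abstract lends itself to a simple illustration. The Python sketch below shows one plausible reading of that pipeline: a VLM first produces a vision-grounded description of the relevant evidence, a text-only reasoning LLM then expands that grounding into a long chain-of-thought, and the trace is kept only if its final answer verifies against a reference label. All function names, prompts, and the `vlm` / `reasoner` / `verifier` interfaces are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a two-stage trace synthesis, assuming (1) a VLM grounds the
# question in the image and (2) a text-only reasoning LLM expands that grounding
# into a long CoT. Interfaces and prompts are illustrative, not the paper's code.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class TraceExample:
    question: str
    image_path: str
    cot: str
    answer: str

def ground_with_vlm(vlm, image_path: str, question: str) -> str:
    """Stage 1: ask the VLM for a detailed, vision-grounded description of
    the evidence relevant to the question."""
    prompt = f"Describe the visual evidence relevant to: {question}"
    return vlm.generate(image=image_path, prompt=prompt)

def reason_with_llm(reasoner, question: str, grounding: str) -> tuple[str, str]:
    """Stage 2: a text-only reasoning LLM turns the grounding into a long
    chain-of-thought ending with an 'Answer:' line (assumed output format)."""
    prompt = (
        "Visual evidence:\n" + grounding
        + f"\n\nQuestion: {question}\nThink step by step, then state 'Answer: ...'."
    )
    cot = reasoner.generate(prompt)
    answer = cot.split("Answer:")[-1].strip()
    return cot, answer

def synthesize_trace(vlm, reasoner,
                     verifier: Callable[[str, str], bool],
                     image_path: str, question: str,
                     reference: str) -> Optional[TraceExample]:
    """Keep a (question, CoT, answer) triple only if the final answer
    verifies against the reference label; otherwise discard it."""
    grounding = ground_with_vlm(vlm, image_path, question)
    cot, answer = reason_with_llm(reasoner, question, grounding)
    if verifier(answer, reference):
        return TraceExample(question, image_path, cot, answer)
    return None
```

Separating grounding from reasoning is one plausible way to let a text-only reasoning model contribute the long, reflective CoT behavior the abstract attributes to frontier reasoning models, while the VLM supplies the visual evidence.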
Related papers
- MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods [41.49799689399879]
We introduce MMFineReason, a large-scale multimodal reasoning dataset comprising 1.8M samples and 5.1B solution tokens. The resulting dataset spans STEM problems, visual puzzles, games, and complex diagrams, with each sample annotated with visually grounded reasoning traces. Our models establish new state-of-the-art results for their size class.
arXiv Detail & Related papers (2026-01-29T15:07:28Z) - VERITAS: Leveraging Vision Priors and Expert Fusion to Improve Multimodal Data [3.638465758795032]
VERITAS is a pipeline that integrates vision priors and multiple state-of-the-art LMMs to enhance SFT data quality. Three LMMs evaluate the original answers, providing critique rationales and scores that are statistically fused into a high-confidence consensus score (a minimal fusion sketch appears after this list). The resulting critic model exhibits critique capability comparable to state-of-the-art LMMs while being significantly more efficient.
arXiv Detail & Related papers (2025-10-17T05:13:50Z) - MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization [103.74675519953898]
Long-chain reflective reasoning is a prerequisite for solving complex real-world problems. We build a benchmark consisting of 1,260 samples across 42 challenging synthetic tasks. We generate post-training data and explore learning paradigms for exploiting such data.
arXiv Detail & Related papers (2025-10-09T17:53:58Z) - Vision-G1: Towards General Vision Language Reasoning with Multi-Domain Data Curation [64.23194519770897]
We build a comprehensive RL-ready visual reasoning dataset from 46 data sources across 8 dimensions. We propose an influence-function-based data selection and difficulty-based filtering strategy to identify high-quality training samples from this dataset. We train the VLM, referred to as Vision-G1, using multi-round RL with a data curriculum to iteratively improve its visual reasoning capabilities.
arXiv Detail & Related papers (2025-08-18T07:24:33Z) - Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance [29.94723846950853]
We conduct the first comprehensive analysis of two prominent open post-training datasets: Tulu-3-SFT-Mix and SmolTalk. We derive statistics that reveal structural and qualitative similarities and differences between the two datasets. Our findings offer actionable insights for constructing more effective post-training datasets.
arXiv Detail & Related papers (2025-06-06T20:34:06Z) - Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z) - D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning [99.33607114541861]
We propose a new benchmark for offline RL that focuses on realistic simulations of robotic manipulation and locomotion environments.
Our proposed benchmark covers state-based and image-based domains, and supports both offline RL and online fine-tuning evaluation.
arXiv Detail & Related papers (2024-08-15T22:27:00Z) - ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models [45.040292339670096]
Large vision-language models (LVLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities.
This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data.
arXiv Detail & Related papers (2024-02-18T19:26:49Z) - Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z)
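For the VERITAS entry above, the following is a minimal sketch of what statistically fusing several critics' scores into a single consensus score could look like. The fusion rule (median of per-critic z-normalized scores), the 1-10 scale, and the critic names are assumptions made for illustration, not the paper's actual statistic.

```python
# Illustrative multi-critic score fusion: each critic LMM scores the same SFT
# answer; scores are z-normalized per critic so lenient and harsh raters are
# comparable, then the median is taken so one outlier cannot dominate.
import statistics

def consensus_score(critic_scores: dict[str, float],
                    critic_means: dict[str, float],
                    critic_stds: dict[str, float]) -> float:
    """Return the median of per-critic z-normalized scores."""
    normalized = [
        (critic_scores[name] - critic_means[name]) / max(critic_stds[name], 1e-6)
        for name in critic_scores
    ]
    return statistics.median(normalized)

# Hypothetical example: three critics rate one answer on a 1-10 scale,
# with each critic's mean/std taken from its past ratings.
scores = {"critic_a": 8.0, "critic_b": 7.5, "critic_c": 4.0}
means  = {"critic_a": 6.0, "critic_b": 6.5, "critic_c": 5.0}
stds   = {"critic_a": 1.5, "critic_b": 1.0, "critic_c": 2.0}
print(consensus_score(scores, means, stds))  # positive => above-average answer
```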