LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception
- URL: http://arxiv.org/abs/2504.15362v1
- Date: Mon, 21 Apr 2025 18:10:38 GMT
- Title: LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception
- Authors: Yuan-Hong Liao, Sven Elflein, Liu He, Laura Leal-Taixé, Yejin Choi, Sanja Fidler, David Acuna
- Abstract summary: We introduce LongPerceptualThoughts, a new synthetic dataset with 30K long-thought traces for perceptual tasks. We propose a novel three-stage data synthesis framework that first synthesizes verifiable multiple-choice questions. We demonstrate notable improvements over existing visual reasoning data-generation methods.
- Score: 105.78609483419115
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent reasoning models that scale test-time compute have demonstrated that long chains of thought can unlock substantial performance gains on hard reasoning tasks such as math and code. However, the benefit of such long, system-2 thoughts is less explored in other domains, such as perceptual tasks, where shallower system-1 reasoning seems sufficient. In this paper, we introduce LongPerceptualThoughts, a new synthetic dataset with 30K long-thought traces for perceptual tasks. The key challenges in synthesizing elaborate reasoning thoughts for perceptual tasks are that off-the-shelf models are not yet equipped with such thinking behavior and that it is not straightforward to build a reliable process verifier for perceptual tasks. Thus, we propose a novel three-stage data synthesis framework that first synthesizes verifiable multiple-choice questions from dense image descriptions, then extracts simple CoTs from VLMs for those verifiable problems, and finally expands those simple thoughts into elaborate long thoughts via frontier reasoning models. In controlled experiments with a strong instruction-tuned 7B model, we demonstrate notable improvements over existing visual reasoning data-generation methods. Our model, trained on the generated dataset, achieves an average +3.4-point improvement across 5 vision-centric benchmarks, including +11.8 points on V$^*$ Bench. Notably, despite being tuned for vision tasks, it also improves performance on the text reasoning benchmark MMLU-Pro by +2 points.
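To make the three-stage pipeline concrete, here is a minimal Python sketch. It assumes caller-supplied model wrappers (`make_mcqs`, `vlm`, `reasoner`); those names, the prompt wording, and the answer-extraction regex are illustrative assumptions, not the paper's actual implementation.

```python
import re
from dataclasses import dataclass

@dataclass
class MCQ:
    text: str            # question plus lettered options
    correct_option: str  # verified answer letter, e.g. "B"

def extract_answer(cot: str):
    """Pull a final 'Answer: X' letter out of a chain of thought."""
    m = re.search(r"Answer:\s*([A-D])", cot)
    return m.group(1) if m else None

def synthesize_long_thoughts(image, dense_caption, make_mcqs, vlm, reasoner):
    samples = []
    # Stage 1: derive verifiable multiple-choice questions from the dense
    # caption, so answers can be checked without a learned process verifier.
    for q in make_mcqs(dense_caption):
        # Stage 2: elicit a simple CoT from the VLM; keep it only when its
        # extracted answer matches the known-correct option.
        simple_cot = vlm(f"{q.text}\nThink step by step, then answer.", image)
        if extract_answer(simple_cot) != q.correct_option:
            continue
        # Stage 3: expand the verified short CoT into an elaborate long
        # thought with a frontier reasoning model.
        long_thought = reasoner(
            "Rewrite this reasoning as a detailed long thought that "
            "verifies and, where needed, revises its own steps:\n" + simple_cot)
        samples.append({"question": q.text, "thought": long_thought,
                        "answer": q.correct_option})
    return samples
```

Filtering on the verified option in stage 2 is what lets the pipeline sidestep a process verifier: only answer-correct traces are ever expanded into long thoughts.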
Related papers
- Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models [42.407188124841234]
Landscape of Thoughts is a tool to inspect the reasoning paths of chain-of-thought on any multi-choice dataset. It distinguishes between strong and weak models, correct and incorrect answers, as well as different reasoning tasks. It also uncovers undesirable reasoning patterns, such as low consistency and high uncertainty.
arXiv Detail & Related papers (2025-03-28T06:09:51Z)
- Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages.
TVC helps the model retain attention to the visual components throughout the reasoning.
Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
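A rough sketch of the take-along idea, for a decoder that accepts flat token sequences. The `model.generate` wrapper, the fixed stage length, and the fixed number of stages are simplifying assumptions; the actual TVC method additionally decides where the critical reasoning stages fall and how the visual tokens are conditioned.

```python
def generate_with_tvc(model, prompt_tokens, image_tokens,
                      n_stages=3, stage_len=128):
    """Re-inject visual tokens at each reasoning stage (a simplified sketch).

    `model.generate(tokens, max_new_tokens=...)` is assumed to continue
    decoding from a token sequence and return only the new tokens.
    """
    context = list(prompt_tokens) + list(image_tokens)
    output = []
    for _ in range(n_stages):
        # Decode one stage of the chain of thought.
        stage = model.generate(context, max_new_tokens=stage_len)
        output.extend(stage)
        # Take the image along: append its tokens again so the next stage
        # attends to the visual evidence directly instead of recalling it.
        context = context + stage + list(image_tokens)
    return output
```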
arXiv Detail & Related papers (2025-03-17T16:45:12Z)
- MastermindEval: A Simple But Scalable Reasoning Benchmark [3.5519847710183674]
We introduce MastermindEval, a simple, scalable, and interpretable deductive reasoning benchmark inspired by the board game Mastermind.
Our benchmark supports two evaluation paradigms: (1) agentic evaluation, in which the model autonomously plays the game, and (2) deductive reasoning evaluation, in which the model is given a pre-played game state with only one possible valid code to infer.
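The deductive-reasoning paradigm is easy to pin down in code: given a pre-played history of (guess, feedback) pairs, one can enumerate the codes consistent with every clue, and the benchmark presents states where exactly one code survives. A small self-contained sketch; the game parameters and history below are toy values, not taken from the benchmark.

```python
from collections import Counter
from itertools import product

def feedback(secret, guess):
    """Mastermind feedback: (black, white) pegs for a guess against a secret."""
    black = sum(s == g for s, g in zip(secret, guess))          # right colour, right slot
    overlap = sum((Counter(secret) & Counter(guess)).values())  # colour matches, any slot
    return black, overlap - black

def consistent_codes(colours, slots, history):
    """All secrets that would produce every observed (guess, feedback) pair."""
    return [code for code in product(colours, repeat=slots)
            if all(feedback(code, g) == fb for g, fb in history)]

# A toy pre-played state over 3 colours and 2 slots; only ("B", "G") is
# consistent with both clues, so a model must infer exactly that code.
history = [(("R", "G"), (1, 0)), (("R", "B"), (0, 1))]
print(consistent_codes("RGB", 2, history))  # [('B', 'G')]
```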
arXiv Detail & Related papers (2025-03-07T19:24:59Z)
- Unveiling the Magic of Code Reasoning through Hypothesis Decomposition and Amendment [54.62926010621013]
We introduce a novel task, code reasoning, to provide a new perspective on the reasoning abilities of large language models. We summarize three meta-benchmarks based on established forms of logical reasoning, and instantiate these into eight specific benchmark tasks. We present a new pathway exploration pipeline inspired by humans' intricate problem-solving methods.
arXiv Detail & Related papers (2025-02-17T10:39:58Z)
- LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs [103.0226977561914]
We propose a comprehensive framework for advancing step-by-step visual reasoning in large language models.
First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks.
Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps.
Third, we present a new multimodal visual reasoning model, named LlamaV-o1, trained using a multi-step curriculum learning approach.
arXiv Detail & Related papers (2025-01-10T18:59:51Z)
- Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems [92.89673285398521]
o1-like reasoning systems have demonstrated remarkable capabilities in solving complex reasoning tasks.
We introduce an "imitate, explore, and self-improve" framework to train the reasoning model.
Our approach achieves competitive performance compared to industry-level reasoning systems.
arXiv Detail & Related papers (2024-12-12T16:20:36Z)
- Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings [61.04460792203266]
We introduce VCoT, a novel method that leverages chain-of-thought prompting with vision-language grounding to bridge the logical gaps within sequential data.
Our method uses visual guidance to generate synthetic multimodal infillings that add consistent and novel information to reduce the logical gaps for downstream tasks.
arXiv Detail & Related papers (2023-05-03T17:58:29Z)