CoFFT: Chain of Foresight-Focus Thought for Visual Language Models
- URL: http://arxiv.org/abs/2509.22010v3
- Date: Wed, 01 Oct 2025 09:15:14 GMT
- Title: CoFFT: Chain of Foresight-Focus Thought for Visual Language Models
- Authors: Xinyu Zhang, Yuxuan Dong, Lingling Zhang, Chengyou Jia, Zhuohang Dang, Basura Fernando, Jun Liu, Mike Zheng Shou,
- Abstract summary: Chain of Foresight-Focus Thought (CoFFT) is a training-free approach that enhances visual reasoning by emulating human visual cognition.<n>These stages function iteratively, creating an interdependent cycle where reasoning guides visual focus and visual focus informs subsequent reasoning.<n> Empirical results across multiple benchmarks using Qwen2.5-VL, InternVL-2.5, and Llava-Next demonstrate consistent performance improvements of 3.1-5.8% with controllable increasing computational overhead.
- Score: 61.34272727005052
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite significant advances in Vision Language Models (VLMs), they remain constrained by the complexity and redundancy of visual input. When images contain large amounts of irrelevant information, VLMs are susceptible to interference, thus generating excessive task-irrelevant reasoning processes or even hallucinations. This limitation stems from their inability to discover and process the required regions during reasoning precisely. To address this limitation, we present the Chain of Foresight-Focus Thought (CoFFT), a novel training-free approach that enhances VLMs' visual reasoning by emulating human visual cognition. Each Foresight-Focus Thought consists of three stages: (1) Diverse Sample Generation: generates diverse reasoning samples to explore potential reasoning paths, where each sample contains several reasoning steps; (2) Dual Foresight Decoding: rigorously evaluates these samples based on both visual focus and reasoning progression, adding the first step of optimal sample to the reasoning process; (3) Visual Focus Adjustment: precisely adjust visual focus toward regions most beneficial for future reasoning, before returning to stage (1) to generate subsequent reasoning samples until reaching the final answer. These stages function iteratively, creating an interdependent cycle where reasoning guides visual focus and visual focus informs subsequent reasoning. Empirical results across multiple benchmarks using Qwen2.5-VL, InternVL-2.5, and Llava-Next demonstrate consistent performance improvements of 3.1-5.8% with controllable increasing computational overhead.
Related papers
- Vision-aligned Latent Reasoning for Multi-modal Large Language Model [82.26044667101011]
Vision-aligned Latent Reasoning (VaLR) is a framework that dynamically generates vision-aligned latent tokens before each Chain of Thought reasoning step.<n>VaLR is trained to preserve visual knowledge during reasoning by aligning intermediate embeddings of MLLM with those from vision encoders.
arXiv Detail & Related papers (2026-02-04T12:04:02Z) - VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs [7.406217790017003]
We introduce VisRes Bench, a benchmark to study visual reasoning in naturalistic settings without contextual language supervision.<n>Analyzing model behavior across three levels of complexity, we uncover clear limitations in perceptual and relational visual reasoning capacities.<n>We conclude by discussing how VisRes provides a unified framework for advancing abstract visual reasoning in multimodal research.
arXiv Detail & Related papers (2025-12-24T14:18:38Z) - ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models [11.263321053154364]
ERGO is a reasoning-driven perception-leveraging multimodal context to determine where to focus.<n>We develop simple yet effective reward components in a reinforcement learning framework for coarse-to-fine perception.<n>Our approach delivers higher accuracy than the original model and competitive methods, with greater efficiency.
arXiv Detail & Related papers (2025-09-26T07:15:19Z) - VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs [18.349695067647012]
Visual Language Models excel at complex visual tasks such as VQA and chart understanding, yet recent work suggests they struggle with simple tests.<n>We present an evaluation that tests vision-language models' capacity for nonlocal visual reasoning.<n>Our findings show that despite gains in raw visual acuity, current models lack core visual reasoning capabilities.
arXiv Detail & Related papers (2025-07-04T23:15:52Z) - Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
We introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs.<n>ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates.<n> Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern.
arXiv Detail & Related papers (2025-05-24T12:33:52Z) - Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages.<n>TVC helps the model retain attention to the visual components throughout the reasoning.<n>Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z) - LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs [103.0226977561914]
We propose a comprehensive framework for advancing step-by-step visual reasoning in large language models.<n>We introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks.<n>Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps.<n>Third, we present a new multimodal visual reasoning model, named LlamaV-o1, trained using a multi-step curriculum learning approach.
arXiv Detail & Related papers (2025-01-10T18:59:51Z) - ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom [59.92786855289658]
We introduce a novel visual reasoning framework named ProReason.<n>ProReason features decoupled vision-reasoning capabilities and multi-run proactive perception.<n>Our experiments demonstrate that ProReason outperforms existing multi-step reasoning frameworks on various benchmarks.
arXiv Detail & Related papers (2024-10-18T03:22:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.