Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL
- URL: http://arxiv.org/abs/2505.15436v1
- Date: Wed, 21 May 2025 12:18:15 GMT
- Title: Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL
- Authors: Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li
- Abstract summary: Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. We propose a Chain-of-Focus (CoF) method that allows VLMs to perform adaptive focusing and zooming in on key image regions. We present a two-stage training pipeline, including supervised fine-tuning and reinforcement learning.
- Score: 70.1326027641056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. However, the multimodal reasoning capability has not been fully explored in existing models. In this paper, we propose a Chain-of-Focus (CoF) method that allows VLMs to perform adaptive focusing and zooming in on key image regions based on the obtained visual cues and the given questions, achieving efficient multimodal reasoning. To enable this CoF capability, we present a two-stage training pipeline, including supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct the MM-CoF dataset, comprising 3K samples derived from a visual agent designed to adaptively identify key regions to solve visual tasks with different image resolutions and questions. We use MM-CoF to fine-tune the Qwen2.5-VL model as a cold start. In the RL stage, we use outcome accuracy and format as rewards to update the Qwen2.5-VL model, enabling further refinement of its search and reasoning strategy without human priors. Our model achieves significant improvements on multiple benchmarks. On the V* benchmark, which requires strong visual reasoning capability, our model outperforms existing VLMs by 5% across 8 image resolutions ranging from 224 to 4K, demonstrating the effectiveness of the proposed CoF method and facilitating more efficient deployment of VLMs in practical applications.
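As a rough illustration of the RL stage's reward signal, the sketch below combines an outcome-accuracy term with a format term. The tag names, weights, and exact-match scoring are assumptions for illustration, not the paper's actual reward implementation.

```python
import re

def cof_reward(response: str, ground_truth: str) -> float:
    """Toy reward combining outcome accuracy and format adherence.
    Tag names, weights, and exact-match scoring are illustrative assumptions."""
    # Format reward: reasoning and the final answer should appear in explicit
    # tags so the focusing/answer structure can be parsed from the rollout.
    format_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                               response, flags=re.DOTALL))
    format_reward = 0.5 if format_ok else 0.0

    # Accuracy reward: compare the extracted answer with the ground truth.
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    predicted = match.group(1).strip().lower() if match else ""
    accuracy_reward = 1.0 if predicted == ground_truth.strip().lower() else 0.0

    return accuracy_reward + format_reward

# A well-formatted, correct rollout earns both reward components.
print(cof_reward("<think>zoom into the street sign</think><answer>stop</answer>", "Stop"))  # 1.5
```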
Related papers
- High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning [43.8114307203968]
State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO). MGPO enables LMMs to iteratively focus on key visual regions by automatically cropping sub-images.
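Such a loop can be pictured as the model alternating between emitting a bounding box (which triggers a crop) and emitting a final answer, as in the hypothetical sketch below; the helper names, output protocol, and PIL-style crop() call are assumptions, not MGPO's actual interface.

```python
from typing import Any, Callable, Dict, List, Tuple

def multi_turn_grounding(image: Any, question: str,
                         model: Callable[[List[Tuple[str, Any]]], Dict[str, Any]],
                         max_turns: int = 3) -> str:
    """Hypothetical multi-turn loop: each turn the model either returns a
    bounding box for a key region (which is cropped and appended to the
    context) or a final answer."""
    context: List[Tuple[str, Any]] = [("image", image), ("text", question)]
    for _ in range(max_turns):
        output = model(context)                   # e.g. {"bbox": (x0, y0, x1, y1)} or {"answer": "..."}
        if "answer" in output:
            return output["answer"]
        x0, y0, x1, y1 = output["bbox"]           # predicted key visual region
        sub_image = image.crop((x0, y0, x1, y1))  # automatically crop the sub-image
        context.append(("image", sub_image))      # let the model inspect the zoomed view
    return model(context + [("text", "Answer now.")])["answer"]
```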
arXiv Detail & Related papers (2025-07-08T12:05:05Z)
- Kwai Keye-VL Technical Report [80.53170317017147]
We introduce Kwai Keye-VL, a multimodal foundation model for short-video understanding. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset with a strong emphasis on video, and an innovative training recipe. To validate our approach, we conduct extensive evaluations, showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks.
arXiv Detail & Related papers (2025-07-02T17:57:28Z)
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [128.24325909395188]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0. InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z)
- ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods by an average accuracy improvement of 0.77% on the ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z)
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution [82.38677987249348]
We present the Qwen2-VL Series, which redefines the conventional predetermined-resolution approach in visual processing.
Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens.
The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos.
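To make the resolution-to-token scaling concrete, here is a back-of-the-envelope sketch; the patch and merge sizes are illustrative assumptions rather than Qwen2-VL's documented configuration.

```python
import math

def visual_token_count(height: int, width: int,
                       patch: int = 14, merge: int = 2) -> int:
    """Back-of-the-envelope count of visual tokens for a dynamic-resolution
    encoder: the image is split into patch x patch pieces and merge x merge
    neighbouring patches are fused into one token. The patch and merge sizes
    here are illustrative, not Qwen2-VL's exact configuration."""
    grid_h = math.ceil(height / patch)
    grid_w = math.ceil(width / patch)
    return math.ceil(grid_h / merge) * math.ceil(grid_w / merge)

# Higher-resolution inputs yield more tokens instead of being resized to a
# fixed shape.
print(visual_token_count(224, 224))    # 64
print(visual_token_count(1024, 768))   # 1036
```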
arXiv Detail & Related papers (2024-09-18T17:59:32Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual words, which maps visual features to probability distributions over the large multi-modal model's (LMM's) vocabulary.
We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information.
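A minimal sketch of such a mapping is given below, assuming a plain linear projection followed by a softmax; the layer choice and dimensions are placeholders rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class VisualWordHead(nn.Module):
    """Illustrative mapping from visual features to a probability distribution
    over the language model's vocabulary. The linear projection and dimensions
    are placeholders, not the paper's architecture."""
    def __init__(self, visual_dim: int = 1024, vocab_size: int = 32000):
        super().__init__()
        self.proj = nn.Linear(visual_dim, vocab_size)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, visual_dim) -> per-patch distribution over tokens
        return self.proj(visual_feats).softmax(dim=-1)

# Each image patch is described by how strongly it activates each vocabulary entry.
head = VisualWordHead()
dist = head(torch.randn(1, 16, 1024))
print(dist.shape)  # torch.Size([1, 16, 32000]); each row sums to ~1
```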
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for PVM finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x fewer training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z)
- Delving into Multimodal Prompting for Fine-grained Visual Classification [57.12570556836394]
Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category.
Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks.
We propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the Contrastive Language-Image Pre-training (CLIP) model.
arXiv Detail & Related papers (2023-09-16T07:30:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.