CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks
- URL: http://arxiv.org/abs/2512.06663v1
- Date: Sun, 07 Dec 2025 05:26:30 GMT
- Title: CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks
- Authors: Yu Qi, Yumeng Zhang, Chenting Gong, Xiao Tan, Weiming Zhang, Wei Zhang, Jingdong Wang
- Abstract summary: Chain-of-Thought for Detection (CoT4Det) is a simple but efficient strategy that reformulates perception tasks into three interpretable steps. We show that CoT4Det significantly improves perception performance without compromising general vision-language capabilities.
- Score: 53.88194225946438
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable success in a broad range of vision-language tasks, such as general visual question answering and optical character recognition (OCR). However, their performance on perception-centric tasks -- such as object detection, semantic segmentation, and depth estimation -- remains significantly inferior to that of task-specific expert models. For example, Qwen2.5-VL-7B-Instruct achieves only 19% mAP on COCO2017 val, struggling particularly with dense scenes and small-object recall. In this work, we introduce Chain-of-Thought for Detection (CoT4Det), a simple but efficient strategy that reformulates perception tasks into three interpretable steps: classification, counting, and grounding -- each more naturally aligned with the reasoning capabilities of LVLMs. Extensive experiments demonstrate that our method significantly improves perception performance without compromising general vision-language capabilities. With a standard Qwen2.5-VL-7B-Instruct, CoT4Det boosts mAP from 19.0% to 33.0% on COCO2017 val and achieves competitive results across a variety of perception benchmarks, outperforming baselines by +2% on the RefCOCO series and 19% on Flickr30k Entities.
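The three-step decomposition described in the abstract maps naturally onto a simple prompting loop. The sketch below is an illustration only, not the authors' implementation: the `query_vlm` helper, the prompt wording, and the canned responses are assumptions standing in for a real LVLM inference call (e.g. Qwen2.5-VL-7B-Instruct); only the classify -> count -> ground ordering comes from the abstract.

```python
# Minimal sketch of CoT4Det-style prompting: classification -> counting -> grounding.
# `query_vlm` is a hypothetical placeholder; swap in a real LVLM chat call.

import json


def query_vlm(image_path: str, prompt: str) -> str:
    """Stand-in for an LVLM call. Returns canned JSON so the sketch runs
    end to end; replace with actual model inference."""
    canned = {
        "Step 1": '["person", "dog"]',
        "Step 2": '{"person": 2, "dog": 1}',
        "Step 3": '{"person": [[10, 20, 110, 220], [200, 30, 290, 210]],'
                  ' "dog": [[50, 150, 120, 230]]}',
    }
    return next(v for k, v in canned.items() if prompt.startswith(k))


def cot4det(image_path: str) -> dict:
    # Step 1 (classification): which object categories appear in the image?
    cats = json.loads(query_vlm(
        image_path,
        "Step 1: List every object category visible in the image as a JSON array."))
    # Step 2 (counting): how many instances of each category?
    counts = json.loads(query_vlm(
        image_path,
        f"Step 2: Count the instances of each category in {cats}. Answer as a JSON object."))
    # Step 3 (grounding): one [x1, y1, x2, y2] box per counted instance.
    boxes = json.loads(query_vlm(
        image_path,
        f"Step 3: For each category and count in {counts}, return one [x1, y1, x2, y2] box per instance as JSON."))
    return boxes


if __name__ == "__main__":
    print(cot4det("example.jpg"))
```

Chaining the steps this way lets the grounding prompt condition on an explicit per-category count, which plausibly accounts for the recall gains the abstract reports on dense scenes and small objects.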
Related papers
- Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning [82.39668822222386]
Vision token pruning has proven to be an effective acceleration technique for efficient Vision-Language Models (VLMs). We propose Nüwa, a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity. Experiments demonstrate that Nüwa achieves SOTA performance on multiple VQA benchmarks (from 94% to 95%) and yields substantial improvements on visual grounding tasks (from 7% to 47%).
arXiv Detail & Related papers (2026-02-03T00:51:03Z) - STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision [24.162895928364062]
We introduce STELAR-Vision, a training framework for topology-aware reasoning. At its core is TopoAug, a synthetic data pipeline that enriches training with diverse topological structures. On MATH-V and VLM-S2H, STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%.
arXiv Detail & Related papers (2025-08-12T07:27:50Z) - AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP [2.869780207429188]
Large language models (LLMs) have shown remarkable progress in reasoning abilities. Yet their performance on Arabic data, characterized by rich morphology, diverse dialects, and complex script, remains underexplored. This paper presents a comprehensive benchmarking study of multiple reasoning-focused LLMs.
arXiv Detail & Related papers (2025-06-10T13:10:31Z) - One RL to See Them All: Visual Triple Unified Reinforcement Learning [92.90120580989839]
We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables visual reasoning and perception tasks within a single training pipeline. V-Triune comprises three complementary components, including Sample-Level Datashelf (to unify diverse task inputs) and Verifier-Level Reward (to deliver custom rewards via specialized verifiers). We introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune.
arXiv Detail & Related papers (2025-05-23T17:41:14Z) - VisionReasoner: Unified Reasoning-Integrated Visual Perception via Reinforcement Learning [56.99825489208698]
We introduce VisionReasoner, a unified framework capable of reasoning and solving multiple visual perception tasks. VisionReasoner enhances its reasoning capabilities to analyze visual inputs and addresses diverse perception tasks within a unified model. We evaluate VisionReasoner on ten diverse tasks spanning three critical domains: detection, segmentation, and counting.
arXiv Detail & Related papers (2025-05-17T16:51:47Z) - Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation [53.84282335629258]
We introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 1.01 million questions and 0.33 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives. We uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance.
arXiv Detail & Related papers (2025-04-21T09:30:41Z) - VLM2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues [34.95077625513563]
We introduce VLM2-Bench, a benchmark designed to assess whether vision-language models can Visually Link Matching cues. Comprehensive evaluation across twelve VLMs, along with further analysis of various language-side and vision-side prompting methods, leads to a total of eight key findings. We identify critical challenges in models' ability to link visual cues, highlighting a significant performance gap.
arXiv Detail & Related papers (2025-02-17T17:57:50Z) - Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study [4.80612909282198]
This study introduces a new multi-task spatial evaluation dataset designed to explore and compare the performance of several advanced models on spatial tasks. The dataset includes twelve distinct task types, such as spatial understanding and simple route planning, each with verified and accurate answers.
arXiv Detail & Related papers (2024-08-26T17:25:16Z) - Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z) - Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching [63.88319217738223]
We present Matcher, a novel perception paradigm that utilizes off-the-shelf vision foundation models to address various perception tasks.
Matcher demonstrates impressive generalization performance across various segmentation tasks, all without training.
Our results further showcase the open-world generality and flexibility of Matcher when applied to images in the wild.
arXiv Detail & Related papers (2023-05-22T17:59:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.