Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?
- URL: http://arxiv.org/abs/2601.06993v1
- Date: Sun, 11 Jan 2026 17:07:47 GMT
- Title: Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?
- Authors: Jie Zhu, Yiyang Su, Xiaoming Liu
- Abstract summary: Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet still struggle on Fine-Grained Visual Classification (FGVC). We propose ReFine-RFT, a framework that combines ensemble rewards with \alg to constrain reasoning length while providing dense accuracy-oriented feedback.
- Score: 18.16727716373833
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet still struggle on Fine-Grained Visual Classification (FGVC), a core perception task that requires subtle visual discrimination and is crucial for many real-world applications. A widely adopted strategy for boosting performance on challenging tasks such as math and coding is Chain-of-Thought (CoT) reasoning. However, several prior works have reported that CoT can actually harm performance on visual perception tasks. These studies, though, examine the issue from relatively narrow angles and leave open why CoT degrades perception-heavy performance. We systematically re-examine the role of CoT in FGVC through the lenses of zero-shot evaluation and multiple training paradigms. Across these settings, we uncover a central paradox: the degradation induced by CoT is largely driven by reasoning length, with longer textual reasoning consistently lowering classification accuracy. We term this phenomenon the "Cost of Thinking". Building on this finding, we make two key contributions: (1) \alg, a simple and general plug-and-play normalization method for multi-reward optimization that balances heterogeneous reward signals, and (2) ReFine-RFT, a framework that combines ensemble rewards with \alg to constrain reasoning length while providing dense accuracy-oriented feedback. Extensive experiments demonstrate the effectiveness of our findings and the proposed ReFine-RFT, achieving state-of-the-art performance across FGVC benchmarks. Code and models are available at https://github.com/jiezhu23/ReFine-RFT.
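The abstract describes \alg only as a plug-and-play normalization that balances heterogeneous reward signals before they are combined. A minimal sketch of that general idea, assuming min-max scaling per reward stream and an illustrative length-penalty reward (neither is confirmed as the paper's actual formulation), might look like this:

```python
import numpy as np

def balance_rewards(reward_streams, eps=1e-8):
    # Min-max scale each reward stream to [0, 1] within the batch before
    # summing, so no single heterogeneous signal dominates the total.
    # The exact \alg normalization is not given in the abstract; this
    # scaling is an illustrative assumption.
    total = None
    for values in reward_streams.values():
        r = np.asarray(values, dtype=float)
        scaled = (r - r.min()) / (r.max() - r.min() + eps)
        total = scaled if total is None else total + scaled
    return total

# Hypothetical reward streams for four sampled responses: a sparse
# classification-accuracy reward and a length penalty that discourages
# long reasoning traces (the "Cost of Thinking").
rewards = {
    "accuracy": [1.0, 0.0, 1.0, 1.0],           # 1 if the predicted label is correct
    "length": [-120.0, -340.0, -80.0, -200.0],  # negative reasoning-token counts
}
print(balance_rewards(rewards))
```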
Related papers
- Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning [23.364264811510598]
Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). We introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images. Our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT.
arXiv Detail & Related papers (2026-01-21T08:09:25Z)
- Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization [55.6995787502694]
We study how different Chain-of-Thought (CoT) designs affect the acquisition of generalizable visual reasoning ability. We compare three representative CoT formats: Language CoT, Grounding CoT, and Visual CoT. Our experiments reveal that visual and longer CoT mainly accelerate convergence but do not lift the final performance ceiling.
arXiv Detail & Related papers (2025-11-27T16:19:34Z)
- Start Small, Think Big: Curriculum-based Relative Policy Optimization for Visual Grounding [23.138205646078536]
Chain-of-Thought (CoT) prompting has recently shown significant promise across various NLP and computer vision tasks. We find that reinforcement learning (RL)-based fine-tuning of CoT reasoning can paradoxically degrade performance on Visual Grounding tasks. We propose Curriculum-based Relative Policy Optimization (CuRPO), a novel training strategy that leverages CoT length and generalized Intersection over Union (gIoU) rewards.
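The gIoU reward named in this abstract is a standard quantity; the sketch below shows it as an illustrative reward term for a predicted grounding box (the curriculum scheduling itself, the heart of CuRPO, is omitted):

```python
def giou_reward(pred, gt, eps=1e-8):
    # Generalized IoU for two boxes in (x1, y1, x2, y2) format.
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    iou = inter / (union + eps)
    # Smallest box enclosing both; the gIoU term penalizes empty space,
    # so disjoint boxes still receive a useful (negative) gradient signal.
    ex1, ey1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    ex2, ey2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    return iou - (enclose - union) / (enclose + eps)

# A perfectly grounded box scores 1.0; disjoint boxes go negative.
print(giou_reward((10, 10, 50, 50), (20, 20, 60, 60)))  # partial overlap
```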
arXiv Detail & Related papers (2025-11-17T21:22:50Z)
- TFRank: Think-Free Reasoning Enables Practical Pointwise LLM Ranking [21.930228130429573]
Reasoning-intensive ranking models built on Large Language Models (LLMs) have made notable progress. Existing approaches often rely on large-scale LLMs and explicit Chain-of-Thought (CoT) reasoning. We propose TFRank, an efficient pointwise reasoning ranker based on small-scale LLMs.
arXiv Detail & Related papers (2025-08-13T06:47:58Z)
- Reinforcing Video Reasoning with Focused Thinking [65.85683941058916]
We propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density. We also reformulate RL training by shifting from single-choice to multi-choice QA tasks.
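The abstract does not spell out TW-GRPO's weighting rule, so the following is only a sketch of the general idea, assuming surprisal (-log p) as a stand-in proxy for informational density:

```python
import torch

def token_weighted_loss(logprobs, advantage, temperature=1.0):
    # Up-weight high-surprisal tokens in a REINFORCE-style objective.
    # The surprisal proxy and softmax weighting are assumptions, not
    # TW-GRPO's published mechanism.
    surprisal = -logprobs.detach()  # higher = more informative token
    weights = torch.softmax(surprisal / temperature, dim=-1) * logprobs.numel()
    return -(weights * advantage * logprobs).mean()

# Four reasoning tokens with a shared group-relative advantage of 0.7.
logprobs = torch.tensor([-0.1, -2.3, -0.5, -4.0], requires_grad=True)
loss = token_weighted_loss(logprobs, advantage=0.7)
loss.backward()  # gradients concentrate on the high-surprisal tokens
```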
arXiv Detail & Related papers (2025-05-30T15:42:19Z)
- PixelThink: Towards Efficient Chain-of-Pixel Reasoning [70.32510083790069]
PixelThink is a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty. It learns to compress reasoning length in accordance with scene complexity and predictive confidence. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance.
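As a rough illustration of the difficulty- and uncertainty-conditioned length compression this abstract describes, a fixed linear budget schedule might look like the following (PixelThink learns this mapping; the linear form, ranges, and parameter names here are assumptions):

```python
def reasoning_budget(difficulty, uncertainty, min_tokens=32, max_tokens=512):
    # Map externally estimated task difficulty and internally measured
    # model uncertainty (both assumed to lie in [0, 1]) to a token budget
    # for the reasoning chain: harder or less certain inputs get more tokens.
    scale = 0.5 * (difficulty + uncertainty)
    return int(min_tokens + scale * (max_tokens - min_tokens))

print(reasoning_budget(difficulty=0.2, uncertainty=0.1))  # easy scene -> 104
print(reasoning_budget(difficulty=0.9, uncertainty=0.8))  # hard scene -> 440
```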
arXiv Detail & Related papers (2025-05-29T17:55:49Z)
- Reinforced Latent Reasoning for LLM-based Recommendation [92.56166822197919]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks. Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data. In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning.
arXiv Detail & Related papers (2025-05-25T11:03:45Z)
- Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models [42.75418134743927]
Reason-RFT is a two-stage reinforcement fine-tuning framework for visual reasoning. First, Supervised Fine-Tuning (SFT) with curated CoT data activates the reasoning potential of Vision-Language Models (VLMs). Second, reinforcement learning based on Group Relative Policy Optimization (GRPO) generates multiple reasoning-response pairs to enhance adaptability to domain shifts.
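The group-relative advantage is GRPO's well-known core step; a minimal sketch of stage two's advantage computation (the SFT stage and Reason-RFT's specific reward design are omitted):

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    # GRPO normalizes each sampled response's reward against the mean and
    # std of its group (all responses to the same prompt), removing the
    # need for a learned value function.
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Six reasoning-response pairs sampled for one image-question pair,
# rewarded here by answer correctness (an illustrative choice).
print(grpo_advantages([1.0, 0.0, 1.0, 0.0, 0.0, 1.0]))
```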
arXiv Detail & Related papers (2025-03-26T17:38:06Z)
- GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [62.536191233049614]
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). This work investigates this problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we term thought collapse.
arXiv Detail & Related papers (2025-03-11T15:17:02Z)
- Visual-RFT: Visual Reinforcement Fine-Tuning [75.20572976629646]
Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers. Visual-RFT further extends the application areas of RFT to visual tasks.
arXiv Detail & Related papers (2025-03-03T18:16:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.