On the Suitability of Reinforcement Fine-Tuning to Visual Tasks
- URL: http://arxiv.org/abs/2504.05682v1
- Date: Tue, 08 Apr 2025 04:45:00 GMT
- Title: On the Suitability of Reinforcement Fine-Tuning to Visual Tasks
- Authors: Xiaxu Chen, Wei Li, Chunxu Liu, Chi Xie, Xiaoyan Hu, Chengqian Ma, Feng Zhu, Rui Zhao
- Abstract summary: Researchers have started to apply RFT to MLLMs, hoping it will also enhance their visual-understanding capabilities. In this work, we endeavor to understand the suitability and limitations of RFT for visual tasks, through experimental analysis and observations.
- Score: 15.971601297360227
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Fine-Tuning (RFT) has proven greatly valuable for enhancing the reasoning ability of LLMs. Researchers have started to apply RFT to MLLMs, hoping it will also enhance their visual-understanding capabilities. However, these works are at a very early stage and have not examined how suitable RFT actually is for visual tasks. In this work, we endeavor to understand the suitability and limitations of RFT for visual tasks, through experimental analysis and observations. We start with quantitative comparisons across various tasks, which show that RFT is generally better than SFT on visual tasks. To check whether this advantage stems from the reasoning process, we design a new reward that encourages the model to "think" more; the results show that more thinking can be beneficial for complicated tasks but harmful for simple ones. We hope this study can provide more insight into the rapid advances on this topic.
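The abstract does not spell out the thinking-oriented reward, so the following is only a minimal sketch of one plausible form: a verifiable correctness term plus a capped bonus that grows with the length of the reasoning trace. The `<think>`/`<answer>` tag format, the bonus weight, and the length target are all assumptions for illustration, not details taken from the paper.

```python
import re

def think_more_reward(response: str, ground_truth: str,
                      len_target: int = 256, bonus_weight: float = 0.5) -> float:
    """Hypothetical 'think more' reward: exact-match correctness plus a
    bonus that saturates once the reasoning trace reaches len_target words.
    Tags, weight, and target are assumptions, not the paper's values."""
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    correct = 1.0 if answer and answer.group(1).strip() == ground_truth.strip() else 0.0
    n_words = len(think.group(1).split()) if think else 0
    bonus = bonus_weight * min(n_words / len_target, 1.0)  # capped length incentive
    return correct + bonus
```

Under a reward of this shape the policy is paid for longer traces regardless of task difficulty, which is consistent with the paper's observation that extra thinking helps complicated tasks but can hurt simple ones.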
Related papers
- ToolRL: Reward is All Tool Learning Needs [54.16305891389931]
Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities.
Recent advancements in reinforcement learning (RL) have demonstrated promising reasoning and generalization abilities.
We present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm.
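ToolRL's concrete reward components are not reproduced in this summary; the sketch below is only an assumed illustration of a decomposed tool-use reward, giving partial credit for selecting the right tool and full credit when the arguments also match.

```python
import json

def tool_call_reward(predicted: str, gold: str) -> float:
    """Assumed decomposed reward for tool learning (illustrative only):
    0.0 for a malformed or wrong-tool call, 0.5 for the right tool name,
    1.0 when the arguments match as well."""
    try:
        p, g = json.loads(predicted), json.loads(gold)
    except json.JSONDecodeError:
        return 0.0  # unparseable tool call earns nothing
    if p.get("name") != g.get("name"):
        return 0.0
    return 1.0 if p.get("arguments") == g.get("arguments") else 0.5
```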
arXiv Detail & Related papers (2025-04-16T21:45:32Z)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding.
It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions.
Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT).
Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
- Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning [19.28434717501445]
Visual reasoning abilities play a crucial role in understanding complex multimodal data.
Existing methods improve VLM reasoning via Chain-of-Thought supervised fine-tuning.
We propose Reason-RFT, a novel reinforcement fine-tuning framework.
arXiv Detail & Related papers (2025-03-26T17:38:06Z)
- OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement [91.88062410741833]
This study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs).
We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and Reinforcement Learning (RL) to further improve model generalization.
OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrates the potential of our strategy for robust vision-language reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z)
- Think or Not Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning [8.665713419757061]
We investigate the thinking process in rule-based reinforcement learning fine-tuning (RFT) for multi-modal large language models (MLLMs).
We first propose CLS-RL for classification, using verifiable rewards to encourage MLLM thinking.
Experiments show CLS-RL significantly outperforms SFT and yields a 'free-lunch' generalization effect (improving performance on unseen datasets after training on one dataset).
We then question whether this explicit thinking is always necessary for RFT.
Challenging the convention that explicit thinking is crucial for RFT, we introduce No-Thinking-RL, which minimizes thinking via a simple equality accuracy reward.
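The "simple equality accuracy reward" named above plausibly reduces to an exact-match check on the predicted class label; a minimal sketch (the normalization rules are an assumption) might look like:

```python
def equality_accuracy_reward(prediction: str, label: str) -> float:
    """Minimal verifiable reward in the spirit of No-Thinking-RL:
    1.0 iff the answer string equals the class label. Case and
    whitespace normalization are assumptions, not the paper's rule."""
    return 1.0 if prediction.strip().lower() == label.strip().lower() else 0.0
```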
arXiv Detail & Related papers (2025-03-20T14:37:45Z)
- Grounded Chain-of-Thought for Multimodal Large Language Models [66.04061083611863]
We propose a new learning task for multimodal large language models (MLLMs) called Grounded Chain-of-Thought (GCoT).
GCoT aims to help MLLMs recognize and ground the relevant visual cues step by step, thereby predicting the correct answer with grounding coordinates as the intuitive basis.
To facilitate this task, we also carefully design and construct a dataset called multimodal grounded chain-of-thought (MM-GCoT), consisting of 24,022 GCoT examples for 5,033 images.
arXiv Detail & Related papers (2025-03-17T04:07:47Z)
- Visual-RFT: Visual Reinforcement Fine-Tuning [75.20572976629646]
Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers.
Visual-RFT further extends the application areas of RFT to visual tasks.
arXiv Detail & Related papers (2025-03-03T18:16:32Z)
- Interpreting and Improving Large Language Models in Arithmetic Calculation [72.19753146621429]
Large language models (LLMs) have demonstrated remarkable potential across numerous applications.
In this work, we delve into uncovering a specific mechanism by which LLMs execute calculations.
We investigate the potential benefits of selectively fine-tuning these essential heads/MLPs to boost the LLMs' computational performance.
arXiv Detail & Related papers (2024-09-03T07:01:46Z)
- ReFT: Reasoning with Reinforced Fine-Tuning [9.80361828538909]
We propose a simple yet effective approach called Reinforced Fine-Tuning (ReFT) to enhance the generalizability of LLMs for reasoning.
ReFT first warms up the model with SFT and then employs online reinforcement learning, specifically the PPO algorithm in this paper.
Experiments on the GSM8K, MathQA, and SVAMP datasets show that ReFT significantly outperforms SFT.
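ReFT's two-stage recipe (SFT warmup, then online RL against a correctness reward) can be illustrated with a deliberately tiny, self-contained toy. The sketch below substitutes a REINFORCE update on a tabular softmax policy for the paper's PPO-on-LLM setup, purely so the example runs as-is; the task, constants, and answer set are invented for illustration, not taken from the paper.

```python
import math
import random

random.seed(0)

# Toy two-stage ReFT-style pipeline on a tabular softmax "policy" over
# canned answers. REINFORCE stands in for PPO so the example stays
# self-contained; everything here is illustrative, not the paper's setup.
QUESTIONS = {"2+3": "5", "4+4": "8", "7-2": "5"}
ANSWERS = ["4", "5", "8"]
logits = {q: [0.0] * len(ANSWERS) for q in QUESTIONS}

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sft_warmup(steps=100, lr=0.5):
    # Stage 1: supervised warmup -- push probability toward the gold answer.
    for _ in range(steps):
        q, gold = random.choice(list(QUESTIONS.items()))
        probs = softmax(logits[q])
        for i, a in enumerate(ANSWERS):
            target = 1.0 if a == gold else 0.0
            logits[q][i] += lr * (target - probs[i])  # cross-entropy gradient step

def rl_stage(steps=200, lr=0.5):
    # Stage 2: online RL with a verifiable reward (1.0 iff the sampled
    # answer is correct), updated with plain REINFORCE.
    for _ in range(steps):
        q, gold = random.choice(list(QUESTIONS.items()))
        probs = softmax(logits[q])
        i = random.choices(range(len(ANSWERS)), weights=probs)[0]
        reward = 1.0 if ANSWERS[i] == gold else 0.0
        for j in range(len(ANSWERS)):
            indicator = 1.0 if j == i else 0.0
            logits[q][j] += lr * reward * (indicator - probs[j])  # grad of log pi

sft_warmup()
rl_stage()
for q in QUESTIONS:
    best = max(range(len(ANSWERS)), key=lambda i: logits[q][i])
    print(q, "->", ANSWERS[best])
```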
arXiv Detail & Related papers (2024-01-17T04:43:21Z)
- In Defense of the Learning Without Forgetting for Task Incremental Learning [91.3755431537592]
Catastrophic forgetting is one of the major challenges on the road for continual learning systems.
This paper shows that, with the right architecture and a standard set of augmentations, the results obtained by LwF surpass the latest algorithms for the task-incremental scenario.
arXiv Detail & Related papers (2021-07-26T16:23:13Z)