Visual-RFT: Visual Reinforcement Fine-Tuning
- URL: http://arxiv.org/abs/2503.01785v1
- Date: Mon, 03 Mar 2025 18:16:32 GMT
- Title: Visual-RFT: Visual Reinforcement Fine-Tuning
- Authors: Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang,
- Abstract summary: Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers.<n>Visual-RFT further extends the application areas of RFT on visual tasks.
- Score: 75.20572976629646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications when fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT on visual tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs) to generate multiple responses containing reasoning tokens and final answers for each input, and then uses our proposed visual perception verifiable reward functions to update the model via the policy optimization algorithm such as Group Relative Policy Optimization (GRPO). We design different verifiable reward functions for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. Experimental results on fine-grained image classification, few-shot object detection, reasoning grounding, as well as open-vocabulary object detection benchmarks show the competitive performance and advanced generalization ability of Visual-RFT compared with Supervised Fine-tuning (SFT). For example, Visual-RFT improves accuracy by $24.3\%$ over the baseline in one-shot fine-grained image classification with around 100 samples. In few-shot object detection, Visual-RFT also exceeds the baseline by $21.9$ on COCO's two-shot setting and $15.4$ on LVIS. Our Visual-RFT represents a paradigm shift in fine-tuning LVLMs, offering a data-efficient, reward-driven approach that enhances reasoning and adaptability for domain-specific tasks.
Related papers
- Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback [33.127607245587576]
We introduce a framework that enables MLLMs to learn complex visual reasoning from only raw images.<n>We demonstrate that this relative ease provides an ideal reward signal for optimization via Reinforcement Learning.<n>The RRVF-trained model not only outperforms existing MLLMs and supervised fine-tuning baselines but also exhibits superior generalization.
arXiv Detail & Related papers (2025-07-28T12:21:19Z) - VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning [45.39372905700317]
We introduce VRAG-RL, a novel RL framework tailored for complex reasoning across visually rich information.<n>With this framework, VLMs interact with search engines, autonomously sampling single-turn or multi-turn reasoning trajectories.<n>Our approach highlights key limitations of RL in RAG domains.
arXiv Detail & Related papers (2025-05-28T06:30:51Z) - One RL to See Them All: Visual Triple Unified Reinforcement Learning [92.90120580989839]
We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables visual reasoning and perception tasks within a single training pipeline.<n>V-Triune comprises triple complementary components: Sample-Level Datashelf (to unify diverse task inputs), Verifier-Level Reward (to deliver custom rewards via specialized verifiers).<n>We introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune.
arXiv Detail & Related papers (2025-05-23T17:41:14Z) - Visual Agentic Reinforcement Fine-Tuning [73.37007472426299]
This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities for Large Vision-Language Models (LVLMs)<n>With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and write code to manipulate and analyze input images through cropping, rotation, and other image processing techniques.<n>Our experimental results demonstrate that Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search
arXiv Detail & Related papers (2025-05-20T11:59:25Z) - Compile Scene Graphs with Reinforcement Learning [69.36723767339001]
Next token prediction is the fundamental principle for training large language models (LLMs)
We introduce R1-SGG, a multimodal LLM (M-LLM) trained via supervised fine-tuning (SFT) on the scene graph dataset.
We design a graph-centric reward function that integrates node-level rewards, edge-level rewards, and a format consistency reward.
arXiv Detail & Related papers (2025-04-18T10:46:22Z) - ToolRL: Reward is All Tool Learning Needs [54.16305891389931]
Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities.
Recent advancements in reinforcement learning (RL) have demonstrated promising reasoning and generalization abilities.
We present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm.
arXiv Detail & Related papers (2025-04-16T21:45:32Z) - On the Suitability of Reinforcement Fine-Tuning to Visual Tasks [15.971601297360227]
Researchers have been starting to apply RFT to MLLMs, hoping it will also enhance the capabilities of visual understanding.
In this work, we endeavor to understand the suitabilities and limitations of RFT for visual tasks, through experimental analysis and observations.
arXiv Detail & Related papers (2025-04-08T04:45:00Z) - Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning [19.28434717501445]
Visual reasoning abilities play a crucial role in understanding complex multimodal data.
Existing methods improve VLM reasoning via Chain-of-Thought supervised fine-tuning.
We propose Reason-RFT, a novel reinforcement fine-tuning framework.
arXiv Detail & Related papers (2025-03-26T17:38:06Z) - OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement [91.88062410741833]
This study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs)
We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and Reinforcement Learning (RL) to further improve model generalization.
OpenVLThinker, a LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrates the potential of our strategy for robust vision-language reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z) - OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning [21.931666005603553]
OpenAI's recent introduction of Reinforcement Fine-Tuning (RFT) showcases the potential of reasoning foundation model.<n>This technical report presents emphOpenRFT, our attempt to fine-tune generalist reasoning models for domain-specific tasks.<n>The evaluation is conducted on SciKnowEval, where OpenRFT achieves notable performance gains with only $100$ domain-specific samples for each task.
arXiv Detail & Related papers (2024-12-22T04:21:30Z) - Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models.
Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information.
Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2024-11-02T18:18:35Z) - Fine-Tuning or Fine-Failing? Debunking Performance Myths in Large Language Models [0.8399688944263842]
Large Language Models (LLMs) have the capability to understand and generate human-like text from input queries.
This study extends this concept to the integration of LLMs within Retrieval-Augmented Generation (RAG) pipelines.
We evaluate the impact of fine-tuning on the LLMs' capacity for data extraction and contextual understanding.
arXiv Detail & Related papers (2024-06-17T04:35:17Z) - Revisiting Few-Shot Object Detection with Vision-Language Models [49.79495118650838]
We revisit the task of few-shot object detection (FSOD) in the context of recent foundational vision-language models (VLMs)
We propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external data.
We discuss our recent CVPR 2024 Foundational FSOD competition and share insights from the community.
arXiv Detail & Related papers (2023-12-22T07:42:00Z) - Zero-shot Visual Relation Detection via Composite Visual Cues from Large
Language Models [44.60439935450292]
We propose a novel method for zero-shot visual recognition: RECODE.
It decomposes each predicate category into subject, object, and spatial components.
Different visual cues enhance the discriminability of similar relation categories from different perspectives.
arXiv Detail & Related papers (2023-05-21T14:40:48Z) - Good Visual Guidance Makes A Better Extractor: Hierarchical Visual
Prefix for Multimodal Entity and Relation Extraction [88.6585431949086]
We propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for visual-enhanced entity and relation extraction.
We regard visual representation as pluggable visual prefix to guide the textual representation for error insensitive forecasting decision.
Experiments on three benchmark datasets demonstrate the effectiveness of our method, and achieve state-of-the-art performance.
arXiv Detail & Related papers (2022-05-07T02:10:55Z) - RPT++: Customized Feature Representation for Siamese Visual Tracking [16.305972000224358]
We argue that the performance gain of visual tracking is limited since features extracted from the salient area provide more recognizable visual patterns for classification.
We propose two customized feature extractors, named polar pooling and extreme pooling to capture task-specific visual patterns.
We demonstrate the effectiveness of the task-specific feature representation by integrating it into the recent and advanced tracker RPT.
arXiv Detail & Related papers (2021-10-23T10:58:57Z) - Adversarial Feature Augmentation and Normalization for Visual
Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.