ConsistentRFT: Reducing Visual Hallucinations in Flow-based Reinforcement Fine-Tuning
- URL: http://arxiv.org/abs/2602.03425v1
- Date: Tue, 03 Feb 2026 11:49:46 GMT
- Title: ConsistentRFT: Reducing Visual Hallucinations in Flow-based Reinforcement Fine-Tuning
- Authors: Xiaofeng Tan, Jun Liu, Yuanting Fan, Bin-Bin Gao, Xi Jiang, Xiaochen Chen, Jinlong Peng, Chengjie Wang, Hongsong Wang, Feng Zheng,
- Abstract summary: Reinforcement Fine-Tuning (RFT) on flow-based models is crucial for preference alignment.<n>RFT often introduce visual hallucinations like over-optimized details and semantic misalignment.<n>This work preliminarily explores why visual hallucinations arise and how to reduce them.
- Score: 85.20505958752928
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Fine-Tuning (RFT) on flow-based models is crucial for preference alignment. However, they often introduce visual hallucinations like over-optimized details and semantic misalignment. This work preliminarily explores why visual hallucinations arise and how to reduce them. We first investigate RFT methods from a unified perspective, and reveal the core problems stemming from two aspects, exploration and exploitation: (1) limited exploration during stochastic differential equation (SDE) rollouts, leading to an over-emphasis on local details at the expense of global semantics, and (2) trajectory imitation process inherent in policy gradient methods, distorting the model's foundational vector field and its cross-step consistency. Building on this, we propose ConsistentRFT, a general framework to mitigate these hallucinations. Specifically, we design a Dynamic Granularity Rollout (DGR) mechanism to balance exploration between global semantics and local details by dynamically scheduling different noise sources. We then introduce a Consistent Policy Gradient Optimization (CPGO) that preserves the model's consistency by aligning the current policy with a more stable prior. Extensive experiments demonstrate that ConsistentRFT significantly mitigates visual hallucinations, achieving average reductions of 49\% for low-level and 38\% for high-level perceptual hallucinations. Furthermore, ConsistentRFT outperforms other RFT methods on out-of-domain metrics, showing an improvement of 5.1\% (v.s. the baseline's decrease of -0.4\%) over FLUX1.dev. This is \href{https://xiaofeng-tan.github.io/projects/ConsistentRFT}{Project Page}.
Related papers
- TopoCurate:Modeling Interaction Topology for Tool-Use Agent Training [53.93696896939915]
Training tool-use agents typically rely on Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks.<n>We propose TopoCurate, an interaction-aware framework that projects multi-trial rollouts from the same task into a unified semantic quotient topology.<n>TopoCurate achieves consistent gains of 4.2% (SFT) and 6.9% (RL) over state-of-the-art baselines.
arXiv Detail & Related papers (2026-03-02T10:38:54Z) - Unbiased Gradient Estimation for Event Binning via Functional Backpropagation [64.88399635309918]
We propose a novel framework for unbiased gradient estimation of arbitrary binning functions by synthesizing weak derivatives during backpropagation.<n>We achieve 9.4% lower EPE in self-supervised optical flow, and 5.1% lower RMS error in SLAM, demonstrating broad benefits for event-based visual perception.
arXiv Detail & Related papers (2026-02-13T04:05:03Z) - Toward Generalizable Deblurring: Leveraging Massive Blur Priors with Linear Attention for Real-World Scenarios [9.82847623835017]
GLOWDeblur is a Generalizable reaL-wOrld lightWeight Deblur model that combines convolution-based pre-reconstruction & domain alignment module with a lightweight diffusion backbone.<n>We propose Blur Pattern Pretraining (BPP), which acquires blur priors from simulation datasets and transfers them through joint fine-tuning on real data.<n>We further introduce Motion and Semantic Guidance (MoSeG) to strengthen blur priors under severe degradation, and integrate it into GLOWDeblur, a Generalizable reaL-wOrld lightWeight Deblur model that combines convolution-based pre-reconstruction &
arXiv Detail & Related papers (2026-01-10T11:01:31Z) - DynaPURLS: Dynamic Refinement of Part-aware Representations for Skeleton-based Zero-Shot Action Recognition [51.80782323686666]
We introduce textbfDynaPURLS, a unified framework that establishes robust, multi-scale visual-semantic correspondences.<n>Our framework leverages a large language model to generate hierarchical textual descriptions that encompass both global movements and local body-part dynamics.<n>Experiments on three large-scale benchmark datasets, including NTU RGB+D 60/120 and PKU-MMD, demonstrate that DynaPURLS significantly outperforms prior art.
arXiv Detail & Related papers (2025-12-12T10:39:10Z) - Unifying Sign and Magnitude for Optimizing Deep Vision Networks via ThermoLion [0.0]
Current paradigms impose a static compromise on information channel drift parameters.<n>We introduce a "low-dimensional" exploration model and a "low-dimensional" dynamic alignment framework.
arXiv Detail & Related papers (2025-12-01T17:04:17Z) - HADSF: Aspect Aware Semantic Control for Explainable Recommendation [4.75127493865044]
Recent advances in large language models (LLMs) promise more effective information extraction for recommender systems.<n>We propose a two-stage approach that induces a compact, corpus-level aspect vocabulary via adaptive selection and then performs vocabulary-guided, explicitly constrained extraction of structured aspect-opinion triples.<n> Experiments on approximately 3 million reviews spanning 1.5B-70B parameters show that, when integrated into standard rating predictors, HADSF yields consistent reductions in prediction error.
arXiv Detail & Related papers (2025-10-30T20:49:33Z) - On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting [91.38734024438357]
Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs)<n>Existing approaches that integrate SFT and RL often face the risk of disrupting established response patterns and inducing overfitting to expert data.<n>We propose CHORD, a framework for Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting.
arXiv Detail & Related papers (2025-08-15T11:20:03Z) - FedHL: Federated Learning for Heterogeneous Low-Rank Adaptation via Unbiased Aggregation [6.5370850242187855]
Federated Learning (FL) facilitates the fine-tuning of Foundation Models (FMs) using distributed data sources.<n>Low-Rank Adaptation (LoRA) gaining popularity due to its low communication costs and strong performance.<n>Existing methods lack formal convergence guarantees due to parameter truncation and biased gradient updates.
arXiv Detail & Related papers (2025-05-24T04:12:12Z) - Flow-GRPO: Training Flow Matching Models via Online RL [80.62659379624867]
We propose Flow-GRPO, the first method to integrate online policy reinforcement learning into flow matching models.<n>Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation into an equivalent Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original number of inference steps.
arXiv Detail & Related papers (2025-05-08T17:58:45Z) - ResFlow: Fine-tuning Residual Optical Flow for Event-based High Temporal Resolution Motion Estimation [50.80115710105251]
Event cameras hold significant promise for high-temporal-resolution (HTR) motion estimation.<n>We propose a residual-based paradigm for estimating HTR optical flow with event data.
arXiv Detail & Related papers (2024-12-12T09:35:47Z) - BlindDiff: Empowering Degradation Modelling in Diffusion Models for Blind Image Super-Resolution [52.47005445345593]
BlindDiff is a DM-based blind SR method to tackle the blind degradation settings in SISR.
BlindDiff seamlessly integrates the MAP-based optimization into DMs.
Experiments on both synthetic and real-world datasets show that BlindDiff achieves the state-of-the-art performance.
arXiv Detail & Related papers (2024-03-15T11:21:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.