Boosting RL-Based Visual Reasoning with Selective Adversarial Entropy Intervention
- URL: http://arxiv.org/abs/2512.10414v1
- Date: Thu, 11 Dec 2025 08:27:02 GMT
- Title: Boosting RL-Based Visual Reasoning with Selective Adversarial Entropy Intervention
- Authors: Yang Yu, Zhuangzhuang Chen, Siqi Wang, Lanqing Li, Xiaomeng Li
- Abstract summary: We propose entropy-guided adversarial sampling (EgAS) that formulates the entropy of sampled responses as an adversarial objective. The corresponding adversarial gradient can be used to attack the visual input for producing adversarial samples, allowing the policy model to explore a larger answer space during RL sampling.
- Score: 29.17663968152743
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, reinforcement learning (RL) has become a common choice for enhancing the reasoning capabilities of vision-language models (VLMs). Among existing RL-based finetuning methods, entropy intervention has proven an effective way to strengthen exploratory ability and thereby improve policy performance. Notably, most existing studies intervene in entropy simply by controlling the update of specific tokens during the policy-optimization stage of RL. They ignore entropy intervention during RL sampling, which can boost the performance of GRPO by improving the diversity of responses. In this paper, we propose Selective-adversarial Entropy Intervention (SaEI), which enhances policy entropy by distorting the visual input with a token-selective adversarial objective derived from the entropy of sampled responses. Specifically, we first propose entropy-guided adversarial sampling (EgAS), which formulates the entropy of sampled responses as an adversarial objective. The corresponding adversarial gradient can then be used to attack the visual input and produce adversarial samples, allowing the policy model to explore a larger answer space during RL sampling. We further propose token-selective entropy computation (TsEC) to maximize the effectiveness of the adversarial attack in EgAS without distorting factual knowledge within VLMs. Extensive experiments on both in-domain and out-of-domain datasets show that our proposed method greatly improves policy exploration via entropy intervention, boosting reasoning capabilities. Code will be released once the paper is accepted.
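The code has not been released, so the following is a minimal sketch of how EgAS and TsEC, as described in the abstract, could fit together during rollout sampling. It assumes a PyTorch-style, Hugging Face-like VLM interface; the function name egas_perturb, the policy call signature, and hyperparameters such as epsilon, alpha, steps, and keep_frac are all hypothetical, and the top-k entropy selection is only one plausible reading of TsEC.

```python
import torch
import torch.nn.functional as F

def egas_perturb(policy, pixel_values, input_ids,
                 epsilon=2/255, alpha=1/255, steps=3, keep_frac=0.2):
    """Hypothetical EgAS loop: nudge the image so that the policy's
    token-level entropy rises, computing that entropy only over the
    highest-entropy token positions (a guess at TsEC's selection)."""
    delta = torch.zeros_like(pixel_values, requires_grad=True)
    for _ in range(steps):
        logits = policy(pixel_values=pixel_values + delta,
                        input_ids=input_ids).logits           # (B, T, V)
        log_p = F.log_softmax(logits, dim=-1)
        token_entropy = -(log_p.exp() * log_p).sum(-1)        # (B, T)
        # Token-selective entropy: average only the top-k most uncertain
        # positions, so the attack targets high-entropy "fork" tokens
        # rather than tokens that carry factual content.
        k = max(1, int(keep_frac * token_entropy.size(-1)))
        adv_obj = token_entropy.topk(k, dim=-1).values.mean()
        grad, = torch.autograd.grad(adv_obj, delta)
        # Gradient ascent on entropy under an L-infinity pixel budget.
        delta = (delta + alpha * grad.sign()).clamp(-epsilon, epsilon)
        delta = delta.detach().requires_grad_(True)
    return (pixel_values + delta).detach()
```

Under this reading, the perturbed image replaces the original only while sampling GRPO rollouts; the policy-gradient update itself is unchanged, so the intervention widens exploration without altering the optimization rule.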
Related papers
- How Sampling Shapes LLM Alignment: From One-Shot Optima to Iterative Dynamics [65.67654005892469]
We show that proper instance-dependent sampling can yield stronger ranking guarantees, while skewed on-policy sampling can induce excessive concentration under structured preferences. We then analyze iterative alignment dynamics in which the learned policy feeds back into future sampling and reference policies. Our theoretical insights extend to Direct Preference Optimization, indicating the phenomena we captured are common to a broader class of preference-alignment methods.
arXiv Detail & Related papers (2026-02-12T17:11:08Z)
- AEGPO: Adaptive Entropy-Guided Policy Optimization for Diffusion Models [54.56296715999545]
Reinforcement learning from human feedback shows promise for aligning diffusion and flow models. Policy optimization methods such as GRPO suffer from inefficient and static sampling strategies. We propose Adaptive Entropy-Guided Policy Optimization (AEGPO), a novel dual-signal, dual-level adaptive optimization strategy.
arXiv Detail & Related papers (2026-02-05T04:06:55Z)
- Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities [10.235183326885794]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs). We analyze this issue from the perspective of sampling probability dynamics, identifying that the standard objective disproportionately reinforces the highest-likelihood paths. We propose a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate the confidence levels across all correct responses.
arXiv Detail & Related papers (2025-11-08T12:50:41Z)
- Revisiting Entropy in Reinforcement Learning for Large Reasoning Models [54.96908589622163]
We investigate the entropy dynamics of large language models trained with reinforcement learning with verifiable rewards (RLVR). Our findings reveal that the number of off-policy updates, the diversity of training data, and the clipping thresholds in the optimization objective are critical factors influencing the entropy of LLMs trained with RLVR.
arXiv Detail & Related papers (2025-05-15T16:33:44Z)
- Fine-tuning Diffusion Policies with Backpropagation Through Diffusion Timesteps [13.28742762414913]
We introduce NCDPO, a novel framework that reformulates Diffusion Policy as a noise-conditioned deterministic policy. Our experiments demonstrate that NCDPO achieves sample efficiency comparable to Proximal Policy Optimization (PPO) when training from scratch.
arXiv Detail & Related papers (2025-03-13T06:40:34Z)
- Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy. We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-25T10:45:46Z)
- Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality. We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO). Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions. We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-03-08T09:20:12Z)
- Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation [46.61909578101735]
Adversarial Policy Optimization (AdvPO) is a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback.
In this paper, we introduce a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model.
arXiv Detail & Related papers (2023-11-30T16:53:32Z)
- Predictable Reinforcement Learning Dynamics through Entropy Rate Minimization [16.335645061396455]
In Reinforcement Learning (RL), agents have no incentive to exhibit predictable behaviors. We propose a novel method to induce predictable behavior in RL agents, termed Predictability-Aware RL (PARL). Our method maximizes a linear combination of a standard discounted reward and the negative entropy rate, thus trading off optimality with predictability.
arXiv Detail & Related papers (2020-06-02T06:51:25Z)
- Diversity Actor-Critic: Sample-Aware Entropy Regularization for Sample-Efficient Exploration [22.539300644593936]
Exploiting the sample distribution obtainable from the replay buffer, the proposed sample-aware entropy regularization maximizes the entropy of a weighted mixture of the policy's action distribution and the sample action distribution from the replay buffer, enabling sample-efficient exploration (a sketch of this objective appears after the list).
A practical algorithm named diversity actor-critic (DAC) is developed by applying policy iteration to the objective function with the proposed sample-aware entropy regularization.
arXiv Detail & Related papers (2020-06-02T06:51:25Z)
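For the diversity actor-critic entry above, the stated objective is compact enough to write out. The following is a minimal LaTeX sketch in our own notation, not the authors'; the mixing weight c and entropy temperature alpha are assumptions about how the weighted sum and entropy bonus are parameterized.

```latex
% Sketch of DAC's sample-aware entropy regularization (our notation):
%   \pi(\cdot|s)  current policy action distribution
%   q(\cdot|s)    action distribution of replay-buffer samples
%   c \in [0,1]   mixing weight (assumed), \alpha entropy temperature (assumed)
J(\pi) \;=\; \mathbb{E}\!\left[\,\sum_{t} \gamma^{t}\Big(
    r(s_t, a_t) \;+\; \alpha\,
    \mathcal{H}\big(c\,\pi(\cdot \mid s_t) + (1-c)\,q(\cdot \mid s_t)\big)
\Big)\right]
```

Compared with standard maximum-entropy RL, which uses the entropy of the policy alone, mixing in the replay-sample distribution rewards actions that are rare in the buffer, which is what makes the exploration sample-aware.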