Policy Gradient with Adaptive Entropy Annealing for Continual Fine-Tuning
- URL: http://arxiv.org/abs/2602.14078v1
- Date: Sun, 15 Feb 2026 10:05:03 GMT
- Title: Policy Gradient with Adaptive Entropy Annealing for Continual Fine-Tuning
- Authors: Yaqian Zhang, Bernhard Pfahringer, Eibe Frank, Albert Bifet
- Abstract summary: We propose a training strategy that transitions from exploratory (CE-like) to exploitative (EPG-like) learning. We evaluate various entropy regularization methods and demonstrate that lower entropy of the output prediction distribution enhances adaptation in pretrained vision models.
- Score: 18.440289150575648
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Despite their success, large pretrained vision models remain vulnerable to catastrophic forgetting when adapted to new tasks in class-incremental settings. Parameter-efficient fine-tuning (PEFT) alleviates this by restricting trainable parameters, yet most approaches still rely on cross-entropy (CE) loss, a surrogate for the 0-1 loss, to learn from new data. We revisit this choice and revive the true objective (0-1 loss) through a reinforcement learning perspective. By formulating classification as a one-step Markov Decision Process, we derive an Expected Policy Gradient (EPG) method that directly minimizes misclassification error with a low-variance gradient estimation. Our analysis shows that CE can be interpreted as EPG with an additional sample-weighting mechanism: CE encourages exploration by emphasizing low-confidence samples, while EPG prioritizes high-confidence ones. Building on this insight, we propose adaptive entropy annealing (aEPG), a training strategy that transitions from exploratory (CE-like) to exploitative (EPG-like) learning. aEPG-based methods outperform CE-based methods across diverse benchmarks and with various PEFT modules. More broadly, we evaluate various entropy regularization methods and demonstrate that lower entropy of the output prediction distribution enhances adaptation in pretrained vision models.
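To make the CE-EPG relationship concrete: for a one-step MDP with reward 1 for the correct label, the expected return is the predicted probability p_y, so the EPG loss is -p_y while CE is -log p_y. Since the gradient of -log p_y equals (1/p_y) times the gradient of -p_y, CE is exactly EPG reweighted by 1/p_y, which amplifies low-confidence samples. The sketch below bridges the two with the loss family L_alpha(p) = (1 - p^alpha)/alpha (alpha -> 0 recovers CE, alpha = 1 gives EPG); this bridge and the linear annealing schedule are illustrative assumptions, not necessarily the paper's exact aEPG formulation.

```python
import torch
import torch.nn.functional as F

def epg_ce_bridge_loss(logits, targets, alpha):
    """Loss family L_alpha(p) = (1 - p^alpha) / alpha.

    alpha -> 0 recovers CE (-log p_y); alpha = 1 gives the EPG loss (1 - p_y),
    whose gradient matches directly maximizing the expected 0-1 reward.
    Intermediate alpha reweights the EPG gradient by p_y^(alpha - 1),
    interpolating the exploration/exploitation emphasis from the abstract.
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_py = log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # log p_y
    if alpha < 1e-4:                       # numerical limit: plain CE
        return -log_py.mean()
    return ((1.0 - log_py.exp().pow(alpha)) / alpha).mean()

def alpha_schedule(step, total_steps):
    # Linear exploratory -> exploitative annealing; the schedule shape is
    # an illustrative assumption, not the paper's adaptive rule.
    return min(1.0, step / total_steps)
```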
Related papers
- AEGPO: Adaptive Entropy-Guided Policy Optimization for Diffusion Models [54.56296715999545]
Reinforcement learning from human feedback shows promise for aligning diffusion and flow models. Policy optimization methods such as GRPO suffer from inefficient and static sampling strategies. We propose Adaptive Entropy-Guided Policy Optimization (AEGPO), a novel dual-signal, dual-level adaptive optimization strategy.
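The summary leaves AEGPO's two signals and levels unspecified; one plausible reading of "entropy-guided adaptive sampling" is to give uncertain prompts more rollouts than a static per-prompt count. The sketch below is a guess along those lines, not the published AEGPO mechanism; the function name and allocation rule are assumptions.

```python
import torch

def allocate_rollouts(prompt_entropies, total_budget, min_rollouts=2):
    """Hypothetical entropy-proportional rollout allocation: uncertain
    (high-entropy) prompts receive more samples than a static per-prompt
    count. Rounding may leave the total slightly off budget."""
    ent = torch.as_tensor(prompt_entropies, dtype=torch.float32)
    extra = total_budget - min_rollouts * ent.numel()
    counts = min_rollouts + (ent / ent.sum() * extra).round().long()
    return counts

# 4 prompts, 24 rollouts total: the high-entropy prompts sample more.
print(allocate_rollouts([0.2, 1.5, 0.9, 0.4], total_budget=24))
```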
arXiv Detail & Related papers (2026-02-06T16:09:50Z)
- Entropy-Gated Selective Policy Optimization: Token-Level Gradient Allocation for Hybrid Training of Large Language Models [18.084251607403406]
Hybrid training methods for large language models combine supervised fine-tuning (SFT) on expert demonstrations with reinforcement learning (RL) on model rollouts, typically at the sample level. We propose Entropy-Gated Selective Policy Optimization (EGSPO), a three-stage framework that extends sample-level mixing with token-level gradient modulation. EGSPO achieves consistent improvements on mathematical reasoning benchmarks, with gains of 3.8 percent on AIME and 2.9 percent on MATH over the CHORD phi baseline, while incurring only 3.4 percent additional computational overhead.
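A minimal sketch of token-level gradient gating by entropy: per-token predictive entropy decides which tokens contribute to the loss, so high-entropy (uncertain) tokens can receive updates while the rest are masked. The threshold rule and names below are illustrative assumptions; EGSPO's actual three-stage modulation is more involved.

```python
import torch
import torch.nn.functional as F

def entropy_gated_nll(logits, targets, tau=2.0):
    """Per-token NLL where gradients flow only through tokens whose
    predictive entropy exceeds tau (in nats). Illustrative gate, not
    the exact EGSPO modulation."""
    log_p = F.log_softmax(logits, dim=-1)                # (B, T, V)
    entropy = -(log_p.exp() * log_p).sum(-1)             # (B, T)
    nll = F.nll_loss(log_p.transpose(1, 2), targets, reduction="none")
    gate = (entropy > tau).float()                       # 1 keeps the gradient
    return (gate * nll).sum() / gate.sum().clamp(min=1.0)
```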
arXiv Detail & Related papers (2026-02-03T09:38:21Z)
- CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning [28.02073546326571]
Policy entropy reflects the balance between exploration and exploitation during training. Existing methods discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We propose Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization (CE-GPPO).
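In standard PPO clipping, tokens whose importance ratio falls outside [1-eps, 1+eps] contribute zero gradient. One way to keep a signal from those tokens is a straight-through clip, sketched below; this is an illustrative stand-in for the idea, not necessarily the exact CE-GPPO formulation.

```python
import torch

def straight_through_clip(ratio, eps=0.2):
    """Forward pass equals the clipped ratio; backward pass flows through
    the unclipped ratio, so tokens outside [1-eps, 1+eps] still carry a
    gradient signal instead of being zeroed out."""
    clipped = ratio.clamp(1.0 - eps, 1.0 + eps)
    return ratio + (clipped - ratio).detach()

def surrogate_loss(logp_new, logp_old, advantages, eps=0.2):
    # Simplified one-sided surrogate (the standard PPO min(.) is omitted
    # for brevity); negated because optimizers minimize.
    ratio = (logp_new - logp_old).exp()
    return -(straight_through_clip(ratio, eps) * advantages).mean()
```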
arXiv Detail & Related papers (2025-09-25T03:22:04Z)
- Think Twice before Adaptation: Improving Adaptability of DeepFake Detection via Online Test-Time Adaptation [1.7811840395202345]
Deepfake (DF) detectors face significant challenges when deployed in real-world environments. Postprocessing techniques can obscure generation artifacts present in DF samples, leading to performance degradation. We propose Think Twice before Adaptation (T$^2$A), a novel online test-time adaptation method.
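The summary does not describe T$^2$A's update rule, so the sketch below shows the generic shape of online test-time adaptation instead: at inference, minimize prediction entropy on each incoming batch while updating only normalization-layer affine parameters (the Tent recipe). It illustrates the setting, not the T$^2$A method itself.

```python
import torch
import torch.nn.functional as F

def collect_norm_params(model):
    # Adapt only normalization-layer parameters; the rest of the
    # detector stays frozen (a common TTA design choice).
    params = []
    for m in model.modules():
        if isinstance(m, (torch.nn.BatchNorm2d, torch.nn.LayerNorm)):
            params += list(m.parameters())
    return params

@torch.enable_grad()
def tta_step(model, x, optimizer):
    # model.train() is assumed so BatchNorm uses test-batch statistics.
    logits = model(x)                       # unlabeled incoming batch
    log_p = F.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(-1).mean()
    optimizer.zero_grad()
    entropy.backward()                      # adapt online, batch by batch
    optimizer.step()
    return logits.detach()
```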
arXiv Detail & Related papers (2025-05-24T16:58:53Z)
- Evolution-based Region Adversarial Prompt Learning for Robustness Enhancement in Vision-Language Models [52.8949080772873]
We propose an evolution-based region adversarial prompt tuning method called ER-APT. In each training iteration, we first generate adversarial examples (AEs) using traditional gradient-based methods. Subsequently, a genetic evolution mechanism incorporating selection, mutation, and crossover is applied to optimize the AEs. The final evolved AEs are used for prompt tuning, achieving region-based adversarial optimization instead of conventional single-point adversarial prompt tuning.
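A compact sketch of the evolutionary step described above: starting from gradient-initialized perturbations, each generation keeps the strongest candidates (highest loss), recombines them, and mutates the offspring. Population size, mutation scale, and the fitness definition here are illustrative assumptions, not ER-APT's published settings.

```python
import torch

def evolve_perturbations(model, x, y, deltas, loss_fn,
                         generations=5, keep=4, sigma=0.01, eps=8 / 255):
    """Genetic refinement of gradient-initialized perturbations: selection
    by loss, uniform crossover, Gaussian mutation, L-inf projection."""
    with torch.no_grad():
        for _ in range(generations):
            fitness = torch.stack([loss_fn(model(x + d), y) for d in deltas])
            idx = fitness.topk(keep).indices.tolist()
            elite = [deltas[i] for i in idx]                      # selection
            children = []
            while len(elite) + len(children) < len(deltas):
                i, j = torch.randint(keep, (2,)).tolist()
                mask = torch.rand_like(elite[i]) < 0.5            # crossover
                child = torch.where(mask, elite[i], elite[j])
                child = child + sigma * torch.randn_like(child)   # mutation
                children.append(child.clamp(-eps, eps))           # project
            deltas = elite + children
    return deltas
```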
arXiv Detail & Related papers (2025-03-17T07:08:47Z)
- Minimizing Energy Costs in Deep Learning Model Training: The Gaussian Sampling Approach [11.878350833222711]
We propose a method called GradSamp for sampling gradient updates from a Gaussian distribution.
GradSamp not only streamlines gradient computation but also enables skipping entire epochs, thereby enhancing overall efficiency.
We rigorously validate our hypothesis across a diverse set of standard and non-standard CNN and transformer-based models.
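A plausible reading of the idea, sketched below: track the mean and variance of each parameter's recent updates, then on designated "skip" epochs draw updates from that per-parameter Gaussian instead of running forward/backward passes. The tracking scheme and skip schedule are assumptions for illustration.

```python
import torch

class GaussianUpdateSampler:
    """Tracks running mean/variance of parameter updates so that, on skip
    epochs, updates can be sampled instead of computed. Illustrative only."""

    def __init__(self, params, momentum=0.9):
        self.params = list(params)
        self.momentum = momentum
        self.mean = [torch.zeros_like(p) for p in self.params]
        self.var = [torch.full_like(p, 1e-8) for p in self.params]

    @torch.no_grad()
    def observe(self, prev_values):
        # Call after optimizer.step(), with parameter values saved beforehand.
        for p, old, m, v in zip(self.params, prev_values, self.mean, self.var):
            delta = p - old
            m.mul_(self.momentum).add_(delta, alpha=1 - self.momentum)
            v.mul_(self.momentum).add_((delta - m) ** 2, alpha=1 - self.momentum)

    @torch.no_grad()
    def skip_step(self):
        # Apply a sampled update instead of a computed gradient step.
        for p, m, v in zip(self.params, self.mean, self.var):
            p.add_(m + v.sqrt() * torch.randn_like(p))
```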
arXiv Detail & Related papers (2024-06-11T15:01:20Z)
- Gradient Projection For Continual Parameter-Efficient Tuning [42.800411328615894]
We reformulate Adapter, LoRA, Prefix-tuning, and Prompt-tuning from the perspective of gradient projection.
We show that a projection condition on the gradient can effectively resist forgetting even for large-scale models.
We extensively evaluate our method with different backbones, including ViT and CLIP, on diverse datasets.
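The core mechanism behind gradient projection is to remove from each new-task gradient the component that lies in the subspace important to old tasks: given an orthonormal basis U of old-task feature directions, update with g - U(U^T g). A minimal sketch, assuming the basis comes from an SVD of stored old-task features (the basis construction is a common choice, not necessarily this paper's exact recipe):

```python
import torch

def old_task_basis(features, energy=0.95):
    """Orthonormal basis for the dominant old-task feature subspace:
    keep enough left singular vectors to cover `energy` of the spectrum."""
    U, S, _ = torch.linalg.svd(features.T, full_matrices=False)  # features: (n, d)
    ratios = S.cumsum(0) / S.sum()
    k = int(torch.searchsorted(ratios, torch.tensor(energy))) + 1
    return U[:, :k]                                              # (d, k)

def project_gradient(grad, U):
    # Keep only the component orthogonal to span(U): g - U (U^T g).
    return grad - U @ (U.T @ grad)
```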
arXiv Detail & Related papers (2024-05-22T06:33:48Z)
- Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
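The gradient-based sparse fine-tuning idea can be sketched in a few lines: keep only the largest-magnitude entries of each gradient so that only a sparse subset of parameters receives increments. The top-k selection rule below is a simple stand-in; SIFT's actual criterion may differ.

```python
import torch

@torch.no_grad()
def sparsify_gradients(model, density=0.01):
    """Zero out all but the largest-magnitude `density` fraction of each
    parameter's gradient before optimizer.step(). Illustrative stand-in
    for SIFT's sparse-increment selection."""
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad.flatten()
        k = max(1, int(density * g.numel()))
        threshold = g.abs().topk(k).values[-1]
        p.grad.mul_((p.grad.abs() >= threshold).to(p.grad.dtype))
```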
arXiv Detail & Related papers (2023-12-19T06:06:30Z)
- Model-Based Reparameterization Policy Gradient Methods: Theory and Practical Algorithms [88.74308282658133]
Reparameterization (RP) Policy Gradient Methods (PGMs) have been widely adopted for continuous control tasks in robotics and computer graphics.
Recent studies have revealed that, when applied to long-term reinforcement learning problems, model-based RP PGMs may experience chaotic and non-smooth optimization landscapes.
We propose a spectral normalization method to mitigate the exploding variance issue caused by long model unrolls.
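Spectral normalization bounds each layer's Lipschitz constant by dividing its weight by its largest singular value, which keeps gradients from growing geometrically through long model unrolls. A minimal sketch using PyTorch's built-in parametrization, applied to a hypothetical learned dynamics model (the architecture is a placeholder):

```python
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

def make_dynamics_model(state_dim, action_dim, hidden=256):
    """Each linear map is constrained to spectral norm <= 1, so gradients
    backpropagated through a T-step unroll cannot blow up geometrically."""
    return nn.Sequential(
        spectral_norm(nn.Linear(state_dim + action_dim, hidden)),
        nn.Tanh(),
        spectral_norm(nn.Linear(hidden, hidden)),
        nn.Tanh(),
        spectral_norm(nn.Linear(hidden, state_dim)),
    )
```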
arXiv Detail & Related papers (2023-10-30T18:43:21Z)
- Parameter-Efficient Learning for Text-to-Speech Accent Adaptation [58.356667204518985]
This paper presents a parameter-efficient learning (PEL) approach to low-resource accent adaptation for text-to-speech (TTS).
A resource-efficient adaptation from a frozen pre-trained TTS model is developed using only 0.8% to 1.2% of the original trainable parameters.
Experimental results show that the proposed methods can achieve competitive naturalness with parameter-efficient decoder fine-tuning.
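To make the parameter budget concrete, the sketch below freezes a stand-in backbone, attaches a small residual adapter, and reports the trainable-parameter ratio. The architecture is a placeholder, not the paper's TTS model, and ~1% ratios like those quoted above depend entirely on the relative module sizes.

```python
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Small bottleneck adapter added on top of a frozen backbone."""
    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)
    def forward(self, x):
        return x + self.up(self.down(x).relu())

def trainable_ratio(model):
    total = sum(p.numel() for p in model.parameters())
    train = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return train / total

backbone = nn.Sequential(*[nn.Linear(512, 512) for _ in range(12)])  # stand-in
for p in backbone.parameters():
    p.requires_grad_(False)                 # freeze the pretrained weights
model = nn.Sequential(backbone, ResidualAdapter(512))
print(f"trainable fraction: {trainable_ratio(model):.3%}")  # well under 1%
```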
arXiv Detail & Related papers (2023-05-18T22:02:59Z)
- An Investigation of the Bias-Variance Tradeoff in Meta-Gradients [53.28925387487846]
Hessian estimation always adds bias and can also add variance to meta-gradient estimation.
We study the bias-variance tradeoff arising from truncated backpropagation and sampling correction.
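Truncated backpropagation, the main source of bias discussed here, can be shown in a few lines: unroll an inner SGD loop, detach the graph before the last K steps, and differentiate the final loss with respect to a meta-parameter. The learnable inner learning rate below is a toy example of my own choosing; shorter truncations cut variance and memory but bias the meta-gradient.

```python
import torch

def truncated_meta_grad(w0, x, y, log_lr, steps=20, K=2):
    """Meta-gradient of the final loss w.r.t. a learnable log-learning-rate,
    backpropagating through only the last K of `steps` inner SGD updates."""
    lr = log_lr.exp()
    w = w0.clone()
    for t in range(steps):
        if t == steps - K:
            w = w.detach().requires_grad_(True)   # truncation point
        loss = ((x @ w - y) ** 2).mean()
        (g,) = torch.autograd.grad(loss, w, create_graph=True)
        w = w - lr * g                            # differentiable update
    final_loss = ((x @ w - y) ** 2).mean()
    return torch.autograd.grad(final_loss, log_lr)[0]

x, y = torch.randn(32, 5), torch.randn(32)
w0 = torch.randn(5, requires_grad=True)
log_lr = torch.tensor(-2.0, requires_grad=True)
print(truncated_meta_grad(w0, x, y, log_lr))      # biased for small K
```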
arXiv Detail & Related papers (2022-09-22T20:33:05Z)