Unveiling Markov Heads in Pretrained Language Models for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2409.06985v2
- Date: Fri, 06 Jun 2025 19:11:05 GMT
- Title: Unveiling Markov Heads in Pretrained Language Models for Offline Reinforcement Learning
- Authors: Wenhao Zhao, Qiushui Xu, Linjie Xu, Lei Song, Jinyu Wang, Chunlai Zhou, Jiang Bian
- Abstract summary: Incorporating knowledge from pretrained language models (PLMs) into decision transformers (DTs) has generated significant attention in offline reinforcement learning (RL). PLMs perform well in RL tasks, raising an intriguing question: what kind of knowledge has been transferred to RL to achieve such good results? We propose a general method, GPT2-DTMA, which equips a pretrained DT with Mixture of Attention (MoA) to accommodate diverse attention requirements during fine-tuning.
- Score: 10.631495275246428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, incorporating knowledge from pretrained language models (PLMs) into decision transformers (DTs) has generated significant attention in offline reinforcement learning (RL). These PLMs perform well in RL tasks, raising an intriguing question: what kind of knowledge from PLMs has been transferred to RL to achieve such good results? This work first dives into this problem by analyzing each attention head quantitatively and identifies the Markov head, a crucial component that exists in the attention heads of PLMs. It leads to extreme attention on the last-input token and performs well only in short-term environments. Furthermore, we prove that this extreme attention cannot be changed by re-training the embedding layer or by fine-tuning. Inspired by our analysis, we propose a general method, GPT2-DTMA, which equips a pretrained DT with Mixture of Attention (MoA) to accommodate diverse attention requirements during fine-tuning. Extensive experiments corroborate our theorems and demonstrate the effectiveness of GPT2-DTMA: it achieves comparable performance in short-term environments while significantly narrowing the performance gap in long-term environments.
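To make the abstract's two main ideas concrete, below is a minimal, illustrative sketch (PyTorch plus the Hugging Face transformers GPT-2 checkpoint). It shows, first, one way to quantify how much attention each pretrained head places on the most recent input token, the kind of per-head analysis used to identify Markov heads, and second, a hypothetical Mixture-of-Attention gate that blends a frozen pretrained attention branch with a trainable one. The threshold, gating design, and module names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of two ideas from the abstract:
# (1) per-head measurement of attention mass on the most recent token
#     ("Markov head" candidates) in a pretrained GPT-2, and
# (2) a hypothetical Mixture-of-Attention (MoA) gate mixing a frozen
#     pretrained attention branch with a freshly initialized one.
import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer

# --- (1) Markov-head detection ------------------------------------------
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True).eval()

text = "state action reward state action reward"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions: one tensor per layer, shape (batch, num_heads, seq, seq).
for layer_idx, attn in enumerate(out.attentions):
    # With causal masking, the diagonal entry attn[..., t, t] is the attention
    # each query places on the most recent token -- a simple proxy for the
    # "extreme attention on the last-input token" described in the abstract.
    diag_mass = attn[0].diagonal(dim1=-2, dim2=-1).mean(dim=-1)  # (num_heads,)
    for head_idx, mass in enumerate(diag_mass.tolist()):
        if mass > 0.9:  # illustrative threshold for "extreme" attention
            print(f"layer {layer_idx}, head {head_idx}: {mass:.3f} (Markov-like)")

# --- (2) Hypothetical MoA gate -------------------------------------------
class MixtureOfAttention(nn.Module):
    """Blend a frozen pretrained attention branch with a trainable one.

    A guess at the spirit of GPT2-DTMA's MoA, not its actual design: a
    learned, input-dependent gate decides how much to rely on the pretrained
    (short-term, Markov-like) branch versus the newly trained branch.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.pretrained_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.new_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        for p in self.pretrained_attn.parameters():
            p.requires_grad = False  # keep the pretrained branch frozen
        self.gate = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pre, _ = self.pretrained_attn(x, x, x, need_weights=False)
        new, _ = self.new_attn(x, x, x, need_weights=False)
        g = torch.sigmoid(self.gate(x))  # (batch, seq, 1), per-token gate
        return g * pre + (1.0 - g) * new
```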
Related papers
- Decomposing the Entropy-Performance Exchange: The Missing Keys to Unlocking Effective Reinforcement Learning [106.68304931854038]
Reinforcement learning with verifiable rewards (RLVR) has been widely used for enhancing the reasoning abilities of large language models (LLMs). We conduct a systematic empirical analysis of the entropy-performance exchange mechanism of RLVR across different levels of granularity. Our analysis reveals that, in the rising stage, entropy reduction in negative samples facilitates the learning of effective reasoning patterns. In the plateau stage, learning efficiency strongly correlates with high-entropy tokens present in low-perplexity samples and those located at the end of sequences.
arXiv Detail & Related papers (2025-08-04T10:08:10Z) - AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection [9.555955025064895]
We propose AttentionInfluence to identify reasoning-intensive pretraining data. Our approach enables a small pretrained language model to act as a strong data selector through a simple attention head masking operation. Our experimental results demonstrate substantial improvements, ranging from 1.4pp to 3.5pp, across several knowledge-intensive and reasoning-heavy benchmarks.
arXiv Detail & Related papers (2025-05-12T07:25:51Z) - Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z) - Evolution-based Region Adversarial Prompt Learning for Robustness Enhancement in Vision-Language Models [52.8949080772873]
We propose an evolution-based region adversarial prompt tuning method called ER-APT.
In each training iteration, we first generate AEs using traditional gradient-based methods.
Subsequently, a genetic evolution mechanism incorporating selection, mutation, and crossover is applied to optimize the AEs.
The final evolved AEs are used for prompt tuning, achieving region-based adversarial optimization instead of conventional single-point adversarial prompt tuning.
arXiv Detail & Related papers (2025-03-17T07:08:47Z) - Don't Take Things Out of Context: Attention Intervention for Enhancing Chain-of-Thought Reasoning in Large Language Models [32.71672086718058]
Few-shot Chain-of-Thought (CoT) significantly enhances the reasoning capabilities of large language models (LLMs). We observe that isolated segments, words, or tokens within CoT demonstrations can unexpectedly disrupt the generation process of LLMs. We propose a Few-shot Attention Intervention method (FAI) that dynamically analyzes the attention patterns of demonstrations to accurately identify these tokens.
arXiv Detail & Related papers (2025-03-14T07:46:33Z) - Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z) - Learn from Downstream and Be Yourself in Multimodal Large Language Model Fine-Tuning [104.27224674122313]
Fine-tuning MLLM has become a common practice to improve performance on specific downstream tasks.
To balance the trade-off between generalization and specialization, we propose measuring the parameter importance for both pre-trained and fine-tuning distributions.
arXiv Detail & Related papers (2024-11-17T01:16:37Z) - TapWeight: Reweighting Pretraining Objectives for Task-Adaptive Pretraining [34.93043212352875]
TapWeight is a task-adaptive pretraining framework which automatically determines the optimal importance of each pretraining objective.
We applied TapWeight to both molecular property prediction and natural language understanding tasks, significantly surpassing baseline methods.
arXiv Detail & Related papers (2024-10-13T20:56:13Z) - Meta-TTT: A Meta-learning Minimax Framework For Test-Time Training [5.9631503543049895]
Test-time domain adaptation is a challenging task that aims to adapt a pre-trained model to limited, unlabeled target data during inference.
This paper introduces a meta-learning minimax framework for test-time training on batch normalization layers.
arXiv Detail & Related papers (2024-10-02T16:16:05Z) - How Effective is Pre-training of Large Masked Autoencoders for Downstream Earth Observation Tasks? [9.515532265294187]
Self-supervised pre-training has proven highly effective for many computer vision tasks.
It remains unclear under which conditions pre-trained models offer significant advantages over training from scratch.
arXiv Detail & Related papers (2024-09-27T08:15:14Z) - Denoising Pre-Training and Customized Prompt Learning for Efficient Multi-Behavior Sequential Recommendation [69.60321475454843]
We propose DPCPL, the first pre-training and prompt-tuning paradigm tailored for Multi-Behavior Sequential Recommendation.
In the pre-training stage, we propose a novel Efficient Behavior Miner (EBM) to filter out the noise at multiple time scales.
Subsequently, we propose to tune the pre-trained model in a highly efficient manner with the proposed Customized Prompt Learning (CPL) module.
arXiv Detail & Related papers (2024-08-21T06:48:38Z) - SLCA++: Unleash the Power of Sequential Fine-tuning for Continual Learning with Pre-training [68.7896349660824]
We present an in-depth analysis of the progressive overfitting problem from the lens of Seq FT.
Considering that the overly fast representation learning and the biased classification layer constitute this particular problem, we introduce the advanced Slow Learner with Alignment (SLCA++) framework.
Our approach involves a Slow Learner to selectively reduce the learning rate of backbone parameters, and an Alignment module to align the disjoint classification layers in a post-hoc fashion.
arXiv Detail & Related papers (2024-08-15T17:50:07Z) - Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain the stability of VLMs in terms of zero-shot generalization; the resulting method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the models in the few-shot image classification scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z) - HiDe-PET: Continual Learning via Hierarchical Decomposition of Parameter-Efficient Tuning [55.88910947643436]
We propose a unified framework for continual learning (CL) with pre-trained models (PTMs) and parameter-efficient tuning (PET).
We present Hierarchical Decomposition PET (HiDe-PET), an innovative approach that explicitly optimizes the objective by incorporating task-specific and task-shared knowledge.
Our approach demonstrates remarkably superior performance over a broad spectrum of recent strong baselines.
arXiv Detail & Related papers (2024-07-07T01:50:25Z) - Multi-Epoch learning with Data Augmentation for Deep Click-Through Rate Prediction [53.88231294380083]
We introduce a novel Multi-Epoch learning with Data Augmentation (MEDA) framework, suitable for both non-continual and continual learning scenarios.
MEDA minimizes overfitting by reducing the dependency of the embedding layer on subsequent training data.
Our findings confirm that pre-trained layers can adapt to new embedding spaces, enhancing performance without overfitting.
arXiv Detail & Related papers (2024-06-27T04:00:15Z) - Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration [15.36841874118801]
We aim to provide a more profound understanding of the existence of attention sinks within large language models (LLMs).
We propose a training-free Attention Calibration Technique (ACT) that automatically optimizes the attention distributions on the fly during inference in an input-adaptive manner.
ACT achieves an average improvement of up to 7.30% in accuracy across different datasets when applied to Llama-30B.
arXiv Detail & Related papers (2024-06-22T07:00:43Z) - Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models [73.48675708831328]
We propose a novel parameter and computation efficient tuning method for Multi-modal Large Language Models (MLLMs)
The Efficient Attention Skipping (EAS) method evaluates the attention redundancy and skips the less important MHAs to speed up inference.
The experiments show that EAS not only retains high performance and parameter efficiency, but also greatly speeds up inference.
arXiv Detail & Related papers (2024-03-22T14:20:34Z) - Evolving Domain Adaptation of Pretrained Language Models for Text Classification [24.795214770636534]
Adapting pre-trained language models (PLMs) for time-series text classification amidst evolving domain shifts (EDS) is critical for maintaining accuracy in applications like stance detection.
This study benchmarks the effectiveness of evolving domain adaptation (EDA) strategies, notably self-training, domain-adversarial training, and domain-adaptive pretraining, with a focus on an incremental self-training method.
arXiv Detail & Related papers (2023-11-16T08:28:00Z) - Domain Adaptation with Adversarial Training on Penultimate Activations [82.9977759320565]
Enhancing model prediction confidence on unlabeled target data is an important objective in Unsupervised Domain Adaptation (UDA)
We show that this strategy is more efficient and better correlated with the objective of boosting prediction confidence than adversarial training on input images or intermediate features.
arXiv Detail & Related papers (2022-08-26T19:50:46Z) - PGTRNet: Two-phase Weakly Supervised Object Detection with Pseudo Ground Truth Refining [10.262660606897974]
Weakly Supervised Object Detection (WSOD), which aims to train detectors with only image-level annotations, has attracted increasing attention.
Current state-of-the-art approaches mainly follow a two-stage training strategy which integrates a fully supervised detector (FSD) with a pure WSOD model.
There are two main problems hindering the performance of two-phase WSOD approaches, i.e., the insufficient learning problem and the strict reliance between the FSD and the pseudo ground truth generated by the WSOD model.
This paper proposes the pseudo ground truth refinement network (PGTRNet), a simple yet effective method.
arXiv Detail & Related papers (2021-08-25T19:20:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.