YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation
- URL: http://arxiv.org/abs/2601.08441v1
- Date: Tue, 13 Jan 2026 11:10:13 GMT
- Title: YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation
- Authors: Abdelaziz Bounhar, Rania Hossam Elmohamady Elbadry, Hadi Abdine, Preslav Nakov, Michalis Vazirgiannis, Guokan Shang
- Abstract summary: Yet another Policy Optimization (YaPO) is a reference-free method that learns sparse steering vectors in the latent space of a Sparse Autoencoder. By optimizing sparse codes, YaPO produces disentangled, interpretable, and efficient steering directions. We show that YaPO converges faster, achieves stronger performance, and exhibits improved training stability compared to dense steering baselines.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Steering Large Language Models (LLMs) through activation interventions has emerged as a lightweight alternative to fine-tuning for alignment and personalization. Recent work on Bi-directional Preference Optimization (BiPO) shows that dense steering vectors can be learned directly from preference data in a Direct Preference Optimization (DPO) fashion, enabling control over truthfulness, hallucinations, and safety behaviors. However, dense steering vectors often entangle multiple latent factors due to neuron multi-semanticity, limiting their effectiveness and stability in fine-grained settings such as cultural alignment, where closely related values and behaviors (e.g., among Middle Eastern cultures) must be distinguished. In this paper, we propose Yet another Policy Optimization (YaPO), a reference-free method that learns sparse steering vectors in the latent space of a Sparse Autoencoder (SAE). By optimizing sparse codes, YaPO produces disentangled, interpretable, and efficient steering directions. Empirically, we show that YaPO converges faster, achieves stronger performance, and exhibits improved training stability compared to dense steering baselines. Beyond cultural alignment, YaPO generalizes to a range of alignment-related behaviors, including hallucination, wealth-seeking, jailbreak, and power-seeking. Importantly, YaPO preserves general knowledge, with no measurable degradation on MMLU. Overall, our results show that YaPO provides a general recipe for efficient, stable, and fine-grained alignment of LLMs, with broad applications to controllability and domain adaptation. The associated code and data are publicly available at https://github.com/MBZUAI-Paris/YaPO.
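The abstract's core mechanism can be illustrated concretely: instead of adding a dense vector to the residual stream, YaPO-style steering optimizes a sparse code in an SAE's latent space and decodes it back into model space. The sketch below is a minimal toy illustration of that mechanism, not the authors' implementation; the sizes, the frozen decoder weights, and the `apply_steering` helper are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; real models use d_model in the thousands and far wider SAEs.
d_model, d_sae, k = 16, 64, 4

# Stand-in for a frozen, pretrained SAE decoder mapping latents -> residual stream.
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)

# Learnable steering code: sparse by construction, only k active latents.
z = np.zeros(d_sae)
active = rng.choice(d_sae, size=k, replace=False)
z[active] = 0.5

def apply_steering(h, z, W_dec, alpha=1.0):
    """Decode the sparse code and add it to a residual-stream activation."""
    return h + alpha * (z @ W_dec)

h = rng.normal(size=d_model)            # activation at the intervention layer
h_steered = apply_steering(h, z, W_dec)

print(np.count_nonzero(z), h_steered.shape)
```

Because only `k` latent coordinates are nonzero, the decoded direction is a combination of a handful of (ideally monosemantic) SAE features, which is what gives the method its claimed disentanglement and interpretability over dense steering vectors.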
Related papers
- Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for fine-tuning Large Language Models (LLMs). We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled divergence constraint. DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z)
- Tangent Space Fine-Tuning for Directional Preference Alignment in Large Language Models [0.2538209532048867]
Our goal is to enable large language models to balance human preference dimensions through principled and controllable alignment. We propose Tangent-Space Direct Preference Optimization (TS-DPO), which performs DPO within a locally linear regime to learn per-objective update directions.
arXiv Detail & Related papers (2026-02-01T09:56:28Z)
- MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z)
- GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [133.27496265096445]
We show that Group Relative Policy Optimization is often applied in multi-reward settings without examining its suitability. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues. GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
arXiv Detail & Related papers (2026-01-08T18:59:24Z)
- Multi-Preference Lambda-weighted Listwise DPO for Small-Scale Model Alignment [5.276657230880984]
Large language models (LLMs) demonstrate strong generalization across a wide range of language tasks, but often generate outputs that misalign with human preferences. Direct Preference Optimization (DPO) simplifies the process by treating alignment as a classification task over binary preference pairs. We propose Multi-Preference Lambda-weighted Listwise DPO, which allows the model to learn from more detailed human feedback. Our method consistently outperforms standard DPO on alignment while enabling efficient, controllable, and fine-grained adaptation suitable for real-world deployment.
arXiv Detail & Related papers (2025-06-24T16:47:17Z)
- LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization [58.65395773049273]
Location Preference Optimization (LPO) is a novel approach that leverages locational data to optimize interaction preferences. LPO uses information entropy to predict interaction positions by focusing on zones rich in information. Our code will be made publicly available soon at https://github.com/AIDC-AI/LPO.
arXiv Detail & Related papers (2025-06-11T03:43:30Z)
- Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization [17.801062522027266]
Direct Preference Optimization (DPO) has emerged as a promising framework for aligning Large Language Models with human preferences. Existing methods assign equal importance to all tokens in the response, while humans focus on more meaningful parts. We propose an Optimal Transport-based token weighting scheme for enhancing direct Preference Optimization (OTPO).
arXiv Detail & Related papers (2025-05-24T14:44:15Z)
- A Survey of Direct Preference Optimization [103.59317151002693]
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities. Their alignment with human values remains critical for ensuring helpful and harmless deployments. Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative to reinforcement learning from human feedback.
arXiv Detail & Related papers (2025-03-12T08:45:15Z)
- Robust LLM Alignment via Distributionally Robust Direct Preference Optimization [17.365569075274543]
A major challenge in aligning large language models (LLMs) with human preferences is the issue of distribution shift. We develop two novel distributionally robust direct preference optimization (DPO) algorithms, namely Wasserstein DPO (WDPO) and Kullback-Leibler DPO (KLDPO). We demonstrate the superior performance of WDPO and KLDPO in substantially improving alignment when there is a preference distribution shift.
arXiv Detail & Related papers (2025-02-04T02:03:19Z)
- AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization [45.46582930202524]
α-DPO is an adaptive preference optimization algorithm for large language models. It balances the policy model and the reference model to achieve personalized reward margins. It consistently outperforms DPO and SimPO across various model settings.
arXiv Detail & Related papers (2024-10-14T04:29:57Z)
- Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization [34.05163996072159]
"steering vectors" are extracted from the activations of human preference data.
This work proposes an innovative approach that could produce more effective steering vectors through bi-directional preference optimization.
Our method is designed to allow steering vectors to directly influence the generation probability of contrastive human preference data pairs.
arXiv Detail & Related papers (2024-05-28T05:10:40Z)
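The BiPO idea summarized above, which YaPO's abstract builds on, can be sketched in a few lines: a dense steering vector is added to (or subtracted from) an activation, and the vector is scored by how much it shifts the generation probability toward the preferred continuation in each direction. This is one plausible toy reading of the mechanism, with random stand-in weights, a made-up vocabulary, and hypothetical token indices, not the paper's actual objective.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab = 16, 10

W_out = rng.normal(size=(d_model, vocab))   # stand-in unembedding matrix
h = rng.normal(size=d_model)                # activation without steering
v = rng.normal(size=d_model) * 0.1          # dense steering vector (learnable)

def log_softmax(x):
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def token_logprob(h, v, token, sign):
    """Log-probability of a token when v is added (sign=+1) or subtracted (-1)."""
    return log_softmax((h + sign * v) @ W_out)[token]

pref, dispref = 3, 7   # hypothetical preferred / dispreferred next tokens

# Bi-directional margin: adding v should favor the preferred token,
# while subtracting v should favor the dispreferred one.
margin = (token_logprob(h, v, pref, +1) - token_logprob(h, v, dispref, +1)) \
       + (token_logprob(h, v, dispref, -1) - token_logprob(h, v, pref, -1))
print(float(margin))
```

Maximizing such a margin over `v` (e.g., by gradient ascent) yields a single vector whose sign and scale control the steered behavior; YaPO's contribution is to constrain the analogous optimization to a sparse code in SAE latent space rather than a dense vector in model space.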