Internal Value Alignment in Large Language Models through Controlled Value Vector Activation
- URL: http://arxiv.org/abs/2507.11316v1
- Date: Tue, 15 Jul 2025 13:48:35 GMT
- Title: Internal Value Alignment in Large Language Models through Controlled Value Vector Activation
- Authors: Haoran Jin, Meng Li, Xiting Wang, Zhihao Xu, Minlie Huang, Yantao Jia, Defu Lian,
- Abstract summary: We introduce a Controlled Value Vector Activation (ConVA) method to align Large Language Models (LLMs) with human values. To consistently control values without sacrificing model performance, we introduce a gated value vector activation method. Experiments show that our method achieves the highest control success rate across 10 basic values without hurting LLM performance and fluency.
- Score: 70.41805604556058
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Aligning Large Language Models (LLMs) with human values has attracted increasing attention since it provides clarity, transparency, and the ability to adapt to evolving scenarios. In this paper, we introduce a Controlled Value Vector Activation (ConVA) method that directly aligns the internal values of LLMs by interpreting how a value is encoded in their latent representations and modifies relevant activations to ensure consistent values in LLMs. To ensure an accurate and unbiased interpretation, we propose a context-controlled value vector identification method. To consistently control values without sacrificing model performance, we introduce a gated value vector activation method for effective and minimum degree of value control. Experiments show that our method achieves the highest control success rate across 10 basic values without hurting LLM performance and fluency, and ensures target values even with opposite and potentially malicious input prompts. Source code and data are available at https://github.com/hr-jin/ConVA.
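The gated activation idea from the abstract can be pictured as a projection-and-threshold operation: measure how strongly a hidden state already expresses the value direction, and shift it only when it falls short. The sketch below is a hypothetical formulation under that reading; the value vector `v`, the gate condition, and the `target` level are illustrative, not the paper's exact method.

```python
import numpy as np

def gated_value_activation(h, v, target):
    """Hypothetical sketch of gated steering: shift hidden state h along
    a unit 'value vector' v only when its projection onto v falls below
    a target activation level (minimal intervention)."""
    v = v / np.linalg.norm(v)           # unit direction for the value
    proj = float(h @ v)                 # current activation along v
    if proj >= target:                  # gate closed: leave h untouched
        return h.copy()
    return h + (target - proj) * v      # smallest shift to hit the target

h = np.array([1.0, 2.0, 0.0])
v = np.array([0.0, 1.0, 0.0])
steered = gated_value_activation(h, v, target=3.0)
```

Because the gate only fires when the projection is below `target`, inputs that already express the value pass through unchanged, which matches the abstract's goal of controlling values without degrading general performance.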
Related papers
- GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. It consistently outperforms both fine-tuning and existing steering baselines.
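The norm-preserving step described in the summary can be sketched as: add a steering direction to a hidden state, then rescale so the vector's norm (its representational scale) is unchanged. The `direction` and strength `alpha` below are illustrative stand-ins; GrAInS derives its direction from gradient-based token attributions, which are not reproduced here.

```python
import numpy as np

def steer_and_renormalize(h, direction, alpha):
    """Sketch of norm-preserving steering: shift a hidden state along a
    steering direction, then rescale so its norm matches the original."""
    original_norm = np.linalg.norm(h)
    steered = h + alpha * direction
    return steered * (original_norm / np.linalg.norm(steered))

h = np.array([3.0, 4.0])                      # norm 5.0
d = np.array([1.0, 0.0])
out = steer_and_renormalize(h, d, alpha=2.0)  # direction changes, norm does not
```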
arXiv Detail & Related papers (2025-07-24T02:34:13Z)
- EAVIT: Efficient and Accurate Human Value Identification from Text data via LLMs [25.093909075330007]
EAVIT is an efficient and accurate framework for human value identification. It combines the strengths of both locally fine-tunable and online black-box LLMs. Our approach effectively reduces the number of input tokens by up to 1/6 compared to directly querying online LLMs.
arXiv Detail & Related papers (2025-05-19T07:24:35Z)
- Iterative Value Function Optimization for Guided Decoding [20.188412650073225]
Guided decoding, especially value-guided methods, offers a cost-effective alternative to Reinforcement Learning from Human Feedback. The accuracy of the value function is crucial for value-guided decoding, as inaccuracies can lead to suboptimal decision-making. Existing methods struggle with accurately estimating the optimal value function, leading to less effective control.
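The value-guided decoding this summary refers to can be sketched as a per-step reranking: score each candidate token by the LM's log-probability plus a weighted value estimate, and pick the best. The `value_fn` and weight `beta` below are hypothetical stand-ins for a learned value function, not the paper's method.

```python
def value_guided_step(candidates, lm_logprob, value_fn, beta):
    """Sketch of value-guided decoding: combine the LM's log-probability
    with a value estimate and select the highest-scoring token."""
    return max(candidates,
               key=lambda tok: lm_logprob[tok] + beta * value_fn(tok))

lm_logprob = {"safe": -1.0, "risky": -0.5}            # the LM slightly prefers "risky"
value_fn = lambda tok: 1.0 if tok == "safe" else -1.0  # toy value estimate
choice = value_guided_step(["safe", "risky"], lm_logprob, value_fn, beta=1.0)
```

The summary's point about value-function accuracy falls out of this picture directly: if `value_fn` is miscalibrated, the reranking steers generation toward the wrong token at every step.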
arXiv Detail & Related papers (2025-03-04T07:49:10Z)
- Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success [100.226572152954]
We present an optimized fine-tuning recipe for vision-language-action models (VLAs). Our recipe boosts OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26×. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot.
arXiv Detail & Related papers (2025-02-27T00:30:29Z)
- Value Compass Benchmarks: A Platform for Fundamental and Validated Evaluation of LLMs Values [76.70893269183684]
As Large Language Models (LLMs) achieve remarkable breakthroughs, aligning their values with humans has become imperative for their responsible development. Evaluations of LLMs' values that fulfill three desirable goals are still lacking.
arXiv Detail & Related papers (2025-01-13T05:53:56Z)
- Activation Scaling for Steering and Interpreting Language Models [55.59689963561315]
We argue that successfully intervening on a model is a prerequisite for interpreting its internal workings.
We establish a three-term objective: a successful intervention should flip the model's prediction from the wrong token to the correct one, and vice versa.
Using gradient-based optimization, this objective lets us learn (and later evaluate) a specific kind of efficient and interpretable intervention.
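The flip criterion in this objective can be made concrete as a check on token rankings before and after the intervention. The indices and logits below are illustrative; the paper optimizes the intervention with gradients, which is not shown here.

```python
import numpy as np

def flips_prediction(logits_before, logits_after, correct, wrong):
    """Sketch of the 'flip' term: the intervention succeeds if it swaps
    the ranking of the correct and wrong tokens."""
    return bool(logits_before[wrong] > logits_before[correct]
                and logits_after[correct] > logits_after[wrong])

before = np.array([1.0, 2.0])   # wrong token (index 1) currently wins
after = np.array([1.5, 0.5])    # after intervention, the correct token wins
```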
arXiv Detail & Related papers (2024-10-07T12:01:32Z)
- Evaluating the Smooth Control of Attribute Intensity in Text Generation with LLMs [36.89780636600556]
Large language models (LLMs) have revolutionized text generation.
We propose metrics to assess the range, calibration, and consistency of the generated text's attribute intensity.
arXiv Detail & Related papers (2024-06-06T19:35:51Z)
- Heterogeneous Value Alignment Evaluation for Large Language Models [91.96728871418]
Advances in Large Language Models (LLMs) have made it crucial to align their values with those of humans.
We propose a Heterogeneous Value Alignment Evaluation (HVAE) system to assess the success of aligning LLMs with heterogeneous values.
arXiv Detail & Related papers (2023-05-26T02:34:20Z)
- The Internal State of an LLM Knows When It's Lying [18.886091925252174]
Large Language Models (LLMs) have shown exceptional performance in various tasks.
One of their most prominent drawbacks is generating inaccurate or false information with a confident tone.
We provide evidence that the LLM's internal state can be used to reveal the truthfulness of statements.
arXiv Detail & Related papers (2023-04-26T02:49:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.