FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering
- URL: http://arxiv.org/abs/2504.14492v2
- Date: Sat, 05 Jul 2025 16:08:29 GMT
- Title: FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering
- Authors: Yichen Li, Zhiting Fan, Ruizhe Chen, Xiaotang Gai, Luqi Gong, Yan Zhang, Zuozhu Liu,
- Abstract summary: Large language models (LLMs) are prone to capturing biases from training corpus, leading to potential negative social impacts.<n>We propose FairSteer, a novel inference-time debiasing framework without requiring customized prompt design or model retraining.
- Score: 12.65682270967556
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are prone to capturing biases from training corpus, leading to potential negative social impacts. Existing prompt-based debiasing methods exhibit instability due to their sensitivity to prompt changes, while fine-tuning-based techniques incur substantial computational overhead and catastrophic forgetting. In this paper, we propose FairSteer, a novel inference-time debiasing framework without requiring customized prompt design or model retraining. Motivated by the linear representation hypothesis, our preliminary investigation demonstrates that fairness-related features can be encoded into separable directions in the hidden activation space. FairSteer operates in three steps: biased activation detection, debiasing steering vector (DSV) computation, and dynamic activation steering. Specifically, it first trains a lightweight linear classifier to detect bias signatures in activations, and then computes DSVs as intervention directions derived from small contrastive prompt pairs. Subsequently, it performs debiasing by adjusting activations with DSVs in the inference stage. Comprehensive evaluation with six LLMs demonstrates the superiority of FairSteer across question-answering, counterfactual input evaluation and open-ended text generation tasks. Code will be released.
Related papers
- Refinement Provenance Inference: Detecting LLM-Refined Training Prompts from Model Behavior [58.751981587234916]
This paper formalizes the Refinement Provenance Inference (RPI) audit task as Refinement Provenance Inference (RPI)<n>We propose RePro, a logit-based framework that fuses teacher-forced likelihood features with logit-ranking signals.<n>During training, RePro learns a transferable representation via shadow fine-tuning, and uses a lightweight linear head to infer provenance on unseen victims without training-data access.
arXiv Detail & Related papers (2026-01-05T10:16:41Z) - Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose textbfTACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks.<n>Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and being gradient-free, it incurs significant computational benefits.
arXiv Detail & Related papers (2025-12-02T14:42:54Z) - TRoVe: Discovering Error-Inducing Static Feature Biases in Temporal Vision-Language Models [10.388673049493947]
TRoVe is an automated approach for discovering error-inducing static feature biases.<n>We show that TRoVe can accurately identify error-inducing static feature biases in vision-language models.
arXiv Detail & Related papers (2025-11-30T19:36:46Z) - INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models [2.509305596181814]
Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor.<n>We present textbfINSIGHT, a learning framework for leveraging token-level uncertainty signals to predict when a VLA should request help.
arXiv Detail & Related papers (2025-10-01T19:22:48Z) - Steering When Necessary: Flexible Steering Large Language Models with Backtracking [16.23081952791394]
Large language models (LLMs) have achieved remarkable performance across many generation tasks.<n> Activation steering is an effective and cost-efficient approach that directly modifies the activations of LLMs during the inference stage.<n>We propose the Flexible Activation Steering with Backtracking (FASB) framework, which dynamically determines both the necessity and strength of intervention.
arXiv Detail & Related papers (2025-08-25T03:01:30Z) - Activation Steering for Bias Mitigation: An Interpretable Approach to Safer LLMs [0.5076419064097734]
Large language models (LLMs) are increasingly integrated into societal systems.<n>Traditional methods for mitigating bias often rely on data filtering or post-hoc output moderation.<n>We introduce a complete, end-to-end system that uses techniques from mechanistic interpretability to both identify and actively mitigate bias.
arXiv Detail & Related papers (2025-08-12T15:34:18Z) - GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks.<n>During inference, GrAInS hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale.<n>It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z) - Investigating task-specific prompts and sparse autoencoders for activation monitoring [0.0]
Internal activations of language models encode additional information that could be useful for this.
Recent work has proposed several approaches which may improve on naive linear probing.
We develop and test novel refinements of these methods and compare them against each other.
arXiv Detail & Related papers (2025-04-28T21:28:17Z) - Robust Partial-Label Learning by Leveraging Class Activation Values [0.0]
Real-world training data is often noisy; for example, human annotators assign conflicting class labels to the same instances.
We propose a novel method based on subjective logic, which explicitly represents uncertainty by leveraging the magnitudes of the underlying neural network's class activation values.
We empirically show that our method yields more robust predictions in terms of predictive performance under high noise levels.
arXiv Detail & Related papers (2025-02-17T12:30:05Z) - Efficient and Context-Aware Label Propagation for Zero-/Few-Shot Training-Free Adaptation of Vision-Language Model [41.55165760439727]
Vision-language models (VLMs) have revolutionized machine learning by leveraging large pre-trained models to tackle various downstream tasks.<n>We propose a graph-based approach for label-efficient adaptation and inference.<n>Our method dynamically constructs a graph over text prompts, few-shot examples, and test samples, using label propagation for inference without task-specific tuning.
arXiv Detail & Related papers (2024-12-24T09:15:00Z) - Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors [8.761404991620285]
Activation intervention has emerged as an effective and economical method to modify the behavior of large language models (LLMs)<n>We propose Semantics-Adaptive Dynamic Intervention (SADI), a novel method that constructs a dynamic steering vector to intervene model activations at inference time.<n> Experimental results show that SADI outperforms established baselines by substantial margins, improving task performance without training.
arXiv Detail & Related papers (2024-10-16T06:58:49Z) - Activation Scaling for Steering and Interpreting Language Models [55.59689963561315]
We argue that successfully intervening on a model is a prerequisite for interpreting its internal workings.
We establish a three-term objective: a successful intervention should flip the correct with the wrong token and vice versa.
Using gradient-based optimization, this objective lets us learn (and later evaluate) a specific kind of efficient and interpretable intervention.
arXiv Detail & Related papers (2024-10-07T12:01:32Z) - Get my drift? Catching LLM Task Drift with Activation Deltas [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users.<n>We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set.<n>We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z) - Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the underlying Large Language Models (LLMs) prior to the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - Steering Language Models With Activation Engineering [40.04138190785384]
We introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs.
We achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT.
ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks.
arXiv Detail & Related papers (2023-08-20T12:21:05Z) - Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z) - D-CALM: A Dynamic Clustering-based Active Learning Approach for
Mitigating Bias [13.008323851750442]
In this paper, we propose a novel adaptive clustering-based active learning algorithm, D-CALM, that dynamically adjusts clustering and annotation efforts.
Experiments on eight datasets for a diverse set of text classification tasks, including emotion, hatespeech, dialog act, and book type detection, demonstrate that our proposed algorithm significantly outperforms baseline AL approaches.
arXiv Detail & Related papers (2023-05-26T15:17:43Z) - M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios [58.617025733655005]
We propose a vision-language prompt tuning method with mitigated label bias (M-Tuning)
It introduces open words from the WordNet to extend the range of words forming the prompt texts from only closed-set label words to more, and thus prompts are tuned in a simulated open-set scenario.
Our method achieves the best performance on datasets with various scales, and extensive ablation studies also validate its effectiveness.
arXiv Detail & Related papers (2023-03-09T09:05:47Z) - Constraining Representations Yields Models That Know What They Don't
Know [2.729898906885749]
A well-known failure mode of neural networks is that they may confidently return erroneous predictions.
This work presents a novel direction to address these issues in a broad, general manner.
We assign to each class a unique, fixed, randomly-generated binary vector - hereafter called class code.
We train the model so that its cross-depths activation patterns predict the appropriate class code according to the input sample's class.
arXiv Detail & Related papers (2022-08-30T18:28:00Z) - CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) aims to address this challenge by adapting a model to unlabeled data at test time.
We propose a simple yet effective feature alignment loss, termed as Class-Aware Feature Alignment (CAFA), which simultaneously encourages a model to learn target representations in a class-discriminative manner.
arXiv Detail & Related papers (2022-06-01T03:02:07Z) - WSLRec: Weakly Supervised Learning for Neural Sequential Recommendation
Models [24.455665093145818]
We propose a novel model-agnostic training approach called WSLRec, which adopts a three-stage framework: pre-training, top-$k$ mining, intrinsic and fine-tuning.
WSLRec resolves the incompleteness problem by pre-training models on extra weak supervisions from model-free methods like BR and ItemCF, while resolving the inaccuracy problem by leveraging the top-$k$ mining to screen out reliable user-item relevance from weak supervisions for fine-tuning.
arXiv Detail & Related papers (2022-02-28T08:55:12Z) - Prototypical Classifier for Robust Class-Imbalanced Learning [64.96088324684683]
We propose textitPrototypical, which does not require fitting additional parameters given the embedding network.
Prototypical produces balanced and comparable predictions for all classes even though the training set is class-imbalanced.
We test our method on CIFAR-10LT, CIFAR-100LT and Webvision datasets, observing that Prototypical obtains substaintial improvements compared with state of the arts.
arXiv Detail & Related papers (2021-10-22T01:55:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.