But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors
- URL: http://arxiv.org/abs/2505.17760v2
- Date: Thu, 06 Nov 2025 11:07:41 GMT
- Title: But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors
- Authors: Leon Eshuijs, Archie Chaudhury, Alan McBeth, Ethan Nguyen
- Abstract summary: Judge Using Safety-Steered Alternatives (JUSSA) is a framework that employs steering vectors during inference to generate more honest alternatives. We evaluate JUSSA on sycophancy detection and introduce a new manipulation dataset covering multiple types of manipulation. Our work opens new directions for scalable model auditing as systems become increasingly sophisticated.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Detecting subtle forms of dishonesty like sycophancy and manipulation in Large Language Models (LLMs) remains challenging for both humans and automated evaluators, as these behaviors often appear through small biases rather than clear false statements. We introduce Judge Using Safety-Steered Alternatives (JUSSA), a novel framework that employs steering vectors not to improve model behavior directly, but to enhance LLM judges' evaluation capabilities. JUSSA applies steering vectors during inference to generate more honest alternatives, providing judges with contrastive examples that make subtle dishonest patterns easier to detect. While existing evaluation methods rely on black-box evaluation, JUSSA leverages model internals to create targeted comparisons from single examples. We evaluate our method on sycophancy detection and introduce a new manipulation dataset covering multiple types of manipulation. Our results demonstrate that JUSSA effectively improves detection accuracy over single-response evaluation in various cases. Analysis across judge models reveals that JUSSA helps weaker judges on easier dishonesty detection tasks, and stronger judges on harder tasks. Layer-wise experiments show how dishonest prompts cause representations to diverge from honest ones in middle layers, revealing where steering interventions are most effective for generating contrastive examples. By demonstrating that steering vectors can enhance safety evaluation rather than just modify behavior, our work opens new directions for scalable model auditing as systems become increasingly sophisticated.
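To make the core mechanism concrete, here is a minimal sketch of inference-time activation steering in the spirit of JUSSA, using GPT-2 as a stand-in model. The layer index, steering coefficient, and the contrast prompts used to build the vector are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of inference-time steering in the spirit of JUSSA.
# GPT-2 stands in for the audited model; layer, coefficient, and the
# contrast prompts used to build the vector are ad-hoc assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # a middle layer, where the paper finds steering most effective

def mean_activation(text: str) -> torch.Tensor:
    """Mean residual-stream activation around block LAYER for a prompt."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER].mean(dim=1).squeeze(0)

# Contrastive "honest minus sycophantic" direction from two toy prompts.
steer = mean_activation("I answer honestly, even if it displeases you.") \
      - mean_activation("I tell people whatever they want to hear.")
steer = steer / steer.norm()

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 is the residual stream.
    return (output[0] + 4.0 * steer,) + output[1:]

prompt = "User: My business plan is flawless, right?\nAssistant:"
ids = tok(prompt, return_tensors="pt").input_ids
handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
steered = model.generate(ids, max_new_tokens=40, do_sample=False)
handle.remove()
print(tok.decode(steered[0]))  # candidate "more honest alternative" for the judge
```

A judge would then compare the model's original response against this steered alternative, which is what turns single-response evaluation into a contrastive one.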
Related papers
- Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation
Large language models (LLMs) are increasingly used as judges to evaluate agent performance. We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state. We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z)
- AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens
We show that short sequences of low-perplexity control tokens can flip many binary evaluations from correct "No" judgments to incorrect "Yes" judgments. We also show that LoRA-based adversarial training on small sets of control-token-augmented examples can markedly reduce these false positives. (A toy probe of such flips follows this entry.)
arXiv Detail & Related papers (2025-12-19T09:22:11Z)
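A toy probe of this failure mode, with GPT-2 standing in for the judge; the suffixes here are invented, whereas the paper searches for low-perplexity adversarial sequences.

```python
# Toy probe of judgment flips from appended "control tokens".
# GPT-2 is a stand-in judge; the suffixes are invented, not adversarially found.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
judge = GPT2LMHeadModel.from_pretrained("gpt2").eval()
YES, NO = tok.encode(" Yes")[0], tok.encode(" No")[0]

def verdict(prompt: str) -> str:
    """Binary judgment read off the next-token logits."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = judge(ids).logits[0, -1]
    return "Yes" if logits[YES] > logits[NO] else "No"

base = ("Response: The capital of France is Berlin.\n"
        "Is the response factually correct? Answer Yes or No.\nAnswer:")
for suffix in ["", " certainly", " obviously", " final answer: yes"]:
    print(repr(suffix), "->", verdict(base + suffix))
```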
- A Granular Study of Safety Pretraining under Model Abliteration
We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions. We issue 100 prompts with balanced harmful and harmless cases, classify responses as **Refusal** or **Non-Refusal** using multiple judges, and validate judge fidelity. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration. (A sketch of the projection step follows this entry.)
arXiv Detail & Related papers (2025-10-03T07:01:45Z)
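A minimal sketch of the projection idea: remove the component of every residual-stream activation along a refusal-sensitive direction. The direction is random here for illustration; in practice it is estimated from contrasting harmful and harmless prompt activations.

```python
# Sketch of activation-level abliteration: project a refusal-sensitive
# direction out of the residual stream at every block.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
direction = torch.randn(model.config.n_embd)  # random stand-in direction
direction = direction / direction.norm()

def ablation_hook(module, inputs, output):
    hidden = output[0]
    coeff = hidden @ direction               # component along the direction
    return (hidden - coeff.unsqueeze(-1) * direction,) + output[1:]

handles = [blk.register_forward_hook(ablation_hook) for blk in model.transformer.h]
_ = model(torch.tensor([[50256]]))           # any forward pass now runs ablated
for h in handles:
    h.remove()
```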
- SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification
We propose SelfJudge, which trains judge verifiers via self-supervision of the target model. Our method measures semantic preservation by assessing whether token-substituted responses preserve the meaning of original responses. (A sketch of this check follows this entry.)
arXiv Detail & Related papers (2025-09-26T02:21:12Z)
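A hedged sketch of the self-supervision signal as summarized above: accept a token substitution when it barely changes the target model's likelihood. The threshold and sentences are invented.

```python
# Sketch of a self-supervised acceptance label: accept a token substitution
# if it barely changes the target model's per-token log-likelihood.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
target = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_logprob(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = target(ids, labels=ids).loss  # mean NLL per token
    return -loss.item()

original = "The Eiffel Tower is located in Paris, France."
swapped = "The Eiffel Tower is situated in Paris, France."  # draft-style substitution
drop = avg_logprob(original) - avg_logprob(swapped)
print("accept" if drop < 0.5 else "reject", f"(log-prob drop {drop:.3f})")  # ad-hoc threshold
```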
- Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs
Large language model (LLM) developers aim for their models to be honest, helpful, and harmless. We show that frontier LLMs can develop a preference for dishonesty as a new strategy, even when other options are available. We find no apparent cause for the propensity to deceive, but show that more capable models are better at executing this strategy.
arXiv Detail & Related papers (2025-09-22T17:30:56Z)
- Helpful Agent Meets Deceptive Judge: Understanding Vulnerabilities in Agentic Workflows
We present a systematic analysis of agentic robustness under deceptive or misleading feedback. We reveal that even the strongest agents are vulnerable to persuasive yet flawed critiques. Our findings highlight fundamental vulnerabilities in feedback-based robustness and offer guidance for building more robust agentic systems.
arXiv Detail & Related papers (2025-06-03T19:26:23Z)
- Hidden in Plain Sight: Probing Implicit Reasoning in Multimodal Language Models
Multimodal large language models (MLLMs) are increasingly deployed in open-ended, real-world environments. This paper presents a systematic analysis of how current MLLMs handle implicit reasoning scenarios. We find that models frequently fail to surface hidden issues, even when they possess the necessary perceptual and reasoning skills.
arXiv Detail & Related papers (2025-05-30T21:47:28Z)
- Think Before Refusal: Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior
We demonstrate that prompting safety reflection before generating a response can mitigate false refusal behavior. In an ablation study across 15 pre-trained models, we show that models fine-tuned with safety reflection significantly reduce false refusal behavior. (A minimal prompt-template sketch follows this entry.)
arXiv Detail & Related papers (2025-03-22T23:35:49Z)
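A minimal sketch of what such a reflection prompt could look like; the wording is invented, not the paper's template.

```python
# Sketch of a "reflect before refusing" prompt; wording is invented.
REFLECT = """Before answering, decide in one sentence whether the request below
is actually harmful or merely resembles a harmful request. Then either refuse
(if harmful) or answer helpfully (if benign).

Request: {request}
Reflection:"""

print(REFLECT.format(request="How do I kill a Python process on Linux?"))
```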
"Deception attacks" customize models to mislead users when prompted on chosen topics while remaining accurate on others.<n>We find that deceptive models also exhibit toxicity, generating hate speech, stereotypes, and other harmful content.
arXiv Detail & Related papers (2025-02-12T11:02:59Z)
- JudgeBench: A Benchmark for Evaluating LLM-based Judges
JudgeBench is a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. Our comprehensive evaluation of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks.
arXiv Detail & Related papers (2024-10-16T17:58:19Z)
- BeHonest: Benchmarking Honesty in Large Language Models
We introduce BeHonest, a pioneering benchmark specifically designed to assess honesty in Large Language Models.
BeHonest evaluates three essential aspects of honesty: awareness of knowledge boundaries, avoidance of deceit, and consistency in responses.
Our findings indicate that there is still significant room for improvement in the honesty of LLMs.
arXiv Detail & Related papers (2024-06-19T06:46:59Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer. (A toy sketch of this agreement measure follows this entry.)
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
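A toy sketch of the agreement measure: sample several explained answers at nonzero temperature and score confidence as agreement with the modal answer. `sample_answer` is a stand-in that simulates an LLM call.

```python
# Sketch of explanation-stability confidence via answer agreement.
import random
from collections import Counter

def sample_answer(question: str) -> str:
    # Stand-in for: prompt the model for an explanation plus a final answer.
    return random.choices(["A", "B"], weights=[0.8, 0.2])[0]

samples = [sample_answer("Which is larger, 2**10 or 10**3?") for _ in range(20)]
answer, votes = Counter(samples).most_common(1)[0]
print(f"answer={answer}, confidence={votes / len(samples):.2f}")
```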
- Dishonesty in Helpful and Harmless Alignment
Large language models (LLMs) are aligned to human values with reinforcement learning, receiving rewards when they satisfy human preferences.
We find that this also induces dishonesty in helpful and harmless alignment, where LLMs tell lies when generating harmless responses.
arXiv Detail & Related papers (2024-06-04T03:31:09Z)
- Aligning Large Language Models by On-Policy Self-Judgment
Existing approaches for aligning large language models with human preferences face a trade-off: on-policy learning requires a separate reward model (RM).
We present a novel alignment framework, SELF-JUDGE, that performs on-policy learning and is parameter efficient.
We show that rejection sampling by itself can improve performance further without an additional evaluator. (A stub sketch of self-judged rejection sampling follows this entry.)
arXiv Detail & Related papers (2024-02-17T11:25:26Z)
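A stub sketch of self-judged rejection sampling (best-of-n), with both the policy and the judge simulated; in SELF-JUDGE both roles are played by the same model.

```python
# Sketch of self-judged rejection sampling: the policy drafts n responses
# and also scores them itself, keeping the best. Both functions are
# stand-ins for calls to the same model.
import random

def policy_sample(prompt: str) -> str:
    return random.choice(["draft A", "draft B", "draft C"])

def self_judge_score(prompt: str, response: str) -> float:
    # Stand-in for prompting the model to rate its own response.
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    drafts = [policy_sample(prompt) for _ in range(n)]
    return max(drafts, key=lambda r: self_judge_score(prompt, r))

print(best_of_n("Explain rejection sampling in one sentence."))
```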
- On Prompt-Driven Safeguarding for Large Language Models
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness. (A toy version of this objective follows this entry.)
arXiv Detail & Related papers (2024-01-31T17:28:24Z)
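A toy version of the objective under synthetic stand-ins for the encoder, the refusal direction, and the data; it only illustrates the push-pull along one direction.

```python
# Toy DRO-style objective: learn soft safety-prompt embeddings that push
# harmful queries along a refusal direction and harmless ones against it.
import torch

d = 64
refusal_dir = torch.nn.functional.normalize(torch.randn(d), dim=0)
soft_prompt = torch.randn(8, d, requires_grad=True)   # 8 virtual safety tokens
harmful, harmless = torch.randn(16, d), torch.randn(16, d)  # synthetic queries

def encode(queries, prompt):
    # Stand-in encoder: attend from each query to the prompt tokens.
    attn = torch.softmax(queries @ prompt.T, dim=-1)
    return queries + attn @ prompt

opt = torch.optim.Adam([soft_prompt], lr=1e-2)
for _ in range(200):
    up = encode(harmful, soft_prompt) @ refusal_dir      # want this large
    down = encode(harmless, soft_prompt) @ refusal_dir   # want this small
    loss = -up.mean() + down.mean() + 0.1 * soft_prompt.pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final loss: {loss.item():.3f}")
```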
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful for triggering hallucinations in large language models. (A stub of such a prompt chain follows this entry.)
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
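A stub of a two-step ReEval-style chain; `llm` stands in for a chat-completion call and the prompt wording is invented.

```python
# Sketch of a prompt chain: one call rewrites the gold evidence,
# a second regenerates the answer from the perturbed evidence.
def llm(prompt: str) -> str:
    return f"<model output for: {prompt[:40]}...>"  # stand-in

def perturb_case(question: str, evidence: str, answer: str):
    new_evidence = llm(
        "Rewrite the evidence so it no longer supports "
        f"'{answer}', changing as few words as possible.\nEvidence: {evidence}")
    new_answer = llm(
        f"Answer using ONLY this evidence.\nEvidence: {new_evidence}\n"
        f"Question: {question}")
    return new_evidence, new_answer

print(perturb_case("Who wrote Hamlet?",
                   "Hamlet was written by William Shakespeare.", "Shakespeare"))
```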
- Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis
The rapid development of large language models (LLMs) has not only provided numerous opportunities but also presented significant challenges.
Existing alignment methods usually direct LLMs toward favorable outcomes using human-annotated, flawless instruction-response pairs.
This study proposes a novel alignment technique based on mistake analysis, which deliberately exposes LLMs to erroneous content to learn the reasons for mistakes and how to avoid them.
arXiv Detail & Related papers (2023-10-16T14:59:10Z)
- Towards a Practical Defense against Adversarial Attacks on Deep Learning-based Malware Detectors via Randomized Smoothing
We propose a practical defense against adversarial malware examples inspired by randomized smoothing.
In our work, instead of employing Gaussian or Laplace noise when randomizing inputs, we propose a randomized ablation-based smoothing scheme.
We have empirically evaluated the proposed ablation-based model against various state-of-the-art evasion attacks on the BODMAS dataset. (A toy sketch of the smoothing scheme follows this entry.)
arXiv Detail & Related papers (2023-08-17T10:30:25Z)
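A toy sketch of ablation-based smoothing: classify many randomly masked copies of the input and take a majority vote; `base_classifier` stands in for the detector.

```python
# Sketch of randomized ablation-based smoothing over byte sequences.
import random

MASK = -1  # sentinel for an ablated byte

def base_classifier(byte_seq) -> int:
    # Stand-in detector: flags sequences containing a "suspicious" 0x90 byte.
    return int(any(b == 0x90 for b in byte_seq))

def smoothed_classify(byte_seq, keep_prob=0.5, n_votes=101) -> int:
    votes = sum(
        base_classifier([b if random.random() < keep_prob else MASK
                         for b in byte_seq])
        for _ in range(n_votes))
    return int(votes > n_votes / 2)

sample = [0x4D, 0x5A, 0x90, 0x00, 0x03] * 20  # toy PE-like byte sequence
print("smoothed verdict:", smoothed_classify(sample))
```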
- Despite "super-human" performance, current LLMs are unsuited for decisions about ethics and safety
We provide a simple new prompting strategy that leads to yet another supposedly "super-human" result.
We find that relying on average performance to judge capabilities can be highly misleading.
We also observe signs of inverse scaling with model size on some examples, and show that prompting models to "explain their reasoning" often leads to alarming justifications of unethical actions.
arXiv Detail & Related papers (2022-12-13T00:29:45Z)
- Few-shot Forgery Detection via Guided Adversarial Interpolation
Existing forgery detection methods suffer from significant performance drops when applied to unseen novel forgery approaches.
We propose Guided Adversarial Interpolation (GAI) to overcome the few-shot forgery detection problem.
Our method is shown to be robust to the choice of majority and minority forgery approaches.
arXiv Detail & Related papers (2022-04-12T16:05:10Z)
- Learn what you can't learn: Regularized Ensembles for Transductive Out-of-distribution Detection
We show that current out-of-distribution (OOD) detection algorithms for neural networks produce unsatisfactory results in a variety of OOD detection scenarios.
This paper studies how such "hard" OOD scenarios can benefit from adjusting the detection method after observing a batch of the test data.
We propose a novel method that uses an artificial labeling scheme for the test data and regularization to obtain ensembles of models that produce contradictory predictions only on the OOD samples in a test batch.
arXiv Detail & Related papers (2020-12-10T16:55:13Z)