But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors
- URL: http://arxiv.org/abs/2505.17760v1
- Date: Fri, 23 May 2025 11:34:02 GMT
- Title: But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors
- Authors: Leon Eshuijs, Archie Chaudhury, Alan McBeth, Ethan Nguyen,
- Abstract summary: We introduce a new framework, Judge Using Safety-Steered Alternatives (JUSSA), which utilizes steering vectors trained on a single sample to elicit more honest responses from models. We find that JUSSA enables LLM judges to better differentiate between dishonest and benign responses, and helps them identify subtle instances of manipulative behavior.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent safety evaluations of Large Language Models (LLMs) show that many models exhibit dishonest behavior, such as sycophancy. However, most honesty benchmarks focus exclusively on factual knowledge or explicitly harmful behavior and rely on external judges, which are often unable to detect less obvious forms of dishonesty. In this work, we introduce a new framework, Judge Using Safety-Steered Alternatives (JUSSA), which utilizes steering vectors trained on a single sample to elicit more honest responses from models, helping LLM-judges in the detection of dishonest behavior. To test our framework, we introduce a new manipulation dataset with prompts specifically designed to elicit deceptive responses. We find that JUSSA enables LLM judges to better differentiate between dishonest and benign responses, and helps them identify subtle instances of manipulative behavior.
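The abstract's central mechanism, a steering vector derived from a single contrastive sample and applied at generation time to produce a more honest alternative for the judge to compare against, can be illustrated with a rough sketch. The snippet below is a minimal illustration assuming a HuggingFace decoder-only causal LM; the model name, layer index, steering scale, and example prompts are placeholder assumptions, not the paper's actual setup.

```python
# Rough sketch of single-sample activation steering (assumed setup, not the paper's exact method).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumption: any decoder-only LM exposing .model.layers
LAYER = 14                                    # assumption: a mid-depth layer
SCALE = 4.0                                   # assumption: steering strength

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def last_token_hidden(text: str, layer: int) -> torch.Tensor:
    """Hidden state of the final token after decoder layer `layer` (hidden_states[0] is the embedding output)."""
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer + 1][0, -1, :]

# One contrastive sample: an honest completion vs. a sycophantic one for the same prompt.
honest = "User: Is my business plan flawless?\nAssistant: No, it carries several risks worth fixing."
sycophantic = "User: Is my business plan flawless?\nAssistant: Yes, it is absolutely flawless!"
steering_vector = last_token_hidden(honest, LAYER) - last_token_hidden(sycophantic, LAYER)

def steer_hook(module, inputs, output):
    """Add the honesty direction to the residual stream at every position."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * steering_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Generate a steered "honest alternative" that an LLM judge could compare with the original response.
prompt = "User: Everyone says my essay is perfect. Do you agree?\nAssistant:"
handle = model.model.layers[LAYER].register_forward_hook(steer_hook)
try:
    ids = tok(prompt, return_tensors="pt").to(model.device)
    steered = model.generate(**ids, max_new_tokens=80, do_sample=False)
finally:
    handle.remove()

print(tok.decode(steered[0], skip_special_tokens=True))
```

In the spirit of the framework, a judge prompt would then present the model's original response alongside this steered alternative, so the judge can use the contrast between the two as evidence of sycophancy or manipulation.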
Related papers
- Hidden in Plain Sight: Probing Implicit Reasoning in Multimodal Language Models [21.698247799954654]
Multimodal large language models (MLLMs) are increasingly deployed in open-ended, real-world environments. This paper presents a systematic analysis of how current MLLMs handle implicit reasoning scenarios. We find that models frequently fail to surface hidden issues, even when they possess the necessary perceptual and reasoning skills.
arXiv Detail & Related papers (2025-05-30T21:47:28Z) - Think Before Refusal: Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior
We demonstrate that prompting safety reflection before generating a response can mitigate false refusal behavior. In an ablation study across 15 pre-trained models, we show that models fine-tuned with safety reflection significantly reduce false refusal behavior.
arXiv Detail & Related papers (2025-03-22T23:35:49Z) - Compromising Honesty and Harmlessness in Language Models via Deception Attacks [0.04499833362998487]
"Deception attacks" customize models to mislead users when prompted on chosen topics while remaining accurate on others.<n>We find that deceptive models also exhibit toxicity, generating hate speech, stereotypes, and other harmful content.
arXiv Detail & Related papers (2025-02-12T11:02:59Z) - JudgeBench: A Benchmark for Evaluating LLM-based Judges [61.048125269475854]
JudgeBench is a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks.
arXiv Detail & Related papers (2024-10-16T17:58:19Z) - BeHonest: Benchmarking Honesty in Large Language Models [23.192389530727713]
We introduce BeHonest, a pioneering benchmark specifically designed to assess honesty in Large Language Models.
BeHonest evaluates three essential aspects of honesty: awareness of knowledge boundaries, avoidance of deceit, and consistency in responses.
Our findings indicate that there is still significant room for improvement in the honesty of LLMs.
arXiv Detail & Related papers (2024-06-19T06:46:59Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Dishonesty in Helpful and Harmless Alignment [26.123327022999124]
Large language models (LLMs) are aligned to human values with reinforcement learning where they get rewards if they satisfy human preference.
We find that this also induces dishonesty in helpful and harmless alignment where LLMs tell lies in generating harmless responses.
arXiv Detail & Related papers (2024-06-04T03:31:09Z) - On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis [127.85293480405082]
The rapid development of large language models (LLMs) has not only provided numerous opportunities but also presented significant challenges.
Existing alignment methods usually direct LLMs toward the favorable outcomes by utilizing human-annotated, flawless instruction-response pairs.
This study proposes a novel alignment technique based on mistake analysis, which deliberately exposes LLMs to erroneous content to learn the reasons for mistakes and how to avoid them.
arXiv Detail & Related papers (2023-10-16T14:59:10Z) - Despite "super-human" performance, current LLMs are unsuited for decisions about ethics and safety [0.0]
We provide a simple new prompting strategy that leads to yet another supposedly "super-human" result.
We find that relying on average performance to judge capabilities can be highly misleading.
We also observe signs of inverse scaling with model size on some examples, and show that prompting models to "explain their reasoning" often leads to alarming justifications of unethical actions.
arXiv Detail & Related papers (2022-12-13T00:29:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.