Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
- URL: http://arxiv.org/abs/2306.03341v6
- Date: Wed, 26 Jun 2024 14:11:53 GMT
- Title: Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
- Authors: Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg
- Abstract summary: We introduce Inference-Time Intervention (ITI), a technique designed to enhance the "truthfulness" of large language models (LLMs).
ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads.
Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Inference-Time Intervention (ITI), a technique designed to enhance the "truthfulness" of large language models (LLMs). ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On an instruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from 32.5% to 65.1%. We identify a tradeoff between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. ITI is minimally invasive and computationally inexpensive. Moreover, the technique is data efficient: while approaches like RLHF require extensive annotations, ITI locates truthful directions using only a few hundred examples. Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.
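As an illustration of the mechanism described above, here is a minimal PyTorch sketch of inference-time activation shifting, assuming a Hugging Face LLaMA-style layer layout. The layer/head indices, the random stand-in direction vectors, and the ALPHA strength are hypothetical placeholders rather than the paper's probe-derived values, and shifting the attention block's output is a simplification of the paper's per-head intervention.

```python
import torch

# Hypothetical setup: in the paper's recipe, directions come from linear
# probes trained on a few hundred labeled examples; here they are random
# unit vectors keyed by (layer, head), purely for illustration.
HEAD_DIM = 128                      # per-head dim for a LLaMA-7B-like model
N_HEADS = 32
ALPHA = 15.0                        # illustrative strength (tunes the
                                    # truthfulness/helpfulness tradeoff)
directions = {(12, 3): torch.randn(HEAD_DIM), (14, 7): torch.randn(HEAD_DIM)}
directions = {k: v / v.norm() for k, v in directions.items()}

def make_hook(layer_idx: int):
    """Forward hook that adds ALPHA * direction to chosen heads' slice of
    the attention output (simplified: the paper intervenes on per-head
    activations before the output projection)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        b, t, d = hidden.shape
        heads = hidden.view(b, t, N_HEADS, d // N_HEADS).clone()
        for (layer, head), v in directions.items():
            if layer == layer_idx:
                heads[:, :, head, :] += ALPHA * v.to(heads.dtype)
        shifted = heads.view(b, t, d)
        if isinstance(output, tuple):
            return (shifted,) + output[1:]
        return shifted
    return hook

# Usage sketch (model loading elided; layer path assumes a HF LLaMA layout):
# for i, block in enumerate(model.model.layers):
#     block.self_attn.register_forward_hook(make_hook(i))
```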
Related papers
- Enhanced Language Model Truthfulness with Learnable Intervention and Uncertainty Expression [19.69104070561701]
Large language models (LLMs) can generate long-form and coherent text, yet they often hallucinate facts.
We propose LITO, a Learnable Intervention method for Truthfulness Optimization.
Experiments on multiple LLMs and question-answering datasets demonstrate that LITO improves truthfulness while preserving task accuracy.
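The summary above does not spell out LITO's mechanism; a hedged reading is that the intervention intensity is chosen per input rather than fixed globally. A toy sketch under that assumption, with all interfaces hypothetical:

```python
from typing import Callable, List

def lito_style_answer(
    generate: Callable[[str, float], str],    # hypothetical: (prompt, strength) -> answer
    confidence: Callable[[str, str], float],  # hypothetical learned truthfulness scorer
    prompt: str,
    strengths: List[float] = [0.0, 5.0, 10.0, 15.0],
    threshold: float = 0.5,
) -> str:
    """Try a ladder of intervention intensities and keep the answer the
    learned scorer trusts most; abstain if none clears the threshold."""
    candidates = [(confidence(prompt, a), a)
                  for a in (generate(prompt, s) for s in strengths)]
    best_score, best_answer = max(candidates)
    return best_answer if best_score >= threshold else "I don't know."
```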
arXiv Detail & Related papers (2024-05-01T03:50:09Z) - LLM In-Context Recall is Prompt Dependent [0.0]
A model's ability to recall information supplied in its prompt significantly influences its practical efficacy and dependability in real-world applications.
This study demonstrates that an LLM's recall capability is not only contingent upon the prompt's content but also may be compromised by biases in its training data.
arXiv Detail & Related papers (2024-04-13T01:13:59Z) - Test-Time Zero-Shot Temporal Action Localization [58.84919541314969]
Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions unseen during training in untrimmed videos.
Training-based ZS-TAL approaches assume the availability of labeled data for supervised learning.
We introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL).
arXiv Detail & Related papers (2024-04-08T11:54:49Z) - Non-Linear Inference Time Intervention: Improving LLM Truthfulness [0.0]
We further develop the Inference-Time Intervention (ITI) framework, which makes it possible to bias an LLM toward truthfulness without fine-tuning.
The improvement comes from introducing non-linear multi-token probing and multi-token intervention.
We report over 16% relative MC1 improvement over the baseline ITI results.
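As a rough illustration of what "non-linear multi-token probing" could look like, the sketch below contrasts a linear single-vector probe with a small MLP applied to the stacked activations of several tokens; all shapes and sizes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Probe truthfulness from the activations of the last few generated tokens
# instead of a single token, and do it non-linearly (shapes illustrative).
N_TOKENS, HEAD_DIM = 4, 128

linear_probe = nn.Linear(N_TOKENS * HEAD_DIM, 1)   # baseline ITI-style probe
nonlinear_probe = nn.Sequential(                   # non-linear alternative
    nn.Linear(N_TOKENS * HEAD_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)

acts = torch.randn(8, N_TOKENS, HEAD_DIM)          # (batch, tokens, head dim)
logits = nonlinear_probe(acts.flatten(1))          # truthfulness logits
print(logits.shape)                                # torch.Size([8, 1])
```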
arXiv Detail & Related papers (2024-03-27T15:22:16Z) - Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension [63.330262740414646]
We study how to characterize and predict the truthfulness of texts generated from large language models (LLMs).
We suggest investigating internal activations and quantifying an LLM's truthfulness using the local intrinsic dimension (LID) of model activations.
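The local intrinsic dimension of activation vectors can be estimated with the standard Levina-Bickel maximum-likelihood estimator over nearest-neighbor distances; treating this as the paper's exact procedure is an assumption, and the data below are random stand-ins for real activations.

```python
import numpy as np

def lid_mle(x: np.ndarray, reference: np.ndarray, k: int = 20) -> float:
    """Levina-Bickel MLE of the local intrinsic dimension at point `x`,
    computed from its k nearest neighbors in a reference set."""
    dists = np.sort(np.linalg.norm(reference - x, axis=1))
    dists = dists[dists > 0][:k]                 # k nearest non-identical points
    return -1.0 / np.mean(np.log(dists[:-1] / dists[-1]))

# Toy usage: one activation vector against a pool of activation vectors.
rng = np.random.default_rng(0)
pool = rng.normal(size=(1000, 64))               # hypothetical activation pool
print(lid_mle(pool[0], pool[1:], k=20))
```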
arXiv Detail & Related papers (2024-02-28T04:56:21Z) - C-ICL: Contrastive In-context Learning for Information Extraction [54.39470114243744]
c-ICL is a novel few-shot technique that leverages both correct and incorrect sample constructions to create in-context learning demonstrations.
Our experiments on various datasets indicate that c-ICL outperforms previous few-shot in-context learning methods.
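A minimal sketch of how a prompt interleaving correct and incorrect demonstrations might be assembled; the demonstration format and the triple notation are illustrative assumptions, not the paper's template.

```python
from typing import List, Tuple

def build_contrastive_prompt(
    correct: List[Tuple[str, str]],    # (sentence, gold extraction) pairs
    incorrect: List[Tuple[str, str]],  # (sentence, wrong extraction) pairs
    query: str,
) -> str:
    """Interleave positive and negative demonstrations so the model sees
    both what to extract and which mistakes to avoid."""
    parts = []
    for text, gold in correct:
        parts.append(f"Text: {text}\nCorrect extraction: {gold}")
    for text, wrong in incorrect:
        parts.append(f"Text: {text}\nIncorrect extraction (avoid): {wrong}")
    parts.append(f"Text: {query}\nCorrect extraction:")
    return "\n\n".join(parts)

print(build_contrastive_prompt(
    [("Alice joined Acme in 2020.", "(Alice, employer, Acme)")],
    [("Bob visited Paris.", "(Bob, employer, Paris)")],
    "Carol works for Initech.",
))
```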
arXiv Detail & Related papers (2024-02-17T11:28:08Z) - InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
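A hedged sketch of cross-model guidance via activation steering: a "harmfulness" direction is extracted from a guidance model's activations and subtracted from the target model's hidden states when they project onto it. The difference-of-means extraction, the thresholding rule, and all shapes are assumptions for illustration.

```python
import torch

def safety_steering_vector(harmful_acts: torch.Tensor,
                           harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between harmful and harmless prompt
    activations, taken from an aligned guidance model (shapes illustrative)."""
    return harmful_acts.mean(0) - harmless_acts.mean(0)

def guided_forward(hidden: torch.Tensor, direction: torch.Tensor,
                   beta: float = 4.0, threshold: float = 0.0) -> torch.Tensor:
    """Steer the target model away from harmful regions: where a hidden
    state projects onto the harmfulness direction, subtract a scaled copy."""
    unit = direction / direction.norm()
    score = hidden @ unit                        # per-token harmfulness score
    mask = (score > threshold).float().unsqueeze(-1)
    return hidden - beta * mask * unit

# Toy usage with random stand-ins for real activations.
d = 64
harmful, harmless = torch.randn(32, d) + 0.5, torch.randn(32, d)
v = safety_steering_vector(harmful, harmless)
print(guided_forward(torch.randn(5, d), v).shape)   # torch.Size([5, 64])
```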
arXiv Detail & Related papers (2024-01-20T10:41:03Z) - Ladder-of-Thought: Using Knowledge as Steps to Elevate Stance Detection [73.31406286956535]
We introduce the Ladder-of-Thought (LoT) for the stance detection task.
LoT directs the small LMs to assimilate high-quality external knowledge, refining the intermediate rationales produced.
Our empirical evaluations underscore LoT's efficacy, marking a 16% improvement over GPT-3.5 and a 10% enhancement compared to GPT-3.5 with CoT on the stance detection task.
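A speculative two-stage sketch of the ladder idea: retrieve external knowledge, let a small LM refine a rationale with it, then condition the stance prediction on that rationale. Both callables are hypothetical interfaces, not the paper's API.

```python
from typing import Callable, List

def ladder_of_thought_stance(
    lm: Callable[[str], str],               # hypothetical small-LM interface
    retrieve: Callable[[str], List[str]],   # hypothetical knowledge retriever
    text: str,
    target: str,
) -> str:
    """Stage 1: refine a rationale with retrieved external knowledge.
    Stage 2: predict the stance conditioned on that rationale."""
    knowledge = "\n".join(retrieve(f"{text} {target}"))
    rationale = lm(
        f"Facts:\n{knowledge}\n\nText: {text}\n"
        f"Explain the author's attitude toward '{target}':"
    )
    return lm(
        f"Text: {text}\nRationale: {rationale}\n"
        f"Stance toward '{target}' (favor/against/neutral):"
    )
```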
arXiv Detail & Related papers (2023-08-31T14:31:48Z)