Non-Linear Inference Time Intervention: Improving LLM Truthfulness
- URL: http://arxiv.org/abs/2403.18680v2
- Date: Thu, 6 Jun 2024 13:58:20 GMT
- Title: Non-Linear Inference Time Intervention: Improving LLM Truthfulness
- Authors: Jakub Hoscilowicz, Adam Wiacek, Jan Chojnacki, Adam Cieslak, Leszek Michon, Vitalii Urbanevych, Artur Janicki
- Abstract summary: We further develop the Inference Time Intervention (ITI) framework, which makes it possible to bias an LLM without fine-tuning.
The improvement comes from introducing non-linear multi-token probing and multi-token intervention.
We report over 16% relative MC1 improvement with respect to the baseline ITI results.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we explore the LLM's internal representation space to identify attention heads that contain the most truthful and accurate information. We further develop the Inference Time Intervention (ITI) framework, which makes it possible to bias an LLM without fine-tuning. The improvement comes from introducing non-linear multi-token probing and multi-token intervention: Non-Linear ITI (NL-ITI), which significantly enhances performance on evaluation benchmarks. NL-ITI is tested on diverse multiple-choice datasets, including TruthfulQA, on which we report over 16% relative improvement in MC1 (the accuracy with which the model selects the correct answer) with respect to the baseline ITI results. Moreover, we achieve a 10% relative improvement over the recently released Truth Forest (TrFf) method, which also focuses on improving ITI.
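To make the probing step concrete, below is a minimal sketch of a non-linear multi-token probe in the spirit of NL-ITI. It assumes head activations have already been extracted, uses a one-hidden-layer MLP as the probe over the concatenation of the last few token positions, and substitutes synthetic tensors for cached model activations; all names and sizes are illustrative rather than taken from the paper.

```python
# Minimal sketch of a non-linear multi-token probe (NL-ITI spirit).
# Assumptions: activations are pre-extracted per attention head, the probe is a
# one-hidden-layer MLP, and random tensors stand in for cached activations.
import torch
import torch.nn as nn

head_dim, n_tokens, n_examples = 128, 4, 1024  # toy sizes


class MLPProbe(nn.Module):
    """Non-linear probe over the concatenated activations of the last n_tokens positions."""

    def __init__(self, head_dim: int, n_tokens: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(head_dim * n_tokens, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # truthful vs. untruthful
        )

    def forward(self, x):  # x: (batch, n_tokens, head_dim)
        return self.net(x.flatten(start_dim=1))


# Synthetic stand-ins for one head's activations and truthfulness labels.
acts = torch.randn(n_examples, n_tokens, head_dim)
labels = torch.randint(0, 2, (n_examples,))

probe = MLPProbe(head_dim, n_tokens)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(20):  # in practice, held-out probing accuracy would rank the heads
    opt.zero_grad()
    loss = loss_fn(probe(acts), labels)
    loss.backward()
    opt.step()

with torch.no_grad():
    acc = (probe(acts).argmax(dim=-1) == labels).float().mean().item()
print(f"probe accuracy (toy data): {acc:.3f}")
```

Heads whose probes best separate truthful from untruthful inputs on held-out data would be the natural candidates for intervention.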
Related papers
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
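One plausible reading of the relabeling idea above, sketched with a hypothetical reward-tag format and bucketing (the paper's actual conditioning scheme may differ):

```python
# Sketch of reward-conditioned relabeling: scored preference pairs become
# SFT examples conditioned on a coarse reward tag. The tag format and the
# 5-way bucketing are illustrative assumptions, not taken from the paper.
def reward_tag(score: float) -> str:
    """Bucket a quality score in [0, 1] into a coarse reward token."""
    return f"<reward_{min(int(score * 5), 4)}>"


def relabel(pairs):
    """Turn scored preference pairs into reward-conditioned SFT examples."""
    examples = []
    for prompt, chosen, rejected, s_chosen, s_rejected in pairs:
        # Both responses are kept; each is conditioned on its own score,
        # so the model learns from the full spectrum of response quality.
        examples.append((f"{reward_tag(s_chosen)} {prompt}", chosen))
        examples.append((f"{reward_tag(s_rejected)} {prompt}", rejected))
    return examples


print(relabel([("Explain ITI.", "ITI shifts activations...", "ITI is a dataset.", 0.9, 0.2)]))
```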
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
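A minimal sketch of uncertainty-aware label smoothing as described above; the linear mapping from per-sample uncertainty to the smoothing value (capped at 0.3) is an assumption made only for illustration:

```python
# Cross-entropy with per-sample label smoothing driven by an uncertainty
# estimate in [0, 1]. The uncertainty-to-smoothing mapping is illustrative.
import torch
import torch.nn.functional as F


def ual_loss(logits, targets, uncertainty, max_smooth: float = 0.3):
    n_classes = logits.size(-1)
    smooth = (uncertainty.clamp(0, 1) * max_smooth).unsqueeze(-1)  # (batch, 1)
    one_hot = F.one_hot(targets, n_classes).float()
    soft_targets = one_hot * (1 - smooth) + smooth / n_classes
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()


logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
uncertainty = torch.rand(4)  # e.g., derived from the model's per-sample entropy
print(ual_loss(logits, targets, uncertainty).item())
```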
- Large Language Models are Inconsistent and Biased Evaluators [2.136983452580014]
We show that Large Language Models (LLMs) are biased evaluators: they exhibit familiarity bias and produce skewed rating distributions.
We also found that LLMs are inconsistent evaluators, showing low "inter-sample" agreement and sensitivity to prompt differences that are insignificant to human understanding of text quality.
arXiv Detail & Related papers (2024-05-02T20:42:28Z)
- Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models [11.563813473794013]
Open-sourced large language models (LLMs) exhibit limited effectiveness in processing Vietnamese.
To mitigate these issues, we have finetuned LLMs specifically for Vietnamese.
Our evaluation results reveal that the fine-tuned LLMs exhibit enhanced comprehension and generative capabilities in Vietnamese.
arXiv Detail & Related papers (2024-03-05T07:13:28Z)
- GRATH: Gradual Self-Truthifying for Large Language Models [63.502835648056305]
GRAdual self-truTHifying (GRATH) is a novel post-processing method to enhance the truthfulness of large language models (LLMs).
GRATH iteratively refines truthfulness data and updates the model, leading to a gradual improvement in model truthfulness in a self-supervised manner.
GRATH achieves state-of-the-art performance on TruthfulQA, with MC1 accuracy of 54.71% and MC2 accuracy of 69.10%, which even surpass those of 70B LLMs.
arXiv Detail & Related papers (2024-01-22T19:00:08Z)
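The summary above only states that data refinement and model updates alternate, so the skeleton below keeps both steps as hypothetical placeholders rather than the authors' actual procedure:

```python
# Skeleton of the iterative refine-then-update loop described above. Both
# helpers are hypothetical stubs; the snippet does not specify how the
# truthfulness pairs are generated or how the update is performed.
def generate_truthfulness_pairs(model, prompts):
    """Hypothetical: have the model produce (prompt, better, worse) answer triples."""
    return [(p, model(p), model(p + " (answer untruthfully)")) for p in prompts]


def preference_update(model, pairs):
    """Hypothetical: one preference-optimization step on the generated pairs."""
    return model  # placeholder: no actual weight update in this sketch


def grath_style_loop(model, prompts, n_rounds: int = 3):
    for _ in range(n_rounds):  # gradual, self-supervised refinement
        pairs = generate_truthfulness_pairs(model, prompts)
        model = preference_update(model, pairs)
    return model


final_model = grath_style_loop(lambda p: f"answer({p})", ["Is the sky green?"])
```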
- BAL: Balancing Diversity and Novelty for Active Learning [53.289700543331925]
We introduce a novel framework, Balancing Active Learning (BAL), which constructs adaptive sub-pools to balance diverse and uncertain data.
Our approach outperforms all established active learning methods on widely recognized benchmarks by 1.20%.
arXiv Detail & Related papers (2023-12-26T08:14:46Z)
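A hedged sketch of one way to balance diversity and uncertainty with sub-pools: cluster the unlabeled pool, then query the most uncertain sample from each cluster. K-means and predictive entropy are illustrative choices here, not BAL's actual construction:

```python
# Toy diversity/uncertainty balance: sub-pools via k-means, then the most
# uncertain sample per sub-pool is queried for labeling.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))               # embeddings of unlabeled data
probs = rng.dirichlet(np.ones(10), size=200)        # model predictions per sample
uncertainty = -(probs * np.log(probs)).sum(axis=1)  # predictive entropy

n_subpools = 8
subpool = KMeans(n_clusters=n_subpools, random_state=0).fit_predict(features)

# From each sub-pool, query the single most uncertain sample.
query = [int(np.arange(200)[subpool == c][uncertainty[subpool == c].argmax()])
         for c in range(n_subpools)]
print(query)
```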
- Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations [111.88727295707454]
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP.
We propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts.
We conduct experiments on pre-trained language models for analysis and evaluation of OOD robustness.
arXiv Detail & Related papers (2023-06-07T17:47:03Z)
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model [61.88942482411035]
We introduce Inference-Time Intervention (ITI), a technique designed to enhance the "truthfulness" of large language models (LLMs).
ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads.
Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.
arXiv Detail & Related papers (2023-06-06T01:26:53Z)
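For comparison with the NL-ITI sketch above, here is a toy version of the baseline linear intervention: a probe-derived direction is added to selected activations via a forward hook at inference time. The hooked module, the direction, and the strength are stand-ins, not the authors' implementation:

```python
# Toy linear intervention: shift an activation along a fixed "truthful"
# direction during the forward pass. All quantities are illustrative.
import torch
import torch.nn as nn

hidden = 64
layer = nn.Linear(hidden, hidden)          # stand-in for one attention head's output
direction = torch.randn(hidden)
direction = direction / direction.norm()   # probe-derived truthful direction (toy)
alpha = 5.0                                # intervention strength


def shift_activation(module, inputs, output):
    # Add alpha * direction to every output; a real implementation would also
    # scale by the activation std along the direction and target specific heads.
    return output + alpha * direction


handle = layer.register_forward_hook(shift_activation)
x = torch.randn(2, hidden)
print(layer(x).shape)  # activations are now shifted along the chosen direction
handle.remove()
```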