Related papers: Do LLMs "know" internally when they follow instructions?

Do LLMs "know" internally when they follow instructions?

URL: http://arxiv.org/abs/2410.14516v5
Date: Fri, 28 Mar 2025 15:40:49 GMT
Title: Do LLMs "know" internally when they follow instructions?
Authors: Juyeon Heo, Christina Heinze-Deml, Oussama Elachqar, Kwan Ho Ryan Chan, Shirley Ren, Udhay Nallasamy, Andy Miller, Jaya Narain,
Abstract summary: We investigate whether large language models (LLMs) encode information in their representations that correlate with instruction-following success.<n>Our analysis identifies a direction in the input embedding space, termed the instruction-following dimension, that predicts whether a response will comply with a given instruction.<n>We demonstrate that modifying representations along this dimension improves instruction-following success rates compared to random changes.
Score: 7.87370534634794
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Instruction-following is crucial for building AI agents with large language models (LLMs), as these models must adhere strictly to user-provided constraints and guidelines. However, LLMs often fail to follow even simple and clear instructions. To improve instruction-following behavior and prevent undesirable outputs, a deeper understanding of how LLMs' internal states relate to these outcomes is required. In this work, we investigate whether LLMs encode information in their representations that correlate with instruction-following success - a property we term knowing internally. Our analysis identifies a direction in the input embedding space, termed the instruction-following dimension, that predicts whether a response will comply with a given instruction. We find that this dimension generalizes well across unseen tasks but not across unseen instruction types. We demonstrate that modifying representations along this dimension improves instruction-following success rates compared to random changes, without compromising response quality. Further investigation reveals that this dimension is more closely related to the phrasing of prompts rather than the inherent difficulty of the task or instructions. This work provides insight into the internal workings of LLMs' instruction-following, paving the way for reliable LLM agents.

Related papers

Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction [68.6543680065379]
Large language models (LLMs) are vulnerable to prompt injection attacks. We propose a novel defense method that leverages, rather than suppresses, the instruction-following abilities of LLMs.
arXiv Detail & Related papers (2025-04-29T07:13:53Z)
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders [29.356200147371275]
Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. We propose using a fixed vocabulary set for feature interpretations and designing a mutual information-based objective. We propose two runtime steering strategies that adjust the learned feature activations based on their corresponding explanations.
arXiv Detail & Related papers (2025-02-21T16:36:42Z)
Aligning Large Language Models to Follow Instructions and Hallucinate Less via Effective Data Filtering [66.5524727179286]
NOVA is a framework designed to identify high-quality data that aligns well with the learned knowledge to reduce hallucinations. It includes Internal Consistency Probing (ICP) and Semantic Equivalence Identification (SEI) to measure how familiar the LLM is with instruction data. To ensure the quality of selected samples, we introduce an expert-aligned reward model, considering characteristics beyond just familiarity.
arXiv Detail & Related papers (2025-02-11T08:05:56Z)
LLMs can be easily Confused by Instructional Distractions [16.060402139507644]
Large language models show exceptional skill in instruction following tasks. This strength can turn into a vulnerability when the models are required to disregard certain instructions. We introduce a novel benchmark, named DIM-Bench, specifically designed to assess LLMs' performance under instructional distraction.
arXiv Detail & Related papers (2025-02-05T04:52:57Z)
LLM-Generated Black-box Explanations Can Be Adversarially Helpful [16.49758711633611]
Large Language Models (LLMs) help us solve and understand complex problems by acting as digital assistants. Our research uncovers a hidden risk tied to this approach, which we call *adversarial helpfulness*. This happens when an LLM's explanations make a wrong answer look right, potentially leading people to trust incorrect solutions.
arXiv Detail & Related papers (2024-05-10T20:23:46Z)
FLAME: Factuality-Aware Alignment for Large Language Models [86.76336610282401]
The conventional alignment process fails to enhance the factual accuracy of large language models (LLMs) We identify factors that lead to hallucination in both alignment steps: supervised fine-tuning (SFT) and reinforcement learning (RL) We propose factuality-aware alignment, comprised of factuality-aware SFT and factuality-aware RL through direct preference optimization.
arXiv Detail & Related papers (2024-05-02T17:54:54Z)
Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs [60.40396361115776]
This paper introduces a novel collaborative approach, namely SlimPLM, that detects missing knowledge in large language models (LLMs) with a slim proxy model. We employ a proxy model which has far fewer parameters, and take its answers as answers. Heuristic answers are then utilized to predict the knowledge required to answer the user question, as well as the known and unknown knowledge within the LLM.
arXiv Detail & Related papers (2024-02-19T11:11:08Z)
When Do LLMs Need Retrieval Augmentation? Mitigating LLMs' Overconfidence Helps Retrieval Augmentation [66.01754585188739]
Large Language Models (LLMs) have been found to have difficulty knowing they do not possess certain knowledge. Retrieval Augmentation (RA) has been extensively studied to mitigate LLMs' hallucinations. We propose several methods to enhance LLMs' perception of knowledge boundaries and show that they are effective in reducing overconfidence.
arXiv Detail & Related papers (2024-02-18T04:57:19Z)
Why and When LLM-Based Assistants Can Go Wrong: Investigating the Effectiveness of Prompt-Based Interactions for Software Help-Seeking [5.755004576310333]
Large Language Model (LLM) assistants have emerged as potential alternatives to search methods for helping users navigate software. LLM assistants use vast training data from domain-specific texts, software manuals, and code repositories to mimic human-like interactions.
arXiv Detail & Related papers (2024-02-12T19:49:58Z)
LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks [18.068035947969044]
There is considerable confusion about the role of Large Language Models (LLMs) in planning and reasoning tasks. We argue that auto-regressive LLMs cannot, by themselves, do planning or self-verification. We present a vision of bf LLM-Modulo Frameworks that combine the strengths of LLMs with external model-based verifiers.
arXiv Detail & Related papers (2024-02-02T14:43:18Z)
If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code) Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
Making Harmful Behaviors Unlearnable for Large Language Models [50.44915524846857]
Large language models (LLMs) have shown great potential as general-purpose AI assistants in various domains. LLMs can be easily fine-tuned into harmful assistants as the fine-tuning data often contains implicit or explicit harmful content. This paper proposes a controllable training framework that makes harmful behaviors unlearnable during the fine-tuning process.
arXiv Detail & Related papers (2023-11-02T09:18:21Z)
From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning [63.63840740526497]
We investigate how instruction tuning adjusts pre-trained models with a focus on intrinsic changes. The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models. Our findings reveal three significant impacts of instruction tuning.
arXiv Detail & Related papers (2023-09-30T21:16:05Z)
Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection [70.28425745910711]
Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following. This capability brings with it the risk of prompt injection attacks. We evaluate the robustness of instruction-following LLMs against such attacks.
arXiv Detail & Related papers (2023-08-17T06:21:50Z)
Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting [55.15697111170836]
This paper reveals the behaviors of large language models (LLMs) towards textitinductive instructions and enhance their truthfulness and helpfulness accordingly. After extensive human and automatic evaluations, we uncovered a universal vulnerability among LLMs in processing inductive instructions. We identify that different inductive styles affect the models' ability to identify the same underlying errors, and the complexity of the underlying assumptions also influences the model's performance.
arXiv Detail & Related papers (2023-05-23T06:38:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.