Guiding LLM to Fool Itself: Automatically Manipulating Machine Reading
Comprehension Shortcut Triggers
- URL: http://arxiv.org/abs/2310.18360v1
- Date: Tue, 24 Oct 2023 12:37:06 GMT
- Title: Guiding LLM to Fool Itself: Automatically Manipulating Machine Reading
Comprehension Shortcut Triggers
- Authors: Mosh Levy, Shauli Ravfogel, Yoav Goldberg
- Abstract summary: Shortcuts, mechanisms triggered by features spuriously correlated to the true label, have emerged as a potential threat to Machine Reading Comprehension (MRC) systems.
We introduce a framework that guides an editor to add potential shortcut triggers to samples.
Using GPT4 as the editor, we find it can successfully edit samples, inserting shortcut triggers that fool LLMs.
- Score: 76.77077447576679
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent applications of LLMs in Machine Reading Comprehension (MRC) systems
have shown impressive results, but the use of shortcuts, mechanisms triggered
by features spuriously correlated to the true label, has emerged as a potential
threat to their reliability. We analyze the problem from two angles: LLMs as
editors, guided to edit text to mislead LLMs; and LLMs as readers, who answer
questions based on the edited text. We introduce a framework that guides an
editor to add potential shortcut triggers to samples. Using GPT4 as the
editor, we find it can successfully edit samples, inserting shortcut triggers
that fool LLMs. Analysing LLMs as readers, we observe that even capable LLMs
can be
deceived using shortcut knowledge. Strikingly, we discover that GPT4 can be
deceived by its own edits (15% drop in F1). Our findings highlight inherent
vulnerabilities of LLMs to shortcut manipulations. We publish ShortcutQA, a
curated dataset generated by our framework for future research.
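The abstract describes the editor/reader setup only at a high level. Below is a minimal sketch of that loop, together with the token-level F1 used to quantify the reported drop, assuming a generic, hypothetical call_llm wrapper and illustrative prompts; the actual prompts, trigger types, and ShortcutQA construction details are those described in the paper, not this sketch.

```python
# Minimal sketch of the editor/reader evaluation loop: an editor LLM inserts
# a shortcut trigger into a passage, a reader LLM answers from the edited
# passage, and token-level F1 measures how much the edit hurts the reader.
# `call_llm` is a hypothetical wrapper around whatever chat API is used.
from collections import Counter

def call_llm(prompt: str, model: str) -> str:
    """Hypothetical helper: send `prompt` to `model`, return its text reply."""
    raise NotImplementedError("plug in your LLM client here")

def edit_with_trigger(context: str, question: str, gold_answer: str) -> str:
    # Editor step: ask the model to insert a misleading shortcut trigger
    # (e.g. question words repeated near a wrong entity) while keeping the
    # true answer unchanged. The prompt wording here is illustrative only.
    editor_prompt = (
        "Rewrite the passage so it contains a misleading shortcut trigger, "
        "for example question words repeated near a wrong entity, but keep "
        f"the true answer '{gold_answer}' unchanged.\n\n"
        f"Question: {question}\nPassage: {context}"
    )
    return call_llm(editor_prompt, model="editor")

def read(context: str, question: str) -> str:
    # Reader step: answer the question from the (possibly edited) passage.
    reader_prompt = (
        f"Answer using only the passage.\nPassage: {context}\nQuestion: {question}"
    )
    return call_llm(reader_prompt, model="reader")

def token_f1(pred: str, gold: str) -> float:
    # Standard SQuAD-style token-level F1 between prediction and gold answer.
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```

Comparing token_f1 of the reader's answers on the original and the edited passages over a dataset would approximate the kind of F1 drop the abstract reports.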
Related papers
- Do LLMs Overcome Shortcut Learning? An Evaluation of Shortcut Challenges in Large Language Models [9.854718405054589]
Large Language Models (LLMs) have shown remarkable capabilities in various natural language processing tasks.
This paper presents Shortcut Suite, a test suite designed to evaluate the impact of shortcuts on LLMs' performance.
arXiv Detail & Related papers (2024-10-17T08:52:52Z)
- LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints [86.59857711385833]
We introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions.
To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline (a rough sketch of such a loop appears after this list).
Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback.
arXiv Detail & Related papers (2024-10-09T01:25:10Z)
- Large Language Models as Carriers of Hidden Messages [0.0]
Simple fine-tuning can embed hidden text into large language models (LLMs), which is revealed only when triggered by a specific query.
Our work demonstrates that embedding hidden text via fine-tuning, although seemingly secure due to the vast number of potential triggers, is vulnerable to extraction.
We introduce an extraction attack called Unconditional Token Forcing (UTF), which iteratively feeds tokens from the LLM's vocabulary to reveal sequences with high token probabilities, indicating hidden text candidates (a simplified sketch of this idea appears after this list).
arXiv Detail & Related papers (2024-06-04T16:49:06Z)
- Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users.
We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set (a minimal probe sketch appears after this list).
We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
- Aligning LLMs for FL-free Program Repair [14.935596175148586]
This paper investigates a new approach to adapt large language models (LLMs) to program repair.
Our core insight is that LLMs' capability at automated program repair (APR) can be greatly improved by simply aligning the output to their training objective.
Based on this insight, we designed D4C, a straightforward prompting framework for APR.
arXiv Detail & Related papers (2024-04-13T02:36:40Z)
- DELL: Generating Reactions and Explanations for LLM-Based Misinformation Detection [50.805599761583444]
Large language models are limited by challenges in factuality and hallucinations, which prevent them from being directly employed off-the-shelf for judging the veracity of news articles.
We propose DELL, which identifies three key stages in misinformation detection where LLMs can be incorporated into the pipeline.
arXiv Detail & Related papers (2024-02-16T03:24:56Z)
- Why and When LLM-Based Assistants Can Go Wrong: Investigating the Effectiveness of Prompt-Based Interactions for Software Help-Seeking [5.755004576310333]
Large Language Model (LLM) assistants have emerged as potential alternatives to search methods for helping users navigate software.
LLM assistants use vast training data from domain-specific texts, software manuals, and code repositories to mimic human-like interactions.
arXiv Detail & Related papers (2024-02-12T19:49:58Z)
- LLatrieval: LLM-Verified Retrieval for Verifiable Generation [67.93134176912477]
Verifiable generation aims to let the large language model (LLM) generate text with supporting documents.
We propose LLatrieval (Large Language Model Verified Retrieval), where the LLM updates the retrieval result until it verifies that the retrieved documents can sufficiently support answering the question.
Experiments show that LLatrieval significantly outperforms extensive baselines and achieves state-of-the-art results (a sketch of such a verify-and-re-retrieve loop appears after this list).
arXiv Detail & Related papers (2023-11-14T01:38:02Z)
- Open-Source LLMs for Text Annotation: A Practical Guide for Model Setting and Fine-Tuning [5.822010906632045]
This paper studies the performance of open-source Large Language Models (LLMs) in text classification tasks typical for political science research.
By examining tasks like stance, topic, and relevance classification, we aim to guide scholars in making informed decisions about their use of LLMs for text analysis.
arXiv Detail & Related papers (2023-07-05T10:15:07Z)
- Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes an LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)
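For the DeCRIM entry above, a rough sketch of a decompose-critique-refine self-correction loop, assuming a hypothetical call_llm helper; the paper's actual prompts, critic, and stopping rule may differ.

```python
# Hedged sketch of a decompose-critique-refine self-correction loop
# (cf. the DeCRIM entry). `call_llm` is a hypothetical LLM client.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def decrim_style(instruction: str, max_rounds: int = 3) -> str:
    # Decompose: list the individual constraints in the instruction.
    constraints = call_llm(
        "List each constraint in this instruction, one per line:\n" + instruction
    ).splitlines()
    response = call_llm(instruction)
    for _ in range(max_rounds):
        # Critique: check every constraint against the current response.
        verdicts = [
            call_llm(f"Constraint: {c}\nResponse: {response}\n"
                     "Reply SATISFIED or describe the violation.")
            for c in constraints
        ]
        failed = [c for c, v in zip(constraints, verdicts)
                  if not v.strip().upper().startswith("SATISFIED")]
        if not failed:
            break
        # Refine: revise the response to fix the flagged constraints.
        response = call_llm(
            f"Instruction: {instruction}\nCurrent response: {response}\n"
            f"Revise the response so it also satisfies: {failed}"
        )
    return response
```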
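For the hidden-messages entry above, a simplified sketch of the unconditional-token-forcing idea using Hugging Face transformers; the model name, probability threshold, and decoding length are assumptions, and the full-vocabulary loop is deliberately naive and slow.

```python
# Simplified sketch of unconditional token forcing: force each vocabulary
# token as the sole prompt and flag unusually confident continuations,
# which hint at memorized hidden text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the suspected carrier model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

candidates = []
with torch.no_grad():
    for token_id in range(tok.vocab_size):
        # Feed a single vocabulary token as the entire prompt ("unconditional").
        input_ids = torch.tensor([[token_id]])
        logits = model(input_ids).logits[0, -1]
        top_prob = torch.softmax(logits, dim=-1).max().item()
        # An unusually confident next token suggests a memorized sequence;
        # greedily decode a short continuation to inspect the candidate.
        if top_prob > 0.9:  # assumed threshold
            out = model.generate(input_ids, max_new_tokens=20, do_sample=False,
                                 pad_token_id=tok.eos_token_id)
            candidates.append((top_prob, tok.decode(out[0])))

candidates.sort(reverse=True)
print(candidates[:10])
```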
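For the task-drift entry above, a minimal sketch of the linear-probe idea: given activation features extracted from the LLM before and after it processes external text (feature extraction not shown), an ordinary logistic regression serves as the drift detector. The file names, feature shapes, and split below are assumptions.

```python
# Minimal linear-probe sketch for task-drift detection.
# Assumes activation deltas have already been extracted and saved; the
# paper's exact features and evaluation protocol may differ.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# X: one activation-delta vector per example (n_samples, hidden_dim);
# y: 1 if the example contains an injected/drifting instruction, else 0.
X = np.load("activation_deltas.npy")  # placeholder file names
y = np.load("drift_labels.npy")

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = probe.predict_proba(X_te)[:, 1]
print("ROC AUC:", roc_auc_score(y_te, scores))
```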
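For the LLatrieval entry above, a hedged sketch of a verify-then-re-retrieve loop; retrieve and call_llm are hypothetical helpers, and the actual verification prompts and retrieval-update strategy are those described in that paper.

```python
# Hedged sketch of a verify-then-re-retrieve loop (cf. the LLatrieval entry).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def retrieve(query: str, k: int = 5) -> list[str]:
    raise NotImplementedError("plug in your retriever here")

def verified_answer(question: str, max_rounds: int = 3) -> str:
    docs = retrieve(question)
    for _ in range(max_rounds):
        # Verification step: can the documents fully support an answer?
        verdict = call_llm(
            "Can the question be fully answered from these documents? "
            "Reply YES, or suggest a better search query.\n"
            f"Question: {question}\nDocuments: {docs}"
        )
        if verdict.strip().upper().startswith("YES"):
            break
        # Update step: use the LLM's suggestion to refresh the retrieval.
        docs = retrieve(verdict)
    return call_llm(
        "Answer the question, citing the supporting documents.\n"
        f"Question: {question}\nDocuments: {docs}"
    )
```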