Enhancing Large Language Models Against Inductive Instructions with
Dual-critique Prompting
- URL: http://arxiv.org/abs/2305.13733v2
- Date: Thu, 7 Mar 2024 03:11:47 GMT
- Title: Enhancing Large Language Models Against Inductive Instructions with
Dual-critique Prompting
- Authors: Rui Wang, Hongru Wang, Fei Mi, Yi Chen, Boyang Xue, Kam-Fai Wong,
Ruifeng Xu
- Abstract summary: This paper reveals the behaviors of large language models (LLMs) towards inductive instructions and enhances their truthfulness and helpfulness accordingly.
After extensive human and automatic evaluations, we uncovered a universal vulnerability among LLMs in processing inductive instructions.
We find that different inductive styles affect the models' ability to identify the same underlying errors, and that the complexity of the underlying assumptions also influences model performance.
- Score: 55.15697111170836
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Numerous works have been proposed to align large language models (LLMs) with
human intents to better fulfill instructions, ensuring they are truthful and helpful.
Nevertheless, some human instructions are malicious or misleading, and
following them leads to untruthful and unsafe responses. Previous work has
rarely focused on understanding how LLMs handle instructions based on
counterfactual premises, referred to here as \textit{inductive instructions},
which may stem from users' false beliefs or malicious intents. In this paper,
we aim to reveal the behaviors of LLMs towards \textit{inductive instructions}
and enhance their truthfulness and helpfulness accordingly. Specifically, we
first introduce a benchmark of \underline{\textbf{Indu}}ctive
In\underline{\textbf{st}}ructions (\textsc{\textbf{INDust}}), where false
knowledge is incorporated into the instructions in multiple styles. After
extensive human and automatic evaluations, we uncovered a universal
vulnerability among LLMs in processing inductive instructions. Additionally, we
found that different inductive styles affect the models' ability to
identify the same underlying errors, and that the complexity of the underlying
assumptions also influences the models' performance. Motivated by these
results, we propose \textsc{Dual-critique} prompting to improve LLM robustness
against inductive instructions. Our experiments demonstrate that
\textsc{Dual-critique} prompting significantly bolsters the robustness of a
diverse array of LLMs, even when confronted with varying degrees of inductive
instruction complexity and differing inductive styles.
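
The abstract names \textsc{Dual-critique} prompting but does not spell out the prompt format. Below is a minimal Python sketch of one way such a prompt could be assembled, assuming the two critiques target (1) the user's instruction and (2) the model's own draft answer; the wording, the helper name build_dual_critique_prompt, and the example instruction are illustrative only and not taken from the paper.

# Minimal sketch of a dual-critique-style prompt wrapper (illustrative wording;
# the paper's actual templates are not given in the abstract).
def build_dual_critique_prompt(instruction: str) -> str:
    """Wrap a possibly inductive instruction in two explicit critique steps."""
    return (
        "You will be given a user instruction that may rest on a false premise.\n\n"
        f"Instruction: {instruction}\n\n"
        "Step 1 (critique the instruction): state whether it contains factual errors "
        "or misleading assumptions, and explain them.\n"
        "Step 2 (critique your response): draft an answer, check that it does not "
        "repeat or endorse the errors found in Step 1, then give the corrected final answer."
    )

if __name__ == "__main__":
    # An instruction built on a counterfactual premise (a common false belief).
    inductive_instruction = (
        "Since the Great Wall of China is visible from the Moon, "
        "explain how astronauts use it for navigation."
    )
    print(build_dual_critique_prompt(inductive_instruction))

In this sketch, the wrapped prompt would be sent to an instruction-following LLM in place of the raw instruction, so the model critiques the premise before committing to an answer.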
Related papers
- An Incomplete Loop: Deductive, Inductive, and Abductive Learning in Large Language Models [99.31449616860291]
Modern language models (LMs) can learn to perform new tasks in different ways.
In instruction following, the target task is described explicitly in natural language; in few-shot prompting, the task is specified implicitly.
In instruction inference, LMs are presented with in-context examples and are then prompted to generate a natural language task description.
arXiv Detail & Related papers (2024-04-03T19:31:56Z)
- RoCoIns: Enhancing Robustness of Large Language Models through Code-Style Instructions [43.19966425619236]
We utilize code-style instructions, which are more structured and less ambiguous, in place of typical natural language instructions.
Under few-shot scenarios, we propose a novel method to compose in-context demonstrations using both clean and adversarial samples.
Experiments on eight robustness datasets show that our method consistently outperforms prompting LLMs with natural language instructions.
arXiv Detail & Related papers (2024-02-26T09:30:55Z)
- Contrastive Instruction Tuning [61.97704869248903]
We propose Contrastive Instruction Tuning (CoIN) to maximize the similarity between semantically equivalent instruction-instance pairs.
Experiments on the PromptBench benchmark show that CoIN consistently improves LLMs' robustness to unseen instructions with variations across character, word, sentence, and semantic levels by an average of +2.5% in accuracy.
arXiv Detail & Related papers (2024-02-17T00:09:32Z)
- Auto-Instruct: Automatic Instruction Generation and Ranking for Black-Box Language Models [91.02730155418699]
Large language models (LLMs) can perform a wide range of tasks by following natural language instructions.
We introduce Auto-Instruct, a novel method to automatically improve the quality of instructions provided to LLMs.
In experiments on 118 out-of-domain tasks, Auto-Instruct surpasses both human-written instructions and existing baselines of LLM-generated instructions.
arXiv Detail & Related papers (2023-10-19T19:52:55Z)
- From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning [63.63840740526497]
We investigate how instruction tuning adjusts pre-trained models with a focus on intrinsic changes.
The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models.
Our findings reveal three significant impacts of instruction tuning.
arXiv Detail & Related papers (2023-09-30T21:16:05Z)
- Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection [70.28425745910711]
Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following.
This capability brings with it the risk of prompt injection attacks.
We evaluate the robustness of instruction-following LLMs against such attacks.
arXiv Detail & Related papers (2023-08-17T06:21:50Z)
- Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models [28.37026309925163]
Large language models (LLMs) are designed to align with human values and generate safe text.
Previous benchmarks for jailbreaking LLMs have primarily focused on evaluating the safety of the models.
This paper assesses both the safety and robustness of LLMs, emphasizing the need for a balanced approach.
arXiv Detail & Related papers (2023-07-17T13:49:52Z)