Eliciting Language Model Behaviors with Investigator Agents
- URL: http://arxiv.org/abs/2502.01236v1
- Date: Mon, 03 Feb 2025 10:52:44 GMT
- Title: Eliciting Language Model Behaviors with Investigator Agents
- Authors: Xiang Lisa Li, Neil Chowdhury, Daniel D. Johnson, Tatsunori Hashimoto, Percy Liang, Sarah Schwettmann, Jacob Steinhardt
- Abstract summary: Language models exhibit complex, diverse behaviors when prompted with free-form text.
We study the problem of behavior elicitation, where the goal is to search for prompts that induce specific target behaviors.
We train investigator models to map randomly-chosen target behaviors to a diverse distribution of outputs that elicit them.
- Score: 93.34072434845162
- License:
- Abstract: Language models exhibit complex, diverse behaviors when prompted with free-form text, making it difficult to characterize the space of possible outputs. We study the problem of behavior elicitation, where the goal is to search for prompts that induce specific target behaviors (e.g., hallucinations or harmful responses) from a target language model. To navigate the exponentially large space of possible prompts, we train investigator models to map randomly-chosen target behaviors to a diverse distribution of outputs that elicit them, similar to amortized Bayesian inference. We do this through supervised fine-tuning, reinforcement learning via DPO, and a novel Frank-Wolfe training objective to iteratively discover diverse prompting strategies. Our investigator models surface a variety of effective and human-interpretable prompts leading to jailbreaks, hallucinations, and open-ended aberrant behaviors, obtaining a 100% attack success rate on a subset of AdvBench (Harmful Behaviors) and an 85% hallucination rate.
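The abstract names three training stages (supervised fine-tuning, DPO, and a Frank-Wolfe objective) without implementation detail. Below is a minimal, hypothetical sketch of just the supervised fine-tuning stage, assuming a HuggingFace-style causal LM and an illustrative dataset of (behavior, prompt) pairs; the model name and data format are assumptions, not the paper's setup. The investigator learns to generate an eliciting prompt conditioned on the target behavior, inverting the usual prompt-to-output direction.

```python
# Sketch of the SFT stage only (illustrative; not the authors' code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")      # stand-in investigator LM
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical (behavior, prompt) pairs: the behavior is a target output,
# the prompt is text known to elicit it from the target model.
pairs = [
    {"behavior": "The moon is made of cheese.",
     "prompt": "Complete this well-known myth in one sentence:"},
    # ... more pairs, e.g. mined by search or sampling ...
]

optim = torch.optim.AdamW(model.parameters(), lr=1e-5)
for pair in pairs:
    # Condition on the behavior; supervise only the prompt tokens.
    ctx = tok(f"Behavior: {pair['behavior']}\nPrompt:", return_tensors="pt")
    full = tok(f"Behavior: {pair['behavior']}\nPrompt: {pair['prompt']}",
               return_tensors="pt")
    labels = full.input_ids.clone()
    labels[:, : ctx.input_ids.shape[1]] = -100   # ignore conditioning tokens
    loss = model(**full, labels=labels).loss
    loss.backward()
    optim.step()
    optim.zero_grad()
```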
Related papers
- Steganography in Game Actions [8.095373104009868] (arXiv, 2024-12-11)
This study seeks to extend the boundaries of what is considered a viable steganographic medium.
We explore a steganographic paradigm, where hidden information is communicated through the episodes of multiple agents interacting with an environment.
As a proof of concept, we exemplify action steganography through the game of labyrinth, a navigation task where subliminal communication is concealed within the act of steering toward a destination.
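The paper's labyrinth scheme is more involved; purely as a toy illustration of the paradigm, one can hide bits in an agent's otherwise-optimal actions by letting the covert message break ties whenever several moves are equally good. Everything below (the grid world and helper names) is a hypothetical sketch, not the authors' construction.

```python
# Toy action steganography: the secret bit picks among equally optimal moves.
def optimal_moves(pos, goal):
    """All axis moves that strictly reduce Manhattan distance to the goal."""
    moves = []
    if goal[0] > pos[0]: moves.append((1, 0))
    if goal[0] < pos[0]: moves.append((-1, 0))
    if goal[1] > pos[1]: moves.append((0, 1))
    if goal[1] < pos[1]: moves.append((0, -1))
    return moves

def encode(pos, goal, bits):
    """Walk to the goal; consume one secret bit at each two-way tie."""
    path, i = [], 0
    while pos != goal:
        moves = optimal_moves(pos, goal)
        if len(moves) > 1 and i < len(bits):
            move = moves[bits[i]]            # the hidden bit breaks the tie
            i += 1
        else:
            move = moves[0]
        pos = (pos[0] + move[0], pos[1] + move[1])
        path.append(move)
    return path

def decode(start, goal, path):
    """Recover bits by replaying the path and reading each tie-break."""
    bits, pos = [], start
    for move in path:
        moves = optimal_moves(pos, goal)
        if len(moves) > 1:
            bits.append(moves.index(move))
        pos = (pos[0] + move[0], pos[1] + move[1])
    return bits

# A real scheme would also frame the message length; this toy just slices.
path = encode((0, 0), (3, 3), [1, 0, 1])
assert decode((0, 0), (3, 3), path)[:3] == [1, 0, 1]
```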
- What could go wrong? Discovering and describing failure modes in computer vision [27.6114923305978] (arXiv, 2024-08-08)
We formalize the problem of Language-Based Error Explainability (LBEE) and propose solutions that operate in a joint vision-and-language embedding space.
We show that the proposed methodology isolates nontrivial sentences associated with specific error causes.
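The summary leaves the embedding space unspecified; one plausible instantiation (a sketch under the assumption of a CLIP-style joint space, not the paper's exact method) embeds misclassified images and a pool of candidate failure sentences, then ranks sentences by how much better they match the failure set than the correctly classified set.

```python
# Hypothetical LBEE-style ranking in a CLIP joint space (illustrative only).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def centroid(images):
    """Mean normalized image embedding of a set of PIL images."""
    with torch.no_grad():
        feats = model.get_image_features(**proc(images=images, return_tensors="pt"))
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.mean(dim=0)

def rank_explanations(failed_imgs, correct_imgs, sentences):
    """Score each sentence by (similarity to failures) - (similarity to successes)."""
    with torch.no_grad():
        txt = model.get_text_features(
            **proc(text=sentences, return_tensors="pt", padding=True))
    txt = txt / txt.norm(dim=-1, keepdim=True)
    gap = txt @ centroid(failed_imgs) - txt @ centroid(correct_imgs)
    order = gap.argsort(descending=True)
    return [(sentences[i], gap[i].item()) for i in order]

# Usage, with hypothetical candidate sentences:
# rank_explanations(failed, correct,
#                   ["the image is blurry", "the object is partially occluded"])
```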
- From Loops to Oops: Fallback Behaviors of Language Models Under Uncertainty [67.81977289444677] (arXiv, 2024-07-08)
Large language models (LLMs) often exhibit undesirable behaviors, such as hallucinations and sequence repetitions.
We categorize fallback behaviors (sequence repetitions, degenerate text, and hallucinations) and analyze them extensively.
Our experiments reveal a clear and consistent ordering of these fallback behaviors across all of the axes studied.
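As a trivial, hypothetical illustration of one of these categories (not the paper's analysis pipeline), back-to-back sequence repetition can be flagged with a simple n-gram check:

```python
# Flag the "sequence repetition" fallback: any n-gram repeated
# back-to-back at least min_repeats times.
def repeated_ngram(tokens, n=4, min_repeats=3):
    """Return a repeated n-gram if one occurs min_repeats times in a row."""
    for i in range(len(tokens) - n * min_repeats + 1):
        gram = tokens[i:i + n]
        if all(tokens[i + k * n : i + (k + 1) * n] == gram
               for k in range(1, min_repeats)):
            return gram
    return None

print(repeated_ngram("the cat sat the cat sat the cat sat on".split(), n=3))
# -> ['the', 'cat', 'sat']
```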
- Chaos with Keywords: Exposing Large Language Models Sycophantic Hallucination to Misleading Keywords and Evaluating Defense Strategies [47.92996085976817] (arXiv, 2024-06-06)
This study explores the sycophantic tendencies of Large Language Models (LLMs).
LLMs tend to provide answers that match what users want to hear, even if they are not entirely correct.
- A Cause-Effect Look at Alleviating Hallucination of Knowledge-grounded Dialogue Generation [51.53917938874146] (arXiv, 2024-04-04)
We propose a possible solution for alleviating hallucination in knowledge-grounded dialogue (KGD) by exploiting the dialogue-knowledge interaction.
Experimental results for our example implementation show that this method can reduce hallucination without degrading other aspects of dialogue performance.
- Passive learning of active causal strategies in agents and language models [15.086300301260811] (arXiv, 2023-05-25)
We show that purely passive learning can in fact allow an agent to learn generalizable strategies for determining and using causal structures.
We show that agents trained via imitation on expert data can indeed generalize at test time to infer and use causal links which are never present in the training data.
Explanations can even allow passive learners to generalize out-of-distribution from perfectly-confounded training data.
- Mutual Information Alleviates Hallucinations in Abstractive Summarization [73.48162198041884] (arXiv, 2022-10-24)
We find a simple criterion under which models are significantly more likely to assign more probability to hallucinated content during generation: high model uncertainty.
This finding offers a potential explanation for hallucinations: when uncertain about a continuation, models default to favoring text with high marginal probability.
We propose a decoding strategy that, when the model exhibits uncertainty, switches to optimizing for the pointwise mutual information between the source and the target token rather than purely the probability of the target token.
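A minimal sketch of such an uncertainty-gated decoding step, assuming access to two sets of next-token logits (with and without the source document in context); the entropy threshold and weighting below are illustrative choices, not the authors' settings.

```python
# Uncertainty-gated PMI decoding step (illustrative sketch).
# cond_logits:   next-token logits given source document + prefix
# uncond_logits: next-token logits given the prefix alone
import torch
import torch.nn.functional as F

def pmi_decode_step(cond_logits, uncond_logits, entropy_threshold=3.5, lam=1.0):
    log_p_cond = F.log_softmax(cond_logits, dim=-1)
    entropy = -(log_p_cond.exp() * log_p_cond).sum()
    if entropy > entropy_threshold:
        # Uncertain: score tokens by pointwise mutual information,
        # log p(y|source,prefix) - log p(y|prefix), penalizing tokens
        # that are likely regardless of the source (hallucinations).
        log_p_uncond = F.log_softmax(uncond_logits, dim=-1)
        scores = log_p_cond - lam * log_p_uncond
    else:
        # Confident: ordinary maximum-likelihood decoding.
        scores = log_p_cond
    return scores.argmax().item()
```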
- Noisy Agents: Self-supervised Exploration by Predicting Auditory Events [127.82594819117753] (arXiv, 2020-07-27)
We propose a novel type of intrinsic motivation for Reinforcement Learning (RL) that encourages the agent to understand the causal effect of its actions.
We train a neural network to predict the auditory events and use the prediction errors as intrinsic rewards to guide RL exploration.
Experimental results on Atari games show that our new intrinsic motivation significantly outperforms several state-of-the-art baselines.
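A hypothetical sketch of that reward signal (not the authors' code): a small predictor guesses the auditory event class from the current state and action, and its cross-entropy loss is paid to the agent as an intrinsic reward, so poorly predicted sounds drive exploration.

```python
# Curiosity from auditory-event prediction error (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuditoryPredictor(nn.Module):
    def __init__(self, state_dim, n_actions, n_sound_events):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_actions, 128), nn.ReLU(),
            nn.Linear(128, n_sound_events),
        )

    def forward(self, state, action_onehot):
        return self.net(torch.cat([state, action_onehot], dim=-1))

def intrinsic_reward(pred, state, action_onehot, sound_event, optim):
    """Train the predictor one step; its loss is the curiosity bonus."""
    logits = pred(state, action_onehot)           # (batch, n_sound_events)
    loss = F.cross_entropy(logits, sound_event)   # sound_event: (batch,) labels
    optim.zero_grad()
    loss.backward()
    optim.step()
    # High error = unfamiliar acoustic consequence = high intrinsic reward.
    return loss.item()
```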
This list is automatically generated from the titles and abstracts of the papers on this site.