Eliciting Language Model Behaviors with Investigator Agents
- URL: http://arxiv.org/abs/2502.01236v1
- Date: Mon, 03 Feb 2025 10:52:44 GMT
- Title: Eliciting Language Model Behaviors with Investigator Agents
- Authors: Xiang Lisa Li, Neil Chowdhury, Daniel D. Johnson, Tatsunori Hashimoto, Percy Liang, Sarah Schwettmann, Jacob Steinhardt
- Abstract summary: Language models exhibit complex, diverse behaviors when prompted with free-form text. We study the problem of behavior elicitation, where the goal is to search for prompts that induce specific target behaviors. We train investigator models to map randomly-chosen target behaviors to a diverse distribution of outputs that elicit them.
- Score: 93.34072434845162
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models exhibit complex, diverse behaviors when prompted with free-form text, making it difficult to characterize the space of possible outputs. We study the problem of behavior elicitation, where the goal is to search for prompts that induce specific target behaviors (e.g., hallucinations or harmful responses) from a target language model. To navigate the exponentially large space of possible prompts, we train investigator models to map randomly-chosen target behaviors to a diverse distribution of outputs that elicit them, similar to amortized Bayesian inference. We do this through supervised fine-tuning, reinforcement learning via DPO, and a novel Frank-Wolfe training objective to iteratively discover diverse prompting strategies. Our investigator models surface a variety of effective and human-interpretable prompts leading to jailbreaks, hallucinations, and open-ended aberrant behaviors, obtaining a 100% attack success rate on a subset of AdvBench (Harmful Behaviors) and an 85% hallucination rate.
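The abstract names supervised fine-tuning, DPO, and a Frank-Wolfe objective but does not spell out the pipeline, so the following is only a minimal sketch of one plausible step: constructing DPO preference pairs by scoring investigator-proposed prompts with the target model. All names here (`build_dpo_pairs`, `propose_prompts`, `log_p_behavior`) are hypothetical placeholders, not the paper's code.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    behavior: str   # target behavior we want the target model to produce
    chosen: str     # candidate prompt that elicits the behavior more strongly
    rejected: str   # candidate prompt that elicits it less strongly

def build_dpo_pairs(
    behaviors: List[str],
    propose_prompts: Callable[[str, int], List[str]],  # investigator's samples
    log_p_behavior: Callable[[str, str], float],       # target model's log p(behavior | prompt)
    n_candidates: int = 8,
) -> List[PreferencePair]:
    """Rank investigator-proposed prompts by how strongly the target model
    reproduces the desired behavior, then pair the best against the worst."""
    pairs = []
    for behavior in behaviors:
        candidates = propose_prompts(behavior, n_candidates)
        ranked = sorted(candidates, key=lambda prompt: log_p_behavior(prompt, behavior))
        pairs.append(PreferencePair(behavior, chosen=ranked[-1], rejected=ranked[0]))
    return pairs

# Toy usage with stand-in callables (real versions would query language models):
toy_pairs = build_dpo_pairs(
    behaviors=["Sorry, I cannot help with that."],
    propose_prompts=lambda b, n: [f"prompt variant {i} targeting {b!r}" for i in range(n)],
    log_p_behavior=lambda prompt, behavior: -abs(len(prompt) - len(behavior)),  # placeholder score
)
```

This only illustrates the preference-pair construction; the iterative Frank-Wolfe procedure for discovering diverse prompting strategies is not sketched here.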
Related papers
- HEAL: An Empirical Study on Hallucinations in Embodied Agents Driven by Large Language Models [30.596530112268848]
We present the first systematic study of hallucinations in large language models performing long-horizon tasks under scene-task inconsistencies. Our goal is to understand to what extent hallucinations occur, what types of inconsistencies trigger them, and how current models respond.
arXiv Detail & Related papers (2025-06-18T02:13:41Z) - ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language model benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z) - keepitsimple at SemEval-2025 Task 3: LLM-Uncertainty based Approach for Multilingual Hallucination Span Detection [0.0]
Identifying hallucination spans in text generated by black-box language models is essential for real-world applications. We present our solution to this problem, which capitalizes on the variability of stochastically sampled responses to identify hallucinated spans. We measure this divergence through entropy-based analysis, allowing accurate identification of hallucinated segments (a minimal sketch of this idea appears after the related-papers list below).
arXiv Detail & Related papers (2025-05-23T05:25:14Z) - What could go wrong? Discovering and describing failure modes in computer vision [27.6114923305978]
We formalize the problem of Language-Based Error Explainability (LBEE).
We propose solutions that operate in a joint vision-and-language embedding space.
We show that the proposed methodology isolates nontrivial sentences associated with specific error causes.
arXiv Detail & Related papers (2024-08-08T14:01:12Z) - From Loops to Oops: Fallback Behaviors of Language Models Under Uncertainty [67.81977289444677]
Large language models (LLMs) often exhibit undesirable behaviors, such as hallucinations and sequence repetitions.
We categorize fallback behaviors (sequence repetitions, degenerate text, and hallucinations) and analyze them extensively.
Our experiments reveal a clear and consistent ordering of fallback behaviors across all of the axes we examine.
arXiv Detail & Related papers (2024-07-08T16:13:42Z) - Chaos with Keywords: Exposing Large Language Models Sycophantic Hallucination to Misleading Keywords and Evaluating Defense Strategies [47.92996085976817]
This study explores the sycophantic tendencies of Large Language Models (LLMs).
LLMs tend to provide answers that match what users want to hear, even if they are not entirely correct.
arXiv Detail & Related papers (2024-06-06T08:03:05Z) - Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback [40.930238150365795]
We propose detecting and mitigating hallucinations in Large Vision Language Models (LVLMs) via fine-grained AI feedback.
We generate a small-scale hallucination annotation dataset using proprietary models.
Then, we propose a detect-then-rewrite pipeline to automatically construct a preference dataset for training a hallucination-mitigating model.
arXiv Detail & Related papers (2024-04-22T14:46:10Z) - A Cause-Effect Look at Alleviating Hallucination of Knowledge-grounded Dialogue Generation [51.53917938874146]
We propose a possible solution for alleviating hallucination in KGD by exploiting the dialogue-knowledge interaction.
Experimental results from our example implementation show that this method can reduce hallucination without degrading other aspects of dialogue performance.
arXiv Detail & Related papers (2024-04-04T14:45:26Z) - Unfamiliar Finetuning Examples Control How Language Models Hallucinate [75.03210107477157]
Large language models are known to hallucinate when faced with unfamiliar queries.
We find that unfamiliar examples in the models' finetuning data are crucial in shaping these errors.
Our work further investigates RL finetuning strategies for improving the factuality of long-form model generations.
arXiv Detail & Related papers (2024-03-08T18:28:13Z) - Passive learning of active causal strategies in agents and language models [15.086300301260811]
We show that purely passive learning can in fact allow an agent to learn generalizable strategies for determining and using causal structures.
We show that agents trained via imitation on expert data can indeed generalize at test time to infer and use causal links which are never present in the training data.
Explanations can even allow passive learners to generalize out-of-distribution from perfectly-confounded training data.
arXiv Detail & Related papers (2023-05-25T15:39:46Z) - Mutual Information Alleviates Hallucinations in Abstractive Summarization [73.48162198041884]
We find a simple criterion under which models are significantly more likely to assign more probability to hallucinated content during generation: high model uncertainty.
This finding offers a potential explanation for hallucinations: when uncertain about a continuation, models default to favoring text with high marginal probability.
We propose a decoding strategy that, when the model exhibits uncertainty, switches to optimizing for the pointwise mutual information of the source and target token rather than purely the probability of the target token.
arXiv Detail & Related papers (2022-10-24T13:30:54Z) - Noisy Agents: Self-supervised Exploration by Predicting Auditory Events [127.82594819117753]
We propose a novel type of intrinsic motivation for Reinforcement Learning (RL) that encourages the agent to understand the causal effect of its actions.
We train a neural network to predict the auditory events and use the prediction errors as intrinsic rewards to guide RL exploration.
Experimental results on Atari games show that our new intrinsic motivation significantly outperforms several state-of-the-art baselines.
arXiv Detail & Related papers (2020-07-27T17:59:08Z)
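As a concrete illustration of the entropy-based span detection summarized in the keepitsimple SemEval-2025 entry above, here is a minimal, hedged sketch. It assumes responses have been tokenized into roughly aligned token lists; the function names, threshold, and alignment strategy are illustrative and not taken from that system.

```python
import math
from collections import Counter
from typing import List, Tuple

def positional_entropy(samples: List[List[str]]) -> List[float]:
    """Empirical entropy of the token distribution at each position,
    computed over several sampled responses (truncated to the shortest)."""
    length = min(len(s) for s in samples)
    entropies = []
    for i in range(length):
        counts = Counter(s[i] for s in samples)
        total = sum(counts.values())
        entropies.append(-sum((c / total) * math.log(c / total) for c in counts.values()))
    return entropies

def flag_spans(samples: List[List[str]], threshold: float = 1.0) -> List[Tuple[int, int]]:
    """Merge consecutive high-entropy positions into candidate hallucinated spans."""
    entropies = positional_entropy(samples)
    spans, start = [], None
    for i, h in enumerate(entropies + [0.0]):  # trailing sentinel closes an open span
        if h > threshold and start is None:
            start = i
        elif h <= threshold and start is not None:
            spans.append((start, i))
            start = None
    return spans

# Toy usage: three sampled answers that agree only on the first two tokens.
samples = [["The", "capital", "is", "Paris"],
           ["The", "capital", "is", "Lyon"],
           ["The", "capital", "might", "be"]]
print(flag_spans(samples, threshold=0.5))  # prints [(2, 4)]
```

Positions where the sampled responses diverge (high empirical entropy) are flagged as candidate hallucination spans; agreement across samples (zero entropy) is treated as likely grounded content.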