Related papers: Language Models Identify Ambiguities and Exploit Loopholes

URL: http://arxiv.org/abs/2508.19546v2
Date: Tue, 16 Sep 2025 21:37:05 GMT
Title: Language Models Identify Ambiguities and Exploit Loopholes
Authors: Jio Choi, Mohit Bansal, Elias Stengel-Eskin
Abstract summary: We study the responses of large language models (LLMs) to loopholes. We find that models which exploit loopholes explicitly identify and reason about both ambiguity and conflicting goals.
Score: 67.74087963315213
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Studying the responses of large language models (LLMs) to loopholes presents a two-fold opportunity. First, it affords us a lens through which to examine ambiguity and pragmatics in LLMs, since exploiting a loophole requires identifying ambiguity and performing sophisticated pragmatic reasoning. Second, loopholes pose an interesting and novel alignment problem where the model is presented with conflicting goals and can exploit ambiguities to its own advantage. To address these questions, we design scenarios where LLMs are given a goal and an ambiguous user instruction in conflict with the goal, with scenarios covering scalar implicature, structural ambiguities, and power dynamics. We then measure different models' abilities to exploit loopholes to satisfy their given goals as opposed to the goals of the user. We find that both closed-source and stronger open-source models can identify ambiguities and exploit their resulting loopholes, presenting a potential AI safety risk. Our analysis indicates that models which exploit loopholes explicitly identify and reason about both ambiguity and conflicting goals.

Related papers

Discovering Implicit Large Language Model Alignment Objectives [28.70744709029665]
Large language model (LLM) alignment relies on complex reward signals that often obscure the specific behaviors being incentivized. We introduce -Disco, a framework that decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives. Our work provides a crucial tool for uncovering the implicit objectives in LLM alignment, paving the way for more transparent and safer AI development.
arXiv Detail & Related papers (2026-02-17T03:58:55Z)
Correct-Detect: Balancing Performance and Ambiguity Through the Lens of Coreference Resolution in LLMs [8.879888552904598]
Large Language Models (LLMs) are intended to reflect human linguistic competencies. We show that LLMs can achieve good performance with minimal prompting in both coreference disambiguation and the detection of ambiguity in coreference. We present the CORRECT-DETECT trade-off: though models have both capabilities and deploy them implicitly, successful performance balancing these two abilities remains elusive.
arXiv Detail & Related papers (2025-09-17T22:12:30Z)
Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts [79.1081247754018]
Large Language Models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks. We propose a framework based on Contact Searching Questions (CSQ) to quantify the likelihood of deception.
arXiv Detail & Related papers (2025-08-08T14:46:35Z)
When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models [9.05950721565821]
We study strategic deception in large language models (LLMs). We induce, detect, and control such deception in CoT-enabled LLMs. We achieve a 40% success rate in eliciting context-appropriate deception without explicit prompts.
arXiv Detail & Related papers (2025-06-05T11:44:19Z)
Disambiguation in Conversational Question Answering in the Era of LLMs and Agents: A Survey [54.90240495777929]
Ambiguity remains a fundamental challenge in Natural Language Processing (NLP). With the advent of Large Language Models (LLMs), addressing ambiguity has become even more critical due to their expanded capabilities and applications. This paper explores the definition, forms, and implications of ambiguity for language-driven systems.
arXiv Detail & Related papers (2025-05-18T20:53:41Z)
Aligning Language Models to Explicitly Handle Ambiguity [22.078095273053506]
We propose Alignment with Perceived Ambiguity (APA), a novel pipeline that aligns language models to deal with ambiguous queries. Experimental results on question-answering datasets demonstrate that APA empowers LLMs to explicitly detect and manage ambiguous queries. Our finding proves that APA excels beyond training with gold-standard labels, especially in out-of-distribution scenarios.
arXiv Detail & Related papers (2024-04-18T07:59:53Z)
Does Faithfulness Conflict with Plausibility? An Empirical Study in Explainable AI across NLP Tasks [9.979726030996051]
We show that Shapley value and LIME could attain greater faithfulness and plausibility. Our findings suggest that rather than optimizing for one dimension at the expense of the other, we could seek to optimize explainability algorithms with dual objectives.
arXiv Detail & Related papers (2024-03-29T20:28:42Z)
FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks. We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning. Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
Is the Elephant Flying? Resolving Ambiguities in Text-to-Image Generative Models [64.58271886337826]
We study ambiguities that arise in text-to-image generative models. We propose a framework to mitigate ambiguities in the prompts given to the systems by soliciting clarifications from the user.
arXiv Detail & Related papers (2022-11-17T17:12:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.