It Depends: Resolving Referential Ambiguity in Minimal Contexts with Commonsense Knowledge
- URL: http://arxiv.org/abs/2509.16107v1
- Date: Fri, 19 Sep 2025 15:49:26 GMT
- Title: It Depends: Resolving Referential Ambiguity in Minimal Contexts with Commonsense Knowledge
- Authors: Lukas Ellinger, Georg Groh
- Abstract summary: We investigate whether Large Language Models can leverage commonsense to resolve referential ambiguity in multi-turn conversations. We test DeepSeek v3, GPT-4o, Qwen3-32B, GPT-4o-mini, and Llama-3.1-8B via LLM-as-Judge and human annotations.
- Score: 3.340255811686752
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ambiguous words or underspecified references require interlocutors to resolve them, often by relying on shared context and commonsense knowledge. Therefore, we systematically investigate whether Large Language Models (LLMs) can leverage commonsense to resolve referential ambiguity in multi-turn conversations and analyze their behavior when ambiguity persists. Further, we study how requests for simplified language affect this capacity. Using a novel multilingual evaluation dataset, we test DeepSeek v3, GPT-4o, Qwen3-32B, GPT-4o-mini, and Llama-3.1-8B via LLM-as-Judge and human annotations. Our findings indicate that current LLMs struggle to resolve ambiguity effectively: they tend to commit to a single interpretation or cover all possible references, rather than hedging or seeking clarification. This limitation becomes more pronounced under simplification prompts, which drastically reduce the use of commonsense reasoning and diverse response strategies. Fine-tuning Llama-3.1-8B with Direct Preference Optimization substantially improves ambiguity resolution across all request types. These results underscore the need for advanced fine-tuning to improve LLMs' handling of ambiguity and to ensure robust performance across diverse communication styles.
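The abstract reports that fine-tuning Llama-3.1-8B with Direct Preference Optimization (DPO) substantially improves ambiguity resolution. As background, the per-pair DPO objective can be sketched in a few lines; this is the standard DPO loss from the literature, not the authors' training pipeline, and the log-probability values below are hypothetical:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair Direct Preference Optimization loss.

    Each argument is the summed token log-probability of the preferred
    (chosen) or dispreferred (rejected) response under either the policy
    being fine-tuned or the frozen reference model.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)): small when the policy prefers the chosen
    # response more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When policy and reference agree exactly, the loss is log 2 ~= 0.693.
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
```

In practice this objective is optimized over batches of preference pairs (e.g. responses that hedge or ask for clarification preferred over premature commitments); libraries such as Hugging Face TRL provide trainers that implement it at scale.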
Related papers
- ClarifyMT-Bench: Benchmarking and Improving Multi-Turn Clarification for Conversational Large Language Models [32.099137908375546]
ClarifyMT-Bench is a benchmark for multi-turn clarification in large language models (LLMs). We construct 6,120 multi-turn dialogues capturing diverse ambiguity sources and interaction patterns. We propose ClarifyAgent, an agentic approach that decomposes clarification into perception, forecasting, tracking, and planning.
arXiv Detail & Related papers (2025-12-24T11:39:00Z)
- Plug-and-Play Clarifier: A Zero-Shot Multimodal Framework for Egocentric Intent Disambiguation [60.63465682731118]
The performance of egocentric AI agents is fundamentally limited by multimodal intent ambiguity. We introduce the Plug-and-Play Clarifier, a zero-shot and modular framework that decomposes the problem into discrete, solvable sub-tasks. Our framework improves the intent clarification performance of small language models by approximately 30%, making them competitive with significantly larger counterparts.
arXiv Detail & Related papers (2025-11-12T04:28:14Z)
- Reasoning-enhanced Query Understanding through Decomposition and Interpretation [130.19204432111277]
ReDI is a Reasoning-enhanced approach for query understanding through Decomposition and Interpretation. We compiled a large-scale dataset of real-world complex queries from a major search engine. Experiments on BRIGHT and BEIR demonstrate that ReDI consistently surpasses strong baselines in both sparse and dense retrieval paradigms.
arXiv Detail & Related papers (2025-09-08T10:58:42Z)
- Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity [16.065963688326242]
We study the trustworthiness of large language models (LLMs) when encountering ambiguous narrative text in Chinese. We created a benchmark dataset by collecting and generating ambiguous sentences with context and their corresponding disambiguated pairs. We discovered significant fragility in LLMs when handling ambiguity, revealing behavior that differs substantially from humans.
arXiv Detail & Related papers (2025-07-30T21:50:19Z)
- Simplifications are Absolutists: How Simplified Language Reduces Word Sense Awareness in LLM-Generated Definitions [2.6217304977339473]
We investigate how simplification impacts homonym definition quality across three target groups: Normal, Simple, and ELI5. Our results show that simplification drastically degrades definition completeness by neglecting polysemy, increasing the risk of misunderstanding. These findings highlight the need to balance simplicity and completeness in educational NLP to ensure reliable, context-aware definitions for all learners.
arXiv Detail & Related papers (2025-07-16T07:25:27Z)
- Disambiguation in Conversational Question Answering in the Era of LLMs and Agents: A Survey [54.90240495777929]
Ambiguity remains a fundamental challenge in Natural Language Processing (NLP). With the advent of Large Language Models (LLMs), addressing ambiguity has become even more critical due to their expanded capabilities and applications. This paper explores the definition, forms, and implications of ambiguity for language-driven systems.
arXiv Detail & Related papers (2025-05-18T20:53:41Z)
- Who Relies More on World Knowledge and Bias for Syntactic Ambiguity Resolution: Humans or LLMs? [2.3749120526936465]
This study explores how recent large language models (LLMs) navigate relative clause attachment ambiguity in six typologically diverse languages.
arXiv Detail & Related papers (2025-03-13T19:44:15Z)
- Do LLMs Understand Ambiguity in Text? A Case Study in Open-world Question Answering [15.342415325821063]
Ambiguity in natural language poses significant challenges to Large Language Models (LLMs) used for open-domain question answering.
We compare off-the-shelf and few-shot LLM performance, focusing on measuring the impact of explicit disambiguation strategies.
We demonstrate how simple, training-free, token-level disambiguation methods may be effectively used to improve LLM performance for ambiguous question answering tasks.
arXiv Detail & Related papers (2024-11-19T10:27:26Z)
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
- Towards Effective Disambiguation for Machine Translation with Large Language Models [65.80775710657672]
We study the capabilities of large language models to translate "ambiguous sentences".
Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
arXiv Detail & Related papers (2023-09-20T22:22:52Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations reveal how comprehensively language models grasp language, in particular their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
- We're Afraid Language Models Aren't Modeling Ambiguity [136.8068419824318]
Managing ambiguity is a key part of human language understanding.
We characterize ambiguity in a sentence by its effect on entailment relations with another sentence.
We show that a multilabel NLI model can flag political claims in the wild that are misleading due to ambiguity.
arXiv Detail & Related papers (2023-04-27T17:57:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.