Correct-Detect: Balancing Performance and Ambiguity Through the Lens of Coreference Resolution in LLMs
- URL: http://arxiv.org/abs/2509.14456v2
- Date: Tue, 21 Oct 2025 17:46:51 GMT
- Title: Correct-Detect: Balancing Performance and Ambiguity Through the Lens of Coreference Resolution in LLMs
- Authors: Amber Shore, Russell Scheinberg, Ameeta Agrawal, So Young Lee
- Abstract summary: Large Language Models (LLMs) are intended to reflect human linguistic competencies. We show that LLMs can achieve good performance with minimal prompting in both coreference disambiguation and the detection of ambiguity in coreference. We present the CORRECT-DETECT trade-off: though models have both capabilities and deploy them implicitly, successful performance balancing these two abilities remains elusive.
- Score: 8.879888552904598
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are intended to reflect human linguistic competencies. But humans have access to a broad and embodied context, which is key in detecting and resolving linguistic ambiguities, even in isolated text spans. A foundational case of semantic ambiguity is found in the task of coreference resolution: how is a pronoun related to an earlier person mention? This capability is implicit in nearly every downstream task, and the presence of ambiguity at this level can alter performance significantly. We show that LLMs can achieve good performance with minimal prompting in both coreference disambiguation and the detection of ambiguity in coreference; however, they cannot do both at the same time. We present the CORRECT-DETECT trade-off: though models have both capabilities and deploy them implicitly, successful performance balancing these two abilities remains elusive.
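A minimal sketch, not taken from the paper, of the two task framings the abstract contrasts: forcing a single antecedent choice (the "correct" side) versus asking whether the pronoun is ambiguous at all (the "detect" side). The sentence, mention names, and prompt wording are illustrative assumptions.

```python
# An ambiguous coreference example: "she" could plausibly be Anna or Maria.
sentence = "Anna phoned Maria because she had missed the meeting."
mentions = ["Anna", "Maria"]

def disambiguation_prompt(text: str, pronoun: str, candidates: list[str]) -> str:
    """Frame the task as forced-choice antecedent selection (the CORRECT side)."""
    options = " or ".join(candidates)
    return (f'In the sentence "{text}", who does "{pronoun}" refer to: '
            f'{options}? Answer with one name.')

def detection_prompt(text: str, pronoun: str) -> str:
    """Frame the task as ambiguity detection (the DETECT side)."""
    return (f'In the sentence "{text}", is the referent of "{pronoun}" '
            f'ambiguous? Answer yes or no.')

print(disambiguation_prompt(sentence, "she", mentions))
print(detection_prompt(sentence, "she"))
```

The trade-off the paper reports is that a model prompted in the first framing commits to an antecedent even for genuinely ambiguous inputs, while the second framing elicits ambiguity judgments; sustaining both behaviors at once is what remains elusive.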
Related papers
- It Depends: Resolving Referential Ambiguity in Minimal Contexts with Commonsense Knowledge [3.340255811686752]
We investigate whether Large Language Models can leverage commonsense to resolve referential ambiguity in multi-turn conversations. We test DeepSeek v3, GPT-4o, Qwen3-32B, GPT-4o-mini, and Llama-3.1-8B via LLM-as-Judge and human annotations.
arXiv Detail & Related papers (2025-09-19T15:49:26Z)
- Do Large Language Models Understand Word Senses? [30.57967629882536]
We evaluate the Word Sense Disambiguation (WSD) capabilities of instruction-tuned Large Language Models (LLMs). We find that leading models such as GPT-4o and DeepSeek-V3 achieve performance on par with specialized WSD systems. Results reveal that LLMs can explain the meaning of words in context with up to 98% accuracy.
arXiv Detail & Related papers (2025-09-17T11:11:27Z)
- Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity [16.065963688326242]
We study the trustworthiness of large language models (LLMs) when encountering ambiguous narrative text in Chinese. We created a benchmark dataset by collecting and generating ambiguous sentences with context and their corresponding disambiguated pairs. We discovered significant fragility in LLMs when handling ambiguity, revealing behavior that differs substantially from humans.
arXiv Detail & Related papers (2025-07-30T21:50:19Z)
- Disambiguation in Conversational Question Answering in the Era of LLMs and Agents: A Survey [54.90240495777929]
Ambiguity remains a fundamental challenge in Natural Language Processing (NLP). With the advent of Large Language Models (LLMs), addressing ambiguity has become even more critical due to their expanded capabilities and applications. This paper explores the definition, forms, and implications of ambiguity for language-driven systems.
arXiv Detail & Related papers (2025-05-18T20:53:41Z)
- Aligning Language Models to Explicitly Handle Ambiguity [22.078095273053506]
We propose Alignment with Perceived Ambiguity (APA), a novel pipeline that aligns language models to deal with ambiguous queries.
Experimental results on question-answering datasets demonstrate that APA empowers LLMs to explicitly detect and manage ambiguous queries.
Our findings show that APA excels beyond training with gold-standard labels, especially in out-of-distribution scenarios.
arXiv Detail & Related papers (2024-04-18T07:59:53Z)
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
- Towards Effective Disambiguation for Machine Translation with Large Language Models [65.80775710657672]
We study the capabilities of large language models to translate "ambiguous sentences".
Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
arXiv Detail & Related papers (2023-09-20T22:22:52Z)
- Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)
- We're Afraid Language Models Aren't Modeling Ambiguity [136.8068419824318]
Managing ambiguity is a key part of human language understanding.
We characterize ambiguity in a sentence by its effect on entailment relations with another sentence.
We show that a multilabel NLI model can flag political claims in the wild that are misleading due to ambiguity.
arXiv Detail & Related papers (2023-04-27T17:57:58Z)
- Exploring Multi-Modal Representations for Ambiguity Detection & Coreference Resolution in the SIMMC 2.0 Challenge [60.616313552585645]
We present models for effective Ambiguity Detection and Coreference Resolution in Conversational AI.
Specifically, we use TOD-BERT and LXMERT based models, compare them to a number of baselines and provide ablation experiments.
Our results show that (1) language models are able to exploit correlations in the data to detect ambiguity; and (2) unimodal coreference resolution models can avoid the need for a vision component.
arXiv Detail & Related papers (2022-02-25T12:10:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.