Related papers: Can LLMs Detect Ambiguous Plural Reference? An Analysis of Split-Antecedent and Mereological Reference

URL: http://arxiv.org/abs/2510.04581v1
Date: Mon, 06 Oct 2025 08:32:59 GMT
Title: Can LLMs Detect Ambiguous Plural Reference? An Analysis of Split-Antecedent and Mereological Reference
Authors: Dang Anh, Rick Nouwen, Massimo Poesio
Abstract summary: LLMs are sometimes aware of possible referents of ambiguous pronouns. They do not always follow human reference when choosing between interpretations. They struggle to identify ambiguity without direct instruction.
Score: 3.409902233585822
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Our goal is to study how LLMs represent and interpret plural reference in ambiguous and unambiguous contexts. We ask the following research questions: (1) Do LLMs exhibit human-like preferences in representing plural reference? (2) Are LLMs able to detect ambiguity in plural anaphoric expressions and identify possible referents? To address these questions, we design a set of experiments, examining pronoun production using next-token prediction tasks, pronoun interpretation, and ambiguity detection using different prompting strategies. We then assess how comparable LLMs are to humans in formulating and interpreting plural reference. We find that LLMs are sometimes aware of possible referents of ambiguous pronouns. However, they do not always follow human reference when choosing between interpretations, especially when the possible interpretation is not explicitly mentioned. In addition, they struggle to identify ambiguity without direct instruction. Our findings also reveal inconsistencies in the results across different types of experiments.

Related papers

Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation [89.52571224447111]
Large language model (LLM) judges have often been used alongside traditional, algorithm-based metrics for tasks like summarization. We provide an LLM judge bias analysis as a function of overlap with human-written responses in the domain of summarization.
arXiv Detail & Related papers (2026-02-07T19:39:28Z)
On the Detectability of LLM-Generated Text: What Exactly Is LLM-Generated Text? [8.484462568964682]
There is no consistent or precise definition of their target, namely "LLM-generated text". What is commonly regarded as the detecting target usually represents only a subset of the text that LLMs can potentially produce. Existing benchmarks and evaluation approaches do not adequately address the various conditions in real-world detector applications.
arXiv Detail & Related papers (2025-10-23T17:59:06Z)
Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity [16.065963688326242]
We study the trustworthiness of large language models (LLMs) when encountering ambiguous narrative text in Chinese. We created a benchmark dataset by collecting and generating ambiguous sentences with context and their corresponding disambiguated pairs. We discovered significant fragility in LLMs when handling ambiguity, revealing behavior that differs substantially from humans.
arXiv Detail & Related papers (2025-07-30T21:50:19Z)
Referential ambiguity and clarification requests: comparing human and LLM behaviour [11.336760165002831]
We present a new corpus that combines two existing annotations of the Minecraft Dialogue Corpus -- one for reference and ambiguity in reference, and one for SDRT including clarifications. We find that there is only a weak link between ambiguity and humans producing clarification questions in these dialogues. We question if LLMs' ability to ask clarification questions is predicated on their recent ability to simulate reasoning.
arXiv Detail & Related papers (2025-07-14T16:28:00Z)
Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese [52.98034458924209]
This study investigates whether Large Language Models exhibit differential performance when prompted in two variants of written Chinese. We design two benchmark tasks that reflect real-world scenarios: regional term choice and regional name choice. Our analyses indicate that biases in LLM responses are dependent on both the task and prompting language.
arXiv Detail & Related papers (2025-05-28T17:56:49Z)
Arbiters of Ambivalence: Challenges of Using LLMs in No-Consensus Tasks [52.098988739649705]
This study examines the biases and limitations of LLMs in three roles: answer generator, judge, and debater. We develop a "no-consensus" benchmark by curating examples that encompass a variety of a priori ambivalent scenarios. Our results show that while LLMs can provide nuanced assessments when generating open-ended answers, they tend to take a stance on no-consensus topics when employed as judges or debaters.
arXiv Detail & Related papers (2025-05-28T01:31:54Z)
Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English [66.97110551643722]
We investigate dialectal disparities in Large Language Models (LLMs) reasoning tasks. We find that LLMs produce less accurate responses and simpler reasoning chains and explanations for AAE inputs. These findings highlight systematic differences in how LLMs process and reason about different language varieties.
arXiv Detail & Related papers (2025-03-06T05:15:34Z)
Do LLMs write like humans? Variation in grammatical and rhetorical styles [0.6303112417588329]
Large language models (LLMs) are capable of writing grammatical text that follows instructions, answers questions, and solves problems. As they have advanced, it has become difficult to distinguish their output from human-written text.
arXiv Detail & Related papers (2024-10-21T15:35:44Z)
LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks. This study focuses on how LLMs can assist NLP researchers. To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z)
Aligning Language Models to Explicitly Handle Ambiguity [22.078095273053506]
We propose Alignment with Perceived Ambiguity (APA), a novel pipeline that aligns language models to deal with ambiguous queries. Experimental results on question-answering datasets demonstrate that APA empowers LLMs to explicitly detect and manage ambiguous queries. Our finding proves that APA excels beyond training with gold-standard labels, especially in out-of-distribution scenarios.
arXiv Detail & Related papers (2024-04-18T07:59:53Z)
Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all. We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires. Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z)
In-Context Impersonation Reveals Large Language Models' Strengths and Biases [56.61129643802483]
We ask LLMs to assume different personas before solving vision and language tasks. We find that LLMs pretending to be children of different ages recover human-like developmental stages. In a language-based reasoning task, we find that LLMs impersonating domain experts perform better than LLMs impersonating non-domain experts.
arXiv Detail & Related papers (2023-05-24T09:13:15Z)
