Related papers: BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models

BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models

URL: http://arxiv.org/abs/2511.13095v1
Date: Mon, 17 Nov 2025 07:50:12 GMT
Title: BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models
Authors: Chuyuan Li, Giuseppe Carenini,
Abstract summary: We introduce BeDiscovER, an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs.<n>BeDiscovER compiles 5 publicly available discourse tasks across discourse lexicon, (multi-)sentential, and documental levels, with in total 52 individual datasets.
Score: 13.300475053766862
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks across discourse lexicon, (multi-)sentential, and documental levels, with in total 52 individual datasets. It covers both extensively studied tasks such as discourse parsing and temporal relation extraction, as well as some novel challenges such as discourse particle disambiguation (e.g., ``just''), and also aggregates a shared task on Discourse Relation Parsing and Treebanking for multilingual and multi-framework discourse relation classification. We evaluate open-source LLMs: Qwen3 series, DeepSeek-R1, and frontier model such as GPT-5-mini on BeDiscovER, and find that state-of-the-art models exhibit strong performance in arithmetic aspect of temporal reasoning, but they struggle with full document reasoning and some subtle semantic and discourse phenomena, such as rhetorical relation recognition.

Related papers

DiscoTrack: A Multilingual LLM Benchmark for Discourse Tracking [6.8009771183515575]
We present DiscoTrack, an LLM benchmark target- ing a range of tasks across 12 languages and four levels of discourse understanding.<n>Our evaluation shows that these tasks remain challenging, even for state-of-the-art models.
arXiv Detail & Related papers (2025-10-19T21:26:27Z)
Beyond Chunking: Discourse-Aware Hierarchical Retrieval for Long Document Question Answering [51.7493726399073]
We present a discourse-aware hierarchical framework to enhance long document question answering.<n>The framework involves three key innovations: specialized discourse parsing for lengthy documents, LLM-based enhancement of discourse relation nodes, and structure-guided hierarchical retrieval.
arXiv Detail & Related papers (2025-05-26T14:45:12Z)
Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning [5.4141465747474475]
Large language models (LLMs) have achieved remarkable performance in generating human-like text and solving problems of moderate complexity.<n>We systematically evaluate abstract common-sense reasoning in LLMs using the ConceptNet knowledge graph.
arXiv Detail & Related papers (2025-02-19T20:20:24Z)
DEPTH: Discourse Education through Pre-Training Hierarchically [33.89893399779713]
DEPTH is an encoder-decoder model that learns latent representations for sentences using a discourse-oriented pre-training objective.<n>Our approach extends the discourse capabilities of T5, while minimally impacting other natural language understanding (NLU) capabilities.
arXiv Detail & Related papers (2024-05-13T14:35:30Z)
Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process. Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training [56.74440457571821]
We analyze tasks covering syntax, semantics and reasoning, across 2M pre-training steps and five seeds. We identify critical learning phases across tasks and time, during which subspaces emerge, share information, and later disentangle to specialize. Our findings have implications for model interpretability, multi-task learning, and learning from limited data.
arXiv Detail & Related papers (2023-10-25T09:09:55Z)
Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks. Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena. For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial. We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments. The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)
Cue-CoT: Chain-of-thought Prompting for Responding to In-depth Dialogue Questions with LLMs [59.74002011562726]
We propose a novel linguistic cue-based chain-of-thoughts (textitCue-CoT) to provide a more personalized and engaging response. We build a benchmark with in-depth dialogue questions, consisting of 6 datasets in both Chinese and English. Empirical results demonstrate our proposed textitCue-CoT method outperforms standard prompting methods in terms of both textithelpfulness and textitacceptability on all datasets.
arXiv Detail & Related papers (2023-05-19T16:27:43Z)
ChatGPT Evaluation on Sentence Level Relations: A Focus on Temporal, Causal, and Discourse Relations [52.26802326949116]
We quantitatively evaluate the performance of ChatGPT, an interactive large language model, on inter-sentential relations. ChatGPT exhibits exceptional proficiency in detecting and reasoning about causal relations. It is capable of identifying the majority of discourse relations with existing explicit discourse connectives, but the implicit discourse relation remains a formidable challenge.
arXiv Detail & Related papers (2023-04-28T13:14:36Z)
Topics in the Haystack: Extracting and Evaluating Topics beyond Coherence [0.0]
We propose a method that incorporates a deeper understanding of both sentence and document themes. This allows our model to detect latent topics that may include uncommon words or neologisms. We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task.
arXiv Detail & Related papers (2023-03-30T12:24:25Z)
Discourse Parsing of Contentious, Non-Convergent Online Discussions [0.16311150636417257]
Inspired by the Bakhtinian theory of Dialogism, we propose a novel theoretical and computational framework. We develop a novel discourse annotation schema which reflects a hierarchy of discursive strategies. We share the first labeled dataset of contentious non-convergent online discussions.
arXiv Detail & Related papers (2020-12-08T17:36:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.