Related papers: From Polyester Girlfriends to Blind Mice: Creating the First Pragmatics Understanding Benchmarks for Slovene

From Polyester Girlfriends to Blind Mice: Creating the First Pragmatics Understanding Benchmarks for Slovene

URL: http://arxiv.org/abs/2510.21575v1
Date: Fri, 24 Oct 2025 15:43:42 GMT
Title: From Polyester Girlfriends to Blind Mice: Creating the First Pragmatics Understanding Benchmarks for Slovene
Authors: Mojca Brglez, Špela Vintar,
Abstract summary: We introduce SloPragEval and SloPragMega, the first pragmatics understanding benchmarks for Slovene.<n>We discuss the difficulties of translation, describe the campaign to establish a human baseline, and report pilot evaluations with LLMs.<n>Our results indicate that current models have greatly improved in understanding nuanced language but may still fail to infer implied speaker meaning in non-literal utterances.
Score: 0.12277343096128711
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models are demonstrating increasing capabilities, excelling at benchmarks once considered very difficult. As their capabilities grow, there is a need for more challenging evaluations that go beyond surface-level linguistic competence. Namely, language competence involves not only syntax and semantics but also pragmatics, i.e., understanding situational meaning as shaped by context as well as linguistic and cultural norms. To contribute to this line of research, we introduce SloPragEval and SloPragMega, the first pragmatics understanding benchmarks for Slovene that contain altogether 405 multiple-choice questions. We discuss the difficulties of translation, describe the campaign to establish a human baseline, and report pilot evaluations with LLMs. Our results indicate that current models have greatly improved in understanding nuanced language but may still fail to infer implied speaker meaning in non-literal utterances, especially those that are culture-specific. We also observe a significant gap between proprietary and open-source models. Finally, we argue that benchmarks targeting nuanced language understanding and knowledge of the target culture must be designed with care, preferably constructed from native data, and validated with human responses.

Related papers

When Words Don't Mean What They Say: Figurative Understanding in Bengali Idioms [1.5840067220859924]
Figurative language understanding remains a significant challenge for Large Language Models (LLMs)<n>We introduce a new idiom dataset, a large-scale, culturally-grounded corpus of 10,361 Bengali idioms.<n>Each idiom is annotated under a comprehensive 19-field schema, established and refined through a deliberative expert consensus process.<n>We evaluate 30 state-of-the-art multilingual and instruction-tuned LLMs on the task of inferring figurative meaning.
arXiv Detail & Related papers (2026-02-13T13:26:11Z)
On the Same Wavelength? Evaluating Pragmatic Reasoning in Language Models across Broad Concepts [69.69818198773244]
We study a range of LMs on both language comprehension and language production.<n>We find that state-of-the-art LMs, but not smaller ones, achieve strong performance on language comprehension.
arXiv Detail & Related papers (2025-09-08T17:59:32Z)
Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning [84.94709351266557]
We focus on the trustworthiness of language models with respect to retrieval augmentation. We deem that retrieval-augmented language models have the inherent capabilities of supplying response according to both contextual and parametric knowledge. Inspired by aligning language models with human preference, we take the first step towards aligning retrieval-augmented language models to a status where it responds relying merely on the external evidence.
arXiv Detail & Related papers (2024-10-22T09:25:21Z)
MultiPragEval: Multilingual Pragmatic Evaluation of Large Language Models [0.5822010906632046]
This study introduces MultiPragEval, the first pragmatic evaluation of Large Language Models (LLMs) Comprising 1200 question units categorized according to Grice's Cooperative Principle, MultiPragEval enables an in-depth assessment of LLMs' contextual awareness and their ability to infer implied meanings. Our findings demonstrate that Claude3-Opus significantly outperforms other models in all tested languages, establishing a state-of-the-art in the field.
arXiv Detail & Related papers (2024-06-11T21:46:03Z)
From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency [13.154753046052527]
We focus on consistency across languages as well as paraphrases. We find that the model's multisense consistency is lacking and run several follow-up analyses to verify. We conclude that, in this aspect, the understanding of LLMs is still quite far from being consistent and human-like.
arXiv Detail & Related papers (2024-04-18T12:48:17Z)
Pragmatic Competence Evaluation of Large Language Models for the Korean Language [0.6757476692230009]
This study evaluates how well Large Language Models (LLMs) understand context-dependent expressions from a pragmatic standpoint, specifically in Korean. We use both Multiple-Choice Questions (MCQs) for automatic evaluation and Open-Ended Questions (OEQs) assessed by human experts.
arXiv Detail & Related papers (2024-03-19T12:21:20Z)
SOUL: Towards Sentiment and Opinion Understanding of Language [96.74878032417054]
We propose a new task called Sentiment and Opinion Understanding of Language (SOUL) SOUL aims to evaluate sentiment understanding through two subtasks: Review (RC) and Justification Generation (JG)
arXiv Detail & Related papers (2023-10-27T06:48:48Z)
BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models [56.93604813379634]
Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels. We propose a language-acquisition-friendly benchmark to probe spoken language models at the lexical and syntactic levels. We highlight two exciting challenges that need to be addressed for further progress: bridging the gap between text and speech and between clean speech and in-the-wild speech.
arXiv Detail & Related papers (2023-06-02T12:54:38Z)
Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial. We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments. The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)
The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs [26.118193748582197]
We evaluate four categories of widely used state-of-the-art models. We find that, despite only evaluating on utterances that require a binary inference, models in three of these categories perform close to random. These results suggest that certain fine-tuning strategies are far better at inducing pragmatic understanding in models.
arXiv Detail & Related papers (2022-10-26T19:04:23Z)
AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context. It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts. Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.