Related papers: Rolling the DICE on Idiomaticity: How LLMs Fail to Grasp Context

Rolling the DICE on Idiomaticity: How LLMs Fail to Grasp Context

URL: http://arxiv.org/abs/2410.16069v1
Date: Mon, 21 Oct 2024 14:47:37 GMT
Title: Rolling the DICE on Idiomaticity: How LLMs Fail to Grasp Context
Authors: Maggie Mi, Aline Villavicencio, Nafise Sadat Moosavi,
Abstract summary: We construct a novel, controlled contrastive dataset designed to test whether LLMs can effectively use context to disambiguate idiomatic meaning. Our findings reveal that LLMs often fail to resolve idiomaticity when it is required to attend to the surrounding context. We make our code and dataset publicly available.
Score: 12.781022584125925
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Human processing of idioms relies on understanding the contextual sentences in which idioms occur, as well as language-intrinsic features such as frequency and speaker-intrinsic factors like familiarity. While LLMs have shown high performance on idiomaticity detection tasks, this success may be attributed to reasoning shortcuts in existing datasets. To this end, we construct a novel, controlled contrastive dataset designed to test whether LLMs can effectively use context to disambiguate idiomatic meaning. Additionally, we explore how collocational frequency and sentence probability influence model performance. Our findings reveal that LLMs often fail to resolve idiomaticity when it is required to attend to the surrounding context, and that models perform better on sentences that have higher likelihood. The collocational frequency of expressions also impacts performance. We make our code and dataset publicly available.

Related papers

END: Early Noise Dropping for Efficient and Effective Context Denoising [60.24648712022382]
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. They are often distracted by irrelevant or noisy context in input sequences that degrades output quality. We introduce Early Noise Dropping (textscEND), a novel approach to mitigate this issue without requiring fine-tuning the LLMs.
arXiv Detail & Related papers (2025-02-26T08:07:17Z)
ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models [75.05436691700572]
We introduce ExpliCa, a new dataset for evaluating Large Language Models (LLMs) in explicit causal reasoning. We tested seven commercial and open-source LLMs on ExpliCa through prompting and perplexity-based metrics. Surprisingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events.
arXiv Detail & Related papers (2025-02-21T14:23:14Z)
Latent Factor Models Meets Instructions: Goal-conditioned Latent Factor Discovery without Task Supervision [50.45597801390757]
Instruct-LF is a goal-oriented latent factor discovery system. It integrates instruction-following ability with statistical models to handle noisy datasets.
arXiv Detail & Related papers (2025-02-21T02:03:08Z)
Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs) We find that fine-tuning existing text embedding models on LLM-generated texts yields excellent classification accuracy. We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z)
Mitigating the Language Mismatch and Repetition Issues in LLM-based Machine Translation via Model Editing [39.375342978538654]
We focus on utilizing Large Language Models (LLMs) to perform machine translation. We observe that two patterns of errors frequently occur and drastically affect the translation quality: language mismatch and repetition. We explore the potential for mitigating these two issues by leveraging model editing methods.
arXiv Detail & Related papers (2024-10-09T16:51:21Z)
Traffic Light or Light Traffic? Investigating Phrasal Semantics in Large Language Models [41.233879429714925]
This study critically examines the capacity of API-based large language models to comprehend phrase semantics. We assess the performance of LLMs in executing phrase semantic reasoning tasks guided by natural language instructions. We conduct detailed error analyses to interpret the limitations faced by LLMs in comprehending phrase semantics.
arXiv Detail & Related papers (2024-10-03T08:44:17Z)
SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM. We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
Robustness of LLMs to Perturbations in Text [2.0670689746336]
Large language models (LLMs) have shown impressive performance, but can they handle the inevitable noise in real-world data? This work tackles this critical question by investigating LLMs' resilience against morphological variations in text. Our findings show that contrary to popular beliefs, generative LLMs are quiet robust to noisy perturbations in text.
arXiv Detail & Related papers (2024-07-12T04:50:17Z)
Potential and Limitations of LLMs in Capturing Structured Semantics: A Case Study on SRL [78.80673954827773]
Large Language Models (LLMs) play a crucial role in capturing structured semantics to enhance language understanding, improve interpretability, and reduce bias. We propose using Semantic Role Labeling (SRL) as a fundamental task to explore LLMs' ability to extract structured semantics. We find interesting potential: LLMs can indeed capture semantic structures, and scaling-up doesn't always mirror potential. We are surprised to discover that significant overlap in the errors is made by both LLMs and untrained humans, accounting for almost 30% of all errors.
arXiv Detail & Related papers (2024-05-10T11:44:05Z)
LLMs' Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements [59.71218039095155]
Task of reading comprehension (RC) provides a primary means to assess language models' natural language understanding (NLU) capabilities. If the context aligns with the models' internal knowledge, it is hard to discern whether the models' answers stem from context comprehension or from internal information. To address this issue, we suggest to use RC on imaginary data, based on fictitious facts and entities.
arXiv Detail & Related papers (2024-04-09T13:08:56Z)
The Strong Pull of Prior Knowledge in Large Language Models and Its Impact on Emotion Recognition [74.04775677110179]
In-context Learning (ICL) has emerged as a powerful paradigm for performing natural language tasks with Large Language Models (LLM) We show that LLMs have strong yet inconsistent priors in emotion recognition that ossify their predictions. Our results suggest that caution is needed when using ICL with larger LLMs for affect-centered tasks outside their pre-training domain.
arXiv Detail & Related papers (2024-03-25T19:07:32Z)
See the Unseen: Better Context-Consistent Knowledge-Editing by Noises [73.54237379082795]
Knowledge-editing updates knowledge of large language models (LLMs) Existing works ignore this property and the editing lacks generalization. We empirically find that the effects of different contexts upon LLMs in recalling the same knowledge follow a Gaussian-like distribution.
arXiv Detail & Related papers (2024-01-15T09:09:14Z)
Are Large Language Models Temporally Grounded? [38.481606493496514]
We provide Large language models (LLMs) with textual narratives. We probe them with respect to their common-sense knowledge of the structure and duration of events. We evaluate state-of-the-art LLMs on three tasks reflecting these abilities.
arXiv Detail & Related papers (2023-11-14T18:57:15Z)
TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety. Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs. We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
Causal Reasoning and Large Language Models: Opening a New Frontier for Causality [29.433401785920065]
Large language models (LLMs) can generate causal arguments with high probability. LLMs may be used by human domain experts to save effort in setting up a causal analysis.
arXiv Detail & Related papers (2023-04-28T19:00:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.