Stands to Reason: Investigating the Effect of Reasoning on Idiomaticity Detection
- URL: http://arxiv.org/abs/2508.13365v1
- Date: Mon, 18 Aug 2025 21:17:09 GMT
- Title: Stands to Reason: Investigating the Effect of Reasoning on Idiomaticity Detection
- Authors: Dylan Phelps, Rodrigo Wilkens, Edward Gow-Smith, Thomas Pickard, Maggie Mi, Aline Villavicencio,
- Abstract summary: We examine how reasoning capabilities in Large Language Models affect idiomaticity detection performance. We find the effect of reasoning to be smaller and more varied than expected. For smaller models, producing chain-of-thought (CoT) reasoning increases performance relative to Math-tuned intermediate models, but not to the levels of the base models.
- Score: 2.8330244018167945
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent trend towards utilisation of reasoning models has improved the performance of Large Language Models (LLMs) across many tasks which involve logical steps. One linguistic task that could benefit from this framing is idiomaticity detection, as a potentially idiomatic expression must first be understood before it can be disambiguated, and this understanding serves as a basis for reasoning. In this paper, we explore how reasoning capabilities in LLMs affect idiomaticity detection performance and examine the effect of model size. We evaluate, as open source representative models, the suite of DeepSeek-R1 distillation models ranging from 1.5B to 70B parameters across four idiomaticity detection datasets. We find the effect of reasoning to be smaller and more varied than expected. For smaller models, producing chain-of-thought (CoT) reasoning increases performance relative to Math-tuned intermediate models, but not to the levels of the base models, whereas larger models (14B, 32B, and 70B) show modest improvements. Our in-depth analyses reveal that larger models demonstrate good understanding of idiomaticity, successfully producing accurate definitions of expressions, while smaller models often fail to output the actual meaning. For this reason, we also experiment with providing definitions in the prompts of smaller models, which we show can improve performance in some cases.
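The three prompting conditions the abstract compares (direct classification, chain-of-thought reasoning, and definition-augmented prompts for smaller models) can be sketched as minimal templates. The wording below is hypothetical and is not taken from the paper's actual prompts.

```python
# Illustrative sketch of three prompting conditions for idiomaticity
# detection: direct classification, chain-of-thought (CoT), and
# definition-augmented prompts. All template wording is hypothetical.

def direct_prompt(sentence: str, expression: str) -> str:
    """Ask for a label with no intermediate reasoning."""
    return (
        f'In the sentence "{sentence}", is the expression '
        f'"{expression}" used idiomatically or literally? '
        "Answer with one word: idiomatic or literal."
    )

def cot_prompt(sentence: str, expression: str) -> str:
    """Elicit step-by-step reasoning before the label."""
    return (
        direct_prompt(sentence, expression)
        + " First explain what the expression means in this context, "
        "then give your one-word answer."
    )

def definition_prompt(sentence: str, expression: str, definition: str) -> str:
    """Supply the idiomatic meaning up front, as tested for smaller models."""
    return (
        f'The expression "{expression}" can idiomatically mean: {definition}. '
        + direct_prompt(sentence, expression)
    )

# Example with a classic potentially idiomatic expression.
p = cot_prompt("He kicked the bucket last year.", "kicked the bucket")
```

Under this sketch, the same sentence/expression pair is sent to the model under each condition and the one-word label is compared against the gold annotation.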
Related papers
- Fluid Representations in Reasoning Models [91.77876704697779]
We present a mechanistic analysis of how QwQ-32B processes abstract structural information. We find that QwQ-32B gradually improves its internal representation of actions and concepts during reasoning.
arXiv Detail & Related papers (2026-02-04T18:34:50Z)
- Rank-1 LoRAs Encode Interpretable Reasoning Signals [0.764671395172401]
Reasoning models leverage inference-time compute to significantly enhance the performance of language models on logical tasks. Despite their wide adoption, the mechanisms underpinning the enhanced performance of these reasoning models are not well understood. We show that the majority of new capabilities in reasoning models can be elicited by small, single-rank changes to base model parameters.
arXiv Detail & Related papers (2025-11-10T06:00:25Z)
- A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models [53.18562650350898]
Chain-of-thought (CoT) reasoning enhances the performance of large language models. We present the first comprehensive study of CoT faithfulness in large vision-language models.
arXiv Detail & Related papers (2025-05-29T18:55:05Z)
- Reasoning Capabilities of Large Language Models on Dynamic Tasks [0.017476232824732776]
Large language models excel on static benchmarks, but their ability as self-learning agents in dynamic environments remains unclear. We evaluate three prompting strategies: self-reflection, mutation, and planning across dynamic tasks with open-source models. We find that larger models generally outperform smaller ones, but that strategic prompting can close this performance gap.
arXiv Detail & Related papers (2025-05-15T17:53:47Z)
- Reasoning Towards Fairness: Mitigating Bias in Language Models through Reasoning-Guided Fine-Tuning [12.559028963968247]
We investigate the crucial relationship between a model's reasoning ability and fairness. We find that larger models with stronger reasoning abilities exhibit substantially lower stereotypical bias. We introduce ReGiFT, a novel approach that extracts structured reasoning traces from advanced reasoning models and infuses them into models that lack such capabilities.
arXiv Detail & Related papers (2025-04-08T03:21:51Z)
- DBR: Divergence-Based Regularization for Debiasing Natural Language Understanding Models [50.54264918467997]
Pre-trained language models (PLMs) have achieved impressive results on various natural language processing tasks. Recent research has revealed that these models often rely on superficial features and shortcuts instead of developing a genuine understanding of language. We propose Divergence Based Regularization (DBR) to mitigate this shortcut learning behavior.
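The abstract does not spell out how DBR's divergence penalty is computed. A common pattern for this family of debiasing methods, shown here purely as an illustrative assumption and not as DBR's actual formulation, is to add a KL-divergence term between the model's predictions on an example and its predictions on a shortcut-reduced variant of that example.

```python
import math

# Generic sketch of a divergence-based regularizer: penalize the KL
# divergence between the model's output distribution on the original input
# and on a shortcut-removed variant. The exact divergence and input pairing
# used by DBR are assumptions here, not taken from the paper.

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def regularized_loss(task_loss, probs_original, probs_debiased, lam=0.1):
    """Task loss plus a divergence penalty weighted by lam."""
    return task_loss + lam * kl_divergence(probs_original, probs_debiased)
```

The penalty is zero when the two distributions agree, so the model is only pushed away from predictions that change once the shortcut feature is removed.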
arXiv Detail & Related papers (2025-02-25T16:44:10Z)
- Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [113.49074603075032]
Recent studies have shown that making a model spend more time thinking through longer Chains of Thought (CoTs) enables it to gain significant improvements in complex reasoning tasks. We explore whether scaling with longer CoTs can in fact impair the reasoning performance of Large Language Models (LLMs) in certain domains.
arXiv Detail & Related papers (2025-02-25T10:48:05Z)
- Efficient argument classification with compact language models and ChatGPT-4 refinements [0.0]
This paper presents comparative studies between a few deep learning-based models in argument mining.
The main novelty of this paper is an ensemble model based on the BERT architecture, with ChatGPT-4 used as a fine-tuning refinement model.
The presented results show that BERT+ChatGPT-4 outperforms the rest of the models including other Transformer-based and LSTM-based models.
arXiv Detail & Related papers (2024-03-20T16:24:10Z)
- GLoRE: Evaluating Logical Reasoning of Large Language Models [20.77694584450457]
We introduce GLoRE, a platform that consolidates diverse datasets and standardizes them into a unified format for evaluating large language models. Our experimental results show that compared to the performance of humans and supervised fine-tuning models, the logical reasoning capabilities of large reasoning models, such as OpenAI's o1 mini, DeepSeek R1 and QwQ-32B, have seen remarkable improvements.
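A unified evaluation record of the kind GLoRE describes might look like the following sketch; the field names and mapping are illustrative assumptions, not GLoRE's actual schema.

```python
from dataclasses import dataclass, asdict, field

# Hypothetical shared schema for heterogeneous logical-reasoning datasets;
# every field name here is an illustrative assumption.

@dataclass
class ReasoningExample:
    dataset: str                      # source dataset name
    context: str                      # passage or premises
    question: str                     # the reasoning question
    options: list = field(default_factory=list)  # answer choices
    answer: str = ""                  # gold label

def to_unified(raw: dict, dataset: str) -> dict:
    """Map a source-specific record onto the shared schema."""
    ex = ReasoningExample(
        dataset=dataset,
        context=raw.get("passage", ""),
        question=raw.get("question", ""),
        options=raw.get("choices", []),
        answer=raw.get("label", ""),
    )
    return asdict(ex)
```

Standardizing records this way lets one evaluation loop score every dataset without per-dataset parsing code.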
arXiv Detail & Related papers (2023-10-13T13:52:15Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- To what extent do human explanations of model behavior align with actual model behavior? [91.67905128825402]
We investigated the extent to which human-generated explanations of models' inference decisions align with how models actually make these decisions.
We defined two alignment metrics that quantify how well natural language human explanations align with model sensitivity to input words.
We find that a model's alignment with human explanations is not predicted by the model's accuracy on NLI.
arXiv Detail & Related papers (2020-12-24T17:40:06Z)
- CausaLM: Causal Model Explanation Through Counterfactual Language Models [33.29636213961804]
CausaLM is a framework for producing causal model explanations using counterfactual language representation models.
We show that language representation models such as BERT can effectively learn a counterfactual representation for a given concept of interest.
A byproduct of our method is a language representation model that is unaffected by the tested concept.
arXiv Detail & Related papers (2020-05-27T15:06:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.