NumPert: Numerical Perturbations to Probe Language Models for Veracity Prediction
- URL: http://arxiv.org/abs/2511.09971v1
- Date: Fri, 14 Nov 2025 01:22:57 GMT
- Title: NumPert: Numerical Perturbations to Probe Language Models for Veracity Prediction
- Authors: Peter Røysland Aarnes, Vinay Setty
- Abstract summary: We present a systematic evaluation of state-of-the-art models for veracity prediction on numerical claims and evidence pairs. Results indicate that even leading proprietary systems experience accuracy drops of up to 62% under certain perturbations. These findings highlight critical limitations in numerical fact-checking and suggest that robustness remains an open challenge for current language models.
- Score: 7.856998585396422
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models show strong performance on knowledge-intensive tasks such as fact-checking and question answering, yet they often struggle with numerical reasoning. We present a systematic evaluation of state-of-the-art models for veracity prediction on numerical claim-evidence pairs using controlled perturbations, including label-flipping probes, to test robustness. Our results indicate that even leading proprietary systems experience accuracy drops of up to 62% under certain perturbations. No model proves robust across all conditions. We further find that increasing context length generally reduces accuracy, but when extended context is enriched with perturbed demonstrations, most models substantially recover. These findings highlight critical limitations in numerical fact-checking and suggest that robustness remains an open challenge for current language models.
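To make the label-flipping probes concrete, here is a minimal sketch assuming a simple regex-based scheme; the function name, perturbation factor, and example pair are illustrative, not taken from the paper's code or data:

```python
import re

def perturb_number(claim: str, factor: float = 10.0) -> str:
    """Scale the first number in the claim so that evidence which
    previously supported it should now refute it (a label flip)."""
    match = re.search(r"\d+(?:\.\d+)?", claim)
    if match is None:
        return claim  # nothing numeric to perturb
    value = float(match.group()) * factor
    new_text = str(int(value)) if value.is_integer() else str(value)
    return claim[:match.start()] + new_text + claim[match.end():]

claim = "The city recorded 120 new infections last week."
evidence = "Official reports confirm 120 new infections were recorded last week."

flipped = perturb_number(claim)
print(flipped)  # The city recorded 1200 new infections last week.
# A robust verifier should now predict REFUTED for (flipped, evidence),
# even though the claim's surface form barely changed.
```

Scoring a verifier on such minimally edited pairs, with and without perturbed demonstrations in the prompt, is one way to reproduce the robustness and context-length effects the abstract describes.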
Related papers
- How Focused Are LLMs? A Quantitative Study via Repetitive Deterministic Prediction Tasks [0.9338697277815541]
We investigate the performance of large language models on repetitive deterministic prediction tasks. Our experiments reveal a sharp double-exponential drop beyond a characteristic length scale, indicating that the models fail to execute each operation independently. A minimal sketch of such a probe follows this entry.
arXiv Detail & Related papers (2025-11-02T01:42:08Z)
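A hedged sketch of how such a repetitive probe might look; `query_model` is a hypothetical stand-in for any LLM API, and the iterated-addition task is an illustrative choice, not necessarily the one used in the paper:

```python
def make_task(n: int) -> tuple[str, str]:
    """Iterated addition: a deterministic operation repeated n times."""
    prompt = (f"Start at 0 and add 3, repeated {n} times. "
              "Reply with only the final number.")
    return prompt, str(3 * n)

def accuracy_at_length(query_model, n: int, trials: int = 20) -> float:
    """Fraction of exact-match answers at repetition length n.
    `query_model(prompt) -> str` is assumed, not a real API."""
    prompt, gold = make_task(n)
    hits = sum(query_model(prompt).strip() == gold for _ in range(trials))
    return hits / trials

# Sweeping n and plotting accuracy_at_length should expose the
# characteristic length scale beyond which accuracy collapses.
```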
- LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models [49.92148175114169]
We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions. Models exhibit extreme sensitivity to perturbation factors, including camera viewpoints and robot initial states. Surprisingly, models are largely insensitive to language variations, with further experiments revealing that models tend to ignore language instructions completely.
arXiv Detail & Related papers (2025-10-15T14:51:36Z)
- Inverse Scaling in Test-Time Compute [51.16323216811257]
Extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance. We identify five distinct failure modes when models reason for longer. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns.
arXiv Detail & Related papers (2025-07-19T00:06:13Z)
- Are vision language models robust to uncertain inputs? [5.249651874118556]
We show that newer and larger vision language models exhibit improved robustness compared to earlier models, but still suffer from a tendency to strictly follow instructions. For natural images such as ImageNet, this limitation can be overcome without pipeline modifications. We propose a novel mechanism based on caption diversity to reveal a model's internal uncertainty.
arXiv Detail & Related papers (2025-05-17T03:16:49Z)
- Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage. Models may behave unreliably due to poorly explored failure modes. Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z)
- LoGU: Long-form Generation with Uncertainty Expressions [49.76417603761989]
We introduce the task of Long-form Generation with Uncertainty (LoGU). We identify two key challenges: Uncertainty Suppression and Uncertainty Misalignment. Our framework adopts a divide-and-conquer strategy, refining uncertainty based on atomic claims. Experiments on three long-form instruction following datasets show that our method significantly improves accuracy, reduces hallucinations, and maintains the comprehensiveness of responses.
arXiv Detail & Related papers (2024-10-18T09:15:35Z)
- Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning [57.4036085386653]
We show that prompt-based models for sentence-pair classification tasks still suffer from a common pitfall: adopting inference heuristics based on lexical overlap.
We then show that adding a regularization term that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning.
arXiv Detail & Related papers (2021-09-09T10:10:29Z)
- NLI Data Sanity Check: Assessing the Effect of Data Corruption on Model Performance [3.7024660695776066]
We propose a new diagnostic test suite for assessing whether a dataset constitutes a good testbed for evaluating models' meaning-understanding capabilities.
We specifically apply controlled corruption transformations to widely used benchmarks (MNLI and ANLI); one such corruption is sketched after this entry.
A large decrease in model accuracy indicates that the original dataset provides a proper challenge to the models' reasoning capabilities.
arXiv Detail & Related papers (2021-04-10T12:28:07Z)
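As a hedged illustration of a controlled corruption, word-order shuffling is one plausible transformation; the paper applies several, and this specific choice is an assumption:

```python
import random

def shuffle_words(sentence: str, seed: int = 0) -> str:
    """Permute word order: syntax is destroyed but the bag of words
    is preserved. If a model's accuracy barely drops on the corrupted
    data, it likely relies on lexical overlap rather than meaning."""
    words = sentence.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

hypothesis = "A person is performing music on a stage."
print(shuffle_words(hypothesis))  # e.g. "stage. music a on person A is performing"
```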
- Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated; a thresholding sketch follows this entry.
arXiv Detail & Related papers (2020-02-09T19:53:23Z)
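One simple instance of such a test is to threshold a language model's log-perplexity. The sketch below uses GPT-2 via the Hugging Face `transformers` library; the threshold value is illustrative and would need calibration, and this is a simplified stand-in for the paper's formal hypothesis tests:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def log_perplexity(text: str) -> float:
    """Mean negative log-likelihood per token under the scoring LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

TAU = 3.5  # illustrative threshold; calibrate on held-out data in practice

def classify(text: str) -> str:
    # Generated text tends to be scored as more likely (lower NLL) by
    # the model family that produced it, so low log-perplexity maps
    # to "generated" and high log-perplexity to "genuine".
    return "generated" if log_perplexity(text) < TAU else "genuine"

print(classify("The quick brown fox jumps over the lazy dog."))
```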