Related papers: The Effect of Scripts and Formats on LLM Numeracy

The Effect of Scripts and Formats on LLM Numeracy

URL: http://arxiv.org/abs/2601.15251v1
Date: Wed, 21 Jan 2026 18:33:15 GMT
Title: The Effect of Scripts and Formats on LLM Numeracy
Authors: Varshini Reddy, Craig W. Schmidt, Seth Ebner, Adam Wiemerslage, Yuval Pinter, Chris Tanner,
Abstract summary: Large language models (LLMs) have achieved impressive proficiency in basic arithmetic.<n>We investigate numerical reasoning across a wide range of numeral scripts and formats.<n>We show that LLM accuracy drops substantially when numerical inputs are rendered in underrepresented scripts or formats.
Score: 13.419936474311159
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have achieved impressive proficiency in basic arithmetic, rivaling human-level performance on standard numerical tasks. However, little attention has been given to how these models perform when numerical expressions deviate from the prevailing conventions present in their training corpora. In this work, we investigate numerical reasoning across a wide range of numeral scripts and formats. We show that LLM accuracy drops substantially when numerical inputs are rendered in underrepresented scripts or formats, despite the underlying mathematical reasoning being identical. We further demonstrate that targeted prompting strategies, such as few-shot prompting and explicit numeral mapping, can greatly narrow this gap. Our findings highlight an overlooked challenge in multilingual numerical reasoning and provide actionable insights for working with LLMs to reliably interpret, manipulate, and generate numbers across diverse numeral scripts and formatting styles.

Related papers

Unravelling the Mechanisms of Manipulating Numbers in Language Models [9.583581545538479]
We explore how language models manipulate numbers and quantify the lower bounds of accuracy of these mechanisms.<n>We find that despite surfacing errors, different language models learn interchangeable representations of numbers that are systematic, highly accurate and universal.<n>Our results lay a fundamental understanding of how pre-trained LLMs manipulate numbers and outline the potential of more accurate probing techniques.
arXiv Detail & Related papers (2025-10-30T09:08:50Z)
MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning [66.35003870737727]
A promising approach uses code as an intermediate representation to precisely express and manipulate the images in the reasoning steps.<n>Existing evaluations focus mainly on text-only reasoning outputs, leaving the MLLM's ability to perform accurate visual operations via code largely unexplored.<n>This work takes a first step toward addressing that gap by evaluating MLLM's code-based capabilities in multi-modal mathematical reasoning.
arXiv Detail & Related papers (2025-07-24T07:03:11Z)
Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs)<n>We find that fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy.<n>We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z)
Language Models Encode Numbers Using Digit Representations in Base 10 [12.913172023910203]
We show that large language models (LLMs) make errors when handling simple numerical problems.<n>LLMs internally represent numbers with individual circular representations per-digit in base 10.<n>This digit-wise representation sheds light on the error patterns of models on tasks involving numerical reasoning.
arXiv Detail & Related papers (2024-10-15T17:00:15Z)
Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts [50.40191599304911]
We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts. We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both. Our findings show that the effectiveness of transliteration varies by task type and model size.
arXiv Detail & Related papers (2024-07-02T14:51:20Z)
Identifying and Analyzing Performance-Critical Tokens in Large Language Models [52.404072802235234]
We study how large language models learn to perform tasks from demonstrations.<n>Our work sheds light on how large language models learn to perform tasks from demonstrations and deepens our understanding of the roles different types of tokens play in large language models.
arXiv Detail & Related papers (2024-01-20T20:55:21Z)
Language Models Encode the Value of Numbers Linearly [28.88044346200171]
We study how language models encode the value of numbers, a basic element in math. Experimental results support the existence of encoded number values in large language models. Our research provides evidence that LLMs encode the value of numbers linearly.
arXiv Detail & Related papers (2024-01-08T08:54:22Z)
AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations [52.43593893122206]
Alignedcot is an in-context learning technique for invoking Large Language Models. It achieves consistent and correct step-wise prompts in zero-shot scenarios. We conduct experiments on mathematical reasoning and commonsense reasoning.
arXiv Detail & Related papers (2023-11-22T17:24:21Z)
Zero-Shot Question Answering over Financial Documents using Large Language Models [0.18749305679160366]
We introduce a large language model (LLM) based approach to answer complex questions requiring multi-hop numerical reasoning over financial reports. We use novel zero-shot prompts that guide the LLM to encode the required reasoning into a Python program or a domain specific language.
arXiv Detail & Related papers (2023-11-19T16:23:34Z)
Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting [68.19544657508509]
Large language models (LLMs) are adopted as a fundamental component of language technologies. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt format in few-shot settings. We propose an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights.
arXiv Detail & Related papers (2023-10-17T15:03:30Z)
FERMAT: An Alternative to Accuracy for Numerical Reasoning [11.893004722079557]
numerical reasoning is measured using a single score on existing datasets. We introduce a multi-view evaluation set for numerical reasoning in English, called FERMAT. FerMAT evaluates models on various key numerical reasoning aspects such as number understanding, mathematical operations, and training dependency.
arXiv Detail & Related papers (2023-05-27T15:00:45Z)
Reflection of Thought: Inversely Eliciting Numerical Reasoning in Language Models via Solving Linear Systems [42.782260686177395]
We propose a novel method to elicit and exploit the numerical reasoning knowledge hidden in pre-trained language models. We first leverage simple numbers as anchors to probe the implicitly inferred arithmetic expressions from language models. We transform and formulate the task as an analytically solvable linear system.
arXiv Detail & Related papers (2022-10-11T00:57:19Z)
Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM)-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines. In this paper, we propose a different explanation: pre-trains succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics. Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.