Related papers: Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors

Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors

URL: http://arxiv.org/abs/2510.09536v1
Date: Fri, 10 Oct 2025 16:49:12 GMT
Title: Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors
Authors: Yihong Liu, Raoyuan Zhao, Lena Altinger, Hinrich Schütze, Michael A. Hedderich,
Abstract summary: Large language models (LLMs) are increasingly deployed in multilingual, real-world applications with user inputs.<n>Most benchmarks assume clean input, leaving the robustness of LLMs to typos largely underexplored.<n>We introduce MulTypo, a multilingual typo generation algorithm that simulates human-like errors based on language-specific keyboard layouts and typing behavior.
Score: 45.37878669586302
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are increasingly deployed in multilingual, real-world applications with user inputs -- naturally introducing typographical errors (typos). Yet most benchmarks assume clean input, leaving the robustness of LLMs to typos across languages largely underexplored. To address this gap, we introduce MulTypo, a multilingual typo generation algorithm that simulates human-like errors based on language-specific keyboard layouts and typing behavior. We evaluate 18 open-source LLMs across three model families and five downstream tasks spanning language inference, multi-choice question answering, mathematical reasoning, and machine translation tasks. Our results show that typos consistently degrade performance, particularly in generative tasks and those requiring reasoning -- while the natural language inference task is comparatively more robust. Instruction tuning improves clean-input performance but may increase brittleness under noise. We also observe language-dependent robustness: high-resource languages are generally more robust than low-resource ones, and translation from English is more robust than translation into English. Our findings underscore the need for noise-aware training and multilingual robustness evaluation. We make our code and data publicly available.

Related papers

Language Confusion Gate: Language-Aware Decoding Through Model Self-Distillation [50.93756215410832]
This paper introduces the Language Confusion Gate (LCG), a lightweight, plug-in solution that filters tokens during decoding.<n>The LCG is trained using norm-adjusted self-distillation to predict appropriate language families and apply masking only when needed.
arXiv Detail & Related papers (2025-10-20T14:02:37Z)
Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models? [59.970391602080205]
Despite multilingual training, LRMs tend to default to reasoning in high-resource languages at test time.<n>Cultural reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior.
arXiv Detail & Related papers (2025-05-23T02:46:18Z)
INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages [25.402797722575805]
Indic QA Benchmark is a dataset for context grounded question answering in 11 major Indian languages.<n> Evaluations revealed weak performance in low resource languages due to a strong English language bias in their training data.<n>We also investigated the Translate Test paradigm,where inputs are translated to English for processing and the results are translated back into the source language for output.
arXiv Detail & Related papers (2024-07-18T13:57:16Z)
Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.<n>We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.<n>We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
LMentry: A Language Model Benchmark of Elementary Language Tasks [39.71352171304755]
LMentry is a benchmark that focuses on a compact set of tasks that are trivial to humans. It provides insights into the capabilities and robustness of large language models.
arXiv Detail & Related papers (2022-11-03T18:01:12Z)
Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict. This work shows a comparison of a neural model and character language models with varying amounts on target language data. Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge. However, studies on LMs' factual representation ability have almost invariably been performed on English. We create a benchmark of cloze-style probes for 23 typologically diverse languages.
arXiv Detail & Related papers (2020-10-13T05:29:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.