Think Globally, Group Locally: Evaluating LLMs Using Multi-Lingual Word Grouping Games
- URL: http://arxiv.org/abs/2510.14030v1
- Date: Wed, 15 Oct 2025 19:12:43 GMT
- Title: Think Globally, Group Locally: Evaluating LLMs Using Multi-Lingual Word Grouping Games
- Authors: César Guerra-Solano, Zhuochun Li, Xiang Lorraine Li
- Abstract summary: We propose a task inspired by the New York Times game Connections: GlobalGroup, which evaluates models on an abstract reasoning task across several languages. We constructed a game benchmark with five linguistic backgrounds, in both the native language and an English translation for comparison. We find that English modalities largely lead to better performance on this abstract reasoning task, and we observe performance disparities between open- and closed-source models.
- Score: 4.924013532447991
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) can exhibit biases in reasoning capabilities due to linguistic modality, performing better on tasks in one language than in another, even with similar content. Most previous work evaluates this through reasoning tasks where reliance on strategies or knowledge can ensure success, such as commonsense or math tasks. However, abstract reasoning is vital to everyday reasoning, where people apply "out-of-the-box thinking" to identify and use patterns for solutions, without relying on formulaic approaches. Comparatively little work has evaluated linguistic biases in this task type. In this paper, we propose a task inspired by the New York Times game Connections: GlobalGroup, which evaluates models on an abstract reasoning task across several languages. We constructed a game benchmark with five linguistic backgrounds -- English, Spanish, Chinese, Hindi, and Arabic -- in both the native language and an English translation for comparison. We also proposed game difficulty measurements to evaluate models on games of similar difficulty, enabling a more controlled comparison, which is particularly important in reasoning evaluations. Through experimentation, we find that English modalities largely lead to better performance on this abstract reasoning task, and we observe performance disparities between open- and closed-source models.
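As a concrete illustration of the benchmark's setup, a Connections-style game asks a model to partition 16 words into four groups of four hidden categories. The sketch below shows one way such a prediction could be scored; the words, categories, and scoring rule are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch of scoring a Connections-style grouping game.
# The model must partition 16 words into 4 groups of 4; we score the
# fraction of predicted groups that exactly match a gold group.

def score_grouping(predicted, gold):
    """Count predicted groups that exactly match a gold group,
    ignoring word order within a group and the order of groups."""
    gold_sets = [frozenset(g) for g in gold]
    correct = sum(1 for g in predicted if frozenset(g) in gold_sets)
    return correct / len(gold)

# Illustrative Spanish-language puzzle (invented for this example).
gold = [
    {"rojo", "azul", "verde", "negro"},    # colors
    {"perro", "gato", "pez", "ave"},       # animals
    {"uno", "dos", "tres", "cuatro"},      # numbers
    {"norte", "sur", "este", "oeste"},     # directions
]
predicted = [
    ["azul", "rojo", "negro", "verde"],    # matches the color group
    ["perro", "gato", "pez", "uno"],       # mixed: no match
    ["dos", "tres", "cuatro", "ave"],      # mixed: no match
    ["norte", "sur", "este", "oeste"],     # matches the direction group
]
print(score_grouping(predicted, gold))  # 0.5
```

An exact-match rule like this is strict; a partial-credit variant could instead count correctly co-grouped word pairs.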
Related papers
- Beyond Translation: Evaluating Mathematical Reasoning Capabilities of LLMs in Sinhala and Tamil [1.0499611180329804]
Large language models (LLMs) demonstrate strong mathematical reasoning in English, but whether these capabilities reflect genuine multilingual reasoning or reliance on translation-based processing in low-resource languages like Sinhala and Tamil remains unclear. We evaluate four prominent large language models using a taxonomy of six math problem types.
arXiv Detail & Related papers (2026-02-16T07:08:37Z) - Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning [39.03934159726098]
M2A is a novel method that combines multi-scale multilingual alignment with language-consistency rewards on machine-translated questions. We introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark, together with reasoning traces in five languages. Our results show that M2A significantly enhances multilingual reasoning fidelity in both mathematical and factual reasoning tasks.
arXiv Detail & Related papers (2025-07-07T19:04:36Z) - MMATH: A Multilingual Benchmark for Mathematical Reasoning [94.05289799605957]
We introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. We observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue: generating responses in unintended languages. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models.
arXiv Detail & Related papers (2025-05-25T12:47:39Z) - Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models? [59.970391602080205]
Despite multilingual training, LRMs tend to default to reasoning in high-resource languages at test time. Cultural reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior.
arXiv Detail & Related papers (2025-05-23T02:46:18Z) - Crosslingual Reasoning through Test-Time Scaling [51.55526326294275]
We find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages. While English-centric RLMs' CoTs are naturally predominantly English, they consistently follow a quote-and-think pattern to reason about quoted non-English inputs. We observe poor out-of-domain reasoning generalization, in particular from STEM to cultural commonsense knowledge, even for English.
arXiv Detail & Related papers (2025-05-08T16:50:06Z) - Inductive Linguistic Reasoning with Large Language Models [0.0]
We investigate the ability of large language models to perform abstract multilingual reasoning through the lens of linguistic puzzles. We employ a two-stage procedure, first generating analogical exemplars with a language model and then applying them in-context. Our results on the modeLing dataset show that analogical prompting is effective in eliciting models' knowledge of grammatical similarities across languages.
arXiv Detail & Related papers (2024-12-09T03:37:11Z) - Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [68.33068005789116]
We introduce ReDial, a benchmark containing 1.2K+ parallel query pairs in Standardized English and AAVE. We evaluate widely used models, including the GPT, Claude, Llama, Mistral, and Phi model families. Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries.
arXiv Detail & Related papers (2024-10-14T18:44:23Z) - Improving Factuality and Reasoning in Language Models through Multiagent Debate [95.10641301155232]
We present a complementary approach to improve language responses where multiple language model instances propose and debate their individual responses and reasoning processes over multiple rounds to arrive at a common final answer.
Our findings indicate that this approach significantly enhances mathematical and strategic reasoning across a number of tasks.
Our approach may be directly applied to existing black-box models and uses identical procedure and prompts for all tasks we investigate.
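The debate procedure described above can be sketched as follows. This is a minimal illustration with stub agents: `query_model` stands in for a real LLM call, and the majority-convergence rule is an assumption for demonstration, not the paper's implementation.

```python
# Minimal sketch of multi-round multiagent debate with stub agents.
from collections import Counter

def query_model(agent_id, question, peer_answers):
    """Placeholder for an LLM call. A real implementation would prompt
    the model with the question plus the other agents' latest answers
    and reasoning; these stubs simply adopt the majority peer answer."""
    if peer_answers:
        return Counter(peer_answers).most_common(1)[0][0]
    return ["7", "8", "7", "7"][agent_id]  # independent first-round answers

def debate(question, n_agents=4, n_rounds=2):
    # Round 0: each agent answers independently.
    answers = [query_model(i, question, []) for i in range(n_agents)]
    # Debate rounds: each agent revises its answer after seeing its peers'.
    for _ in range(n_rounds):
        answers = [
            query_model(i, question,
                        [a for j, a in enumerate(answers) if j != i])
            for i in range(n_agents)
        ]
    # Final answer by majority vote.
    return Counter(answers).most_common(1)[0][0]

print(debate("What is 3 + 4?"))  # prints 7
```

Here the lone dissenting agent is pulled toward the majority within one round, mirroring how debate drives instances toward a common final answer.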
arXiv Detail & Related papers (2023-05-23T17:55:11Z) - Chain of Thought Prompting Elicits Reasoning in Large Language Models [56.811278668446825]
This paper explores the ability of language models to generate a coherent chain of thought.
Experiments show that inducing a chain of thought via prompting can enable sufficiently large language models to better perform reasoning tasks.
arXiv Detail & Related papers (2022-01-28T02:33:07Z) - Analyzing the Limits of Self-Supervision in Handling Bias in Language [52.26068057260399]
We evaluate how well language models capture the semantics of four tasks for bias: diagnosis, identification, extraction and rephrasing.
Our analyses indicate that language models are capable of performing these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation.
arXiv Detail & Related papers (2021-12-16T05:36:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.