mSCoRe: a $M$ultilingual and Scalable Benchmark for $S$kill-based $Co$mmonsense $Re$asoning
- URL: http://arxiv.org/abs/2508.10137v1
- Date: Wed, 13 Aug 2025 18:59:02 GMT
- Title: mSCoRe: a $M$ultilingual and Scalable Benchmark for $S$kill-based $Co$mmonsense $Re$asoning
- Authors: Nghia Trung Ngo, Franck Dernoncourt, Thien Huu Nguyen,
- Abstract summary: We propose a textbfMultilingual and Scalable Benchmark for textbfSkill-based textbfCommonsense textbfReasoning (textbfmSCoRe)<n>Our benchmark incorporates three key components that are designed to systematically evaluate LLM's reasoning capabilities.<n>Our results reveal the limitations of such reasoning-reinforced models when confronted with nuanced multilingual general and cultural commonsense.
- Score: 74.97363626515236
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in reasoning-reinforced Large Language Models (LLMs) have shown remarkable capabilities in complex reasoning tasks. However, the mechanism underlying their utilization of different human reasoning skills remains poorly investigated, especially for multilingual commonsense reasoning that involves everyday knowledge across different languages and cultures. To address this gap, we propose a \textbf{M}ultilingual and Scalable Benchmark for \textbf{S}kill-based \textbf{Co}mmonsense \textbf{Re}asoning (\textbf{mSCoRe}). Our benchmark incorporates three key components that are designed to systematically evaluate LLM's reasoning capabilities, including: (1) a novel taxonomy of reasoning skills that enables fine-grained analysis of models' reasoning processes, (2) a robust data synthesis pipeline tailored specifically for commonsense reasoning evaluation, and (3) a complexity scaling framework allowing task difficulty to scale dynamically alongside future improvements in LLM abilities. Extensive experiments on eights state-of-the-art LLMs of varying sizes and training approaches demonstrate that \textbf{mSCoRe} remains significantly challenging for current models, particularly at higher complexity levels. Our results reveal the limitations of such reasoning-reinforced models when confronted with nuanced multilingual general and cultural commonsense. We further provide detailed analysis on the models' reasoning processes, suggesting future directions for improving multilingual commonsense reasoning capabilities.
Related papers
- FutureMind: Equipping Small Language Models with Strategic Thinking-Pattern Priors via Adaptive Knowledge Distillation [13.855534865501369]
Small Language Models (SLMs) are attractive for cost-sensitive and resource-limited settings due to their efficient, low-latency inference.<n>We propose FutureMind, a modular reasoning framework that equips SLMs with strategic thinking-pattern priors.
arXiv Detail & Related papers (2026-02-01T13:26:04Z) - How Good are Foundation Models in Step-by-Step Embodied Reasoning? [79.15268080287505]
Embodied agents must make decisions that are safe, spatially coherent, and grounded in context.<n>Recent advances in large multimodal models have shown promising capabilities in visual understanding and language generation.<n>Our benchmark includes over 1.1k samples with detailed step-by-step reasoning across 10 tasks and 8 embodiments.
arXiv Detail & Related papers (2025-09-18T17:56:30Z) - Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models [44.94287386776289]
We identify textbfCross-lingual Collapse, a systematic drift in which a multilingual language model reverts to its dominant pre-training language.<n>Our experiments reveal three key findings: (i) GRPO rapidly amplifies pre-training language imbalances, leading to the erosion of low-resource languages within just a few hundred updates; (ii) language consistency reward mitigates this drift but does so at the expense of an almost 5 - 10 pp drop in accuracy.
arXiv Detail & Related papers (2025-06-06T08:08:48Z) - Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning [71.3533541927459]
We propose a novel data selection paradigm termed Activation Reasoning Potential (RAP)<n>RAP identifies cognitive samples by estimating each sample's potential to stimulate genuine multi-modal reasoning.<n>Our RAP method consistently achieves superior performance using only 9.3% of the training data, while reducing computational costs by over 43%.
arXiv Detail & Related papers (2025-06-05T08:40:24Z) - Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models [45.15161506154318]
Infi-MMR is a framework to systematically unlock the reasoning potential of Multimodal Small Language Models.<n>The first phase, Foundational Reasoning Activation, leverages high-quality textual reasoning datasets to activate and strengthen the model's logical reasoning capabilities.<n>The second phase, Cross-Modal Reasoning Adaptation, utilizes caption-augmented multimodal data to facilitate the progressive transfer of reasoning skills to multimodal contexts.<n>The third phase, Multimodal Reasoning Enhancement, employs curated, caption-free multimodal data to mitigate linguistic biases and promote robust cross-modal reasoning.
arXiv Detail & Related papers (2025-05-29T04:51:56Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Investigating Symbolic Capabilities of Large Language Models [16.88906206735967]
This study aims to bridge the gap by rigorously evaluating Large Language Models (LLMs) on a series of symbolic tasks.
Our analysis encompasses eight LLMs, including four enterprise-grade and four open-source models, of which three have been pre-trained on mathematical tasks.
The findings reveal a significant decline in LLMs' performance on context-free and context-sensitive symbolic tasks as the complexity, represented by the number of symbols, increases.
arXiv Detail & Related papers (2024-05-21T21:24:34Z) - FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z) - Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models [56.34029644009297]
Large language models (LLMs) have demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems.
LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning.
We study single-task training, multi-task training, and "chain-of-thought" knowledge distillation fine-tuning technique to assess the performance of model.
arXiv Detail & Related papers (2023-10-02T01:00:50Z) - Exploring Self-supervised Logic-enhanced Training for Large Language Models [59.227222647741094]
In this paper, we make the first attempt to investigate the feasibility of incorporating logical knowledge through self-supervised post-training.
We devise an auto-regressive objective variant of MERIt and integrate it with two LLM series, i.e., FLAN-T5 and LLaMA, with parameter size ranging from 3 billion to 13 billion.
The results on two challenging logical reasoning benchmarks demonstrate the effectiveness of LogicLLM.
arXiv Detail & Related papers (2023-05-23T06:13:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.