Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation
- URL: http://arxiv.org/abs/2508.08933v2
- Date: Mon, 13 Oct 2025 06:43:47 GMT
- Title: Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation
- Authors: Khondoker Ittehadul Islam, Gabriele Sarti
- Abstract summary: We introduce a manually translated Bangla multi-step reasoning dataset derived from the English Reveal dataset. We conduct a controlled evaluation of English-centric and Bangla-centric multilingual small language models. Our results show that, in comparable settings, reasoning context is beneficial for more challenging non-binary questions.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Language models have demonstrated remarkable performance on complex multi-step reasoning tasks. However, their evaluation has been predominantly confined to high-resource languages such as English. In this paper, we introduce a manually translated Bangla multi-step reasoning dataset derived from the English Reveal dataset, featuring both binary and non-binary question types. We conduct a controlled evaluation of English-centric and Bangla-centric multilingual small language models on the original dataset and our translated version to compare their ability to exploit relevant reasoning steps to produce correct answers. Our results show that, in comparable settings, reasoning context is beneficial for more challenging non-binary questions, but models struggle to employ relevant Bangla reasoning steps effectively. We conclude by exploring how reasoning steps contribute to models' predictions, highlighting different trends across models and languages.
Related papers
- Align to the Pivot: Dual Alignment with Self-Feedback for Multilingual Math Reasoning [71.4175109189942]
We present Pivot-Aligned Self-Feedback Multilingual Reasoning (PASMR). This approach designates the model's primary language as the pivot language. It establishes a cross-lingual self-feedback mechanism without relying on external correct answers or reward models.
arXiv Detail & Related papers (2026-01-25T03:20:00Z) - Aligning Multilingual Reasoning with Verifiable Semantics from a High-Resource Expert Model [13.788758077632432]
We introduce Pivot-Based Reinforcement Learning with Semantically Verifiable Rewards. This framework enhances multilingual reasoning by circumventing the need for human-annotated data in target languages. We show that our method significantly narrows the performance gap between English and other languages.
arXiv Detail & Related papers (2025-09-29T22:03:11Z) - Artificially Fluent: Swahili AI Performance Benchmarks Between English-Trained and Natively-Trained Datasets [0.0]
This study compares two monolingual BERT models: one trained and tested entirely on Swahili data, and another on comparable English news data. This approach tests the hypothesis by evaluating whether translating Swahili inputs for evaluation on an English model yields better or worse performance compared to training and testing a model entirely in Swahili. The results show that, despite high-quality translation, the native Swahili-trained model performed better than the Swahili-to-English translated model, producing roughly four times fewer errors: 0.36% vs. 1.47%, respectively.
arXiv Detail & Related papers (2025-09-03T03:25:11Z) - The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z) - Türkçe Dil Modellerinin Performans Karşılaştırması Performance Comparison of Turkish Language Models [0.0]
A comparison is made among seven selected language models based on their contextual learning and question-answering abilities.
The results show that for question-answering, continuing pretraining before fine-tuning with instructional datasets is more successful in adapting multilingual models to Turkish.
arXiv Detail & Related papers (2024-04-25T20:10:14Z) - SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning [44.53966523376327]
SeaEval is a benchmark for multilingual foundation models.
We characterize how these models understand and reason with natural language.
We also investigate how well they comprehend cultural practices, nuances, and values.
arXiv Detail & Related papers (2023-09-09T11:42:22Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language
Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and
Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z) - AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work presents a comparison of a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.