Related papers: Leveraging Large Language Models for Bengali Math Word Problem Solving with Chain of Thought Reasoning

URL: http://arxiv.org/abs/2505.21354v2
Date: Wed, 30 Jul 2025 03:20:16 GMT
Title: Leveraging Large Language Models for Bengali Math Word Problem Solving with Chain of Thought Reasoning
Authors: Bidyarthi Paul, Jalisha Jashim Era, Mirazur Rahman Zim, Tahmid Sattar Aothoi, Faisal Muhammad Shah
Abstract summary: Solving Bengali Math Word Problems (MWPs) remains a major challenge in natural language processing (NLP). No human-annotated Bengali dataset has previously addressed this task. We created SOMADHAN, a dataset of 8792 complex Bengali MWPs with manually written, step-by-step solutions.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Solving Bengali Math Word Problems (MWPs) remains a major challenge in natural language processing (NLP) due to the language's low-resource status and the multi-step reasoning required. Existing models struggle with complex Bengali MWPs, largely because no human-annotated Bengali dataset has previously addressed this task. This gap has limited progress in Bengali mathematical reasoning. To address this, we created SOMADHAN, a dataset of 8792 complex Bengali MWPs with manually written, step-by-step solutions. We designed this dataset to support reasoning-focused evaluation and model development in a linguistically underrepresented context. Using SOMADHAN, we evaluated a range of large language models (LLMs) - including GPT-4o, GPT-3.5 Turbo, LLaMA series models, Deepseek, and Qwen - through both zero-shot and few-shot prompting with and without Chain of Thought (CoT) reasoning. CoT prompting consistently improved performance over standard prompting, especially in tasks requiring multi-step logic. LLaMA-3.3 70B achieved the highest accuracy of 88% with few-shot CoT prompting. We also applied Low-Rank Adaptation (LoRA) to fine-tune models efficiently, enabling them to adapt to Bengali MWPs with minimal computational cost. Our work fills a critical gap in Bengali NLP by providing a high-quality reasoning dataset and a scalable framework for solving complex MWPs. We aim to advance equitable research in low-resource languages and enhance reasoning capabilities in educational and language technologies.
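The abstract contrasts standard prompting with Chain of Thought (CoT) prompting in zero-shot and few-shot settings. A minimal sketch of how such prompts could be assembled is given here. The `Example` class, the `build_prompt` function, the "Question/Step/Answer" template, and the English placeholder problem are all assumptions made for illustration; they are not the authors' actual prompt format, and no model is called.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One demonstration problem (hypothetical structure, not the SOMADHAN schema)."""
    question: str      # problem statement (Bengali text in the real dataset)
    steps: list[str]   # manually written reasoning steps
    answer: str        # final answer


def build_prompt(examples: list[Example], question: str, use_cot: bool = True) -> str:
    """Assemble a prompt string.

    An empty `examples` list gives a zero-shot prompt; a non-empty list gives
    a few-shot prompt. With use_cot=False the worked steps are left out, which
    corresponds to standard (non-CoT) prompting.
    """
    parts = []
    for ex in examples:
        parts.append(f"Question: {ex.question}")
        if use_cot:
            for i, step in enumerate(ex.steps, start=1):
                parts.append(f"Step {i}: {step}")
        parts.append(f"Answer: {ex.answer}")
        parts.append("")  # blank line between demonstrations
    parts.append(f"Question: {question}")
    # Cue the model to reason first (CoT) or to answer directly (standard).
    parts.append("Step 1:" if use_cot else "Answer:")
    return "\n".join(parts)


if __name__ == "__main__":
    demo = Example(
        question="A shop sells 3 pens for 15 taka. How much do 7 pens cost?",
        steps=["One pen costs 15 / 3 = 5 taka.", "Seven pens cost 7 * 5 = 35 taka."],
        answer="35",
    )
    print(build_prompt([demo], "A bus travels 60 km in 2 hours. How far does it go in 5 hours?"))
```

Passing `use_cot=False` with the same demonstrations produces the standard few-shot prompt, so the two settings compared in the abstract differ only in whether the worked steps are shown.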

Related papers

From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics [79.81905350372067]
We study this gap through contextual mathematical reasoning. We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings. Open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20.
arXiv Detail & Related papers (2026-01-30T14:56:04Z)
GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO [0.0]
We present a Bengali mathematical reasoning model called GanitLLM. We also present a new difficulty-aware Bengali math corpus and a curriculum-based GRPO pipeline.
arXiv Detail & Related papers (2026-01-11T03:49:18Z)
Structured Reasoning with Tree-of-Thoughts for Bengali Math Word Problems [0.0]
Chain-of-Thought (CoT) prompting has shown promise, but its linear structure often propagates errors. We present a systematic study of Tree-of-Thought (ToT) reasoning for Bengali MWPs using the SOMADHAN dataset.
arXiv Detail & Related papers (2025-12-05T10:07:08Z)
BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali [0.0]
We present BengaliFig, a compact yet richly annotated challenge set. The dataset contains 435 unique riddles drawn from Bengali oral and literary traditions. Each item is annotated along five dimensions capturing reasoning type, trap type, cultural depth, answer category, and difficulty.
arXiv Detail & Related papers (2025-11-25T15:26:47Z)
HeQ: a Large and Diverse Hebrew Reading Comprehension Benchmark [54.73504952691398]
We set out to deliver a Hebrew Machine Reading dataset as extractive Questioning. The morphologically rich nature of Hebrew poses a challenge to this endeavor. We devise a novel set of guidelines, a controlled crowdsourcing protocol, and revised evaluation metrics.
arXiv Detail & Related papers (2025-08-03T15:53:01Z)
Evaluating LLMs' Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis [0.0]
Bengali is an underrepresented language in NLP research. We systematically investigate the challenges that hinder Bengali NLP performance. Our findings reveal consistent performance gaps for Bengali compared to English.
arXiv Detail & Related papers (2025-07-31T05:16:43Z)
BnMMLU: Measuring Massive Multitask Language Understanding in Bengali [0.0]
We introduce BnMMLU, a benchmark to evaluate the Bengali language understanding capabilities of language models. The dataset spans 23 domains, including science, humanities, mathematics and general knowledge. We benchmark several proprietary and open-source large language models (LLMs) on the BnMMLU test set.
arXiv Detail & Related papers (2025-05-25T02:54:31Z)
Demystifying Multilingual Chain-of-Thought in Process Reward Modeling [71.12193680015622]
We tackle the challenge of extending process reward models (PRMs) to multilingual settings. We train multilingual PRMs on a dataset spanning seven languages, which is translated from English. Our results highlight the sensitivity of multilingual PRMs to both the number of training languages and the volume of English data.
arXiv Detail & Related papers (2025-02-18T09:11:44Z)
Empowering Bengali Education with AI: Solving Bengali Math Word Problems through Transformer Models [0.0]
This paper develops an innovative approach to solving Bengali MWPs using transformer-based models. To support this effort, the "PatiGonit" dataset was introduced, containing 10,000 Bengali math problems. The evaluation revealed that the mT5 model achieved the highest accuracy of 97.30%.
arXiv Detail & Related papers (2025-01-05T16:50:55Z)
Too Late to Train, Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs [2.309018557701645]
We aim to explore the question of whether there is a need for English-oriented Large Language Models dedicated to a low-resource language. We compare the performance of open-weight and closed-source LLMs against fine-tuned encoder-decoder models. Our findings reveal that while LLMs generally excel in reasoning tasks, their performance in tasks requiring Bengali script generation is inconsistent.
arXiv Detail & Related papers (2024-06-29T11:50:16Z)
The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance. Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes. We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
Simultaneous Machine Translation with Large Language Models [51.470478122113356]
We investigate the possibility of applying Large Language Models to SimulMT tasks. We conducted experiments using the Llama2-7b-chat model on nine different languages from the MUST-C dataset. The results show that LLM outperforms dedicated MT models in terms of BLEU and LAAL metrics.
arXiv Detail & Related papers (2023-09-13T04:06:47Z)
PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PaL) to understand natural language problems. PaL offloads the solution step to a programmatic runtime such as a Python interpreter. We set new state-of-the-art results in all 12 benchmarks.
arXiv Detail & Related papers (2022-11-18T18:56:13Z)
Chain of Thought Prompting Elicits Reasoning in Large Language Models [56.811278668446825]
This paper explores the ability of language models to generate a coherent chain of thought. Experiments show that inducing a chain of thought via prompting can enable sufficiently large language models to better perform reasoning tasks.
arXiv Detail & Related papers (2022-01-28T02:33:07Z)
Generate & Rank: A Multi-task Framework for Math Word Problems [48.99880318686938]
Math word problem (MWP) is a challenging and critical task in natural language processing. We propose Generate & Rank, a framework based on a generative pre-trained language model. By joint training with generation and ranking, the model learns from its own mistakes and is able to distinguish between correct and incorrect expressions.
arXiv Detail & Related papers (2021-09-07T12:21:49Z)
A Continuous Space Neural Language Model for Bengali Language [0.4799822253865053]
This paper proposes a continuous-space neural language model, or more specifically an ASGD weight dropped LSTM language model, along with techniques to efficiently train it for Bengali Language. The proposed architecture outperforms its counterparts by achieving an inference perplexity as low as 51.2 on the held out data set for Bengali.
arXiv Detail & Related papers (2020-01-11T14:50:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.