Structured Reasoning with Tree-of-Thoughts for Bengali Math Word Problems
- URL: http://arxiv.org/abs/2512.05580v1
- Date: Fri, 05 Dec 2025 10:07:08 GMT
- Title: Structured Reasoning with Tree-of-Thoughts for Bengali Math Word Problems
- Authors: Aurprita Mahmood, Sabrin Alam, Neloy Kumer Sagor, Md. Abdul Hadi, Md. Sehab Al Islam, Minhajul Islam
- Abstract summary: Chain-of-Thought (CoT) prompting has shown promise, but its linear structure often propagates errors. We present a systematic study of Tree-of-Thought (ToT) reasoning for Bengali MWPs using the SOMADHAN dataset.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mathematical Word Problems (MWPs) are among the most challenging tasks in natural language processing because they require both linguistic understanding and multi-step numerical reasoning. While Chain-of-Thought (CoT) prompting has shown promise, its linear structure often propagates errors, limiting overall effectiveness. To address this limitation, we present a systematic study of Tree-of-Thought (ToT) reasoning for Bengali MWPs using the SOMADHAN dataset. Owing to computational and token-cost constraints, we evaluate a curated set of 100 representative problems across multiple large language models (LLMs), including GPT-OSS and LLaMA variants, under standard prompting, CoT, and ToT strategies. Our results show that CoT improves baseline accuracy from 78% (standard prompting) to 83% on average, while ToT further increases performance by up to 5 percentage points, achieving 88% accuracy with GPT-OSS-120B. These improvements highlight that ToT is particularly effective in medium-to-large-scale models but may offer less advantage for smaller ones. Overall, our findings establish ToT as a robust framework for solving mathematical problems in low-resource languages such as Bengali. More broadly, this study shows that structured reasoning methods like ToT can provide more reliable and globally consistent outcomes than CoT, paving the way for better reasoning strategies in multilingual NLP.
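Since the paper describes prompting strategies rather than released code, the following is only a minimal Python sketch of how a breadth-limited Tree-of-Thoughts search over reasoning steps can be organised. The names `tot_solve`, `generate_fn`, `score_fn`, and the toy stand-ins are hypothetical and not the authors' implementation, and the search parameters (breadth 3, depth 4, keep 2) are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Thought:
    steps: list          # partial reasoning chain (list of strings)
    score: float         # heuristic plausibility of this partial chain

def tot_solve(problem, generate_fn, score_fn, breadth=3, depth=4, keep=2):
    """Breadth-limited Tree-of-Thoughts search: expand each partial chain into
    several candidate next steps, score each, and keep only the best `keep`
    chains at every level (unlike CoT, which commits to one linear chain)."""
    frontier = [Thought(steps=[], score=0.0)]
    for _ in range(depth):
        candidates = []
        for node in frontier:
            context = problem + "\n" + "\n".join(node.steps)
            # Propose `breadth` alternative next reasoning steps from the LLM.
            for step in generate_fn(f"{context}\nNext step:", n=breadth):
                chain = node.steps + [step]
                candidates.append(Thought(chain, score_fn(problem, chain)))
        # Prune globally: only the most promising partial chains survive.
        frontier = sorted(candidates, key=lambda t: t.score, reverse=True)[:keep]
    best = max(frontier, key=lambda t: t.score)
    return "\n".join(best.steps)

# Toy stand-ins so the sketch runs without an API; swap in real LLM calls.
def fake_generate(prompt, n=3):
    return [f"candidate step {i}" for i in range(n)]

def fake_score(problem, chain):
    return float(-len(chain))  # placeholder heuristic, not a real evaluator

if __name__ == "__main__":
    bengali_mwp = "একটি ঝুড়িতে ১২টি আম আছে। ৫টি আম খাওয়া হলে কতটি বাকি থাকে?"
    print(tot_solve(bengali_mwp, fake_generate, fake_score))
```

The contrast with CoT is visible in the pruning step: several partial chains are kept and compared at every level, so a single early arithmetic slip does not force the whole solution down a wrong path.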
Related papers
- Once Upon an Input: Reasoning via Per-Instance Program Synthesis [19.86168542588911]
We introduce Per-Instance Program Synthesis (PIPS), a method that generates and refines programs at the instance level using structural feedback. To further improve performance, PIPS incorporates a confidence metric that dynamically chooses between direct inference and program synthesis on a per-instance basis.
arXiv Detail & Related papers (2025-10-26T21:58:33Z) - The Digital Sous Chef -- A Comparative Study on Fine-Tuning Language Models for Recipe Generation [2.497854684676663]
We present a comprehensive study contrasting a fine-tuned GPT-2 large (774M) model against the GPT-2 small (124M) model and traditional LSTM/RNN baselines on the 5-cuisine corpus from RecipeDB. Our key contribution is a targeted tokenization strategy that augments the vocabulary with 23 common fraction tokens and custom structural markers.
arXiv Detail & Related papers (2025-08-20T13:53:13Z) - Leveraging Large Language Models for Bengali Math Word Problem Solving with Chain of Thought Reasoning [0.0]
Solving Bengali Math Word Problems (MWPs) remains a major challenge in natural language processing (NLP). No human-annotated Bengali dataset has previously addressed this task. We created SOMADHAN, a dataset of 8792 complex Bengali MWPs with manually written, step-by-step solutions.
arXiv Detail & Related papers (2025-05-27T15:47:10Z) - Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning [122.81815833343026]
We introduce Point-RFT, a multimodal reasoning framework explicitly designed to leverage visually grounded CoT reasoning for visual document understanding. Our approach consists of two stages: first, we conduct format finetuning using a curated dataset of 71K diverse visual reasoning problems, each annotated with detailed, step-by-step rationales explicitly grounded to corresponding visual elements. On ChartQA, our approach improves accuracy from 70.88% (language-finetuned baseline) to 90.04%, surpassing the 83.92% accuracy achieved by reinforcement finetuning relying solely on text-based CoT.
arXiv Detail & Related papers (2025-05-26T08:54:14Z) - Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching [64.74765550805024]
Chain-of-Thought prompting elicits step-by-step problem solving, but often at the cost of excessive verbosity in intermediate outputs. We propose Sketch-of-Thought (SoT), a prompting framework that integrates cognitively inspired reasoning paradigms with linguistic constraints. SoT achieves token reductions of up to 84% with minimal accuracy loss across 18 reasoning datasets.
arXiv Detail & Related papers (2025-03-07T06:57:17Z) - Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective [98.29190911211053]
Chain-of-Reasoning (CoR) is a novel unified framework integrating multiple reasoning paradigms. CoR generates multiple potential answers via different reasoning paradigms and synthesizes them into a coherent final solution. Experimental results demonstrate that CoR-Math-7B significantly outperforms current SOTA models.
arXiv Detail & Related papers (2025-01-19T16:53:26Z) - The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z) - Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems [50.76385564061713]
Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks. CoT usually suffers from three pitfalls: semantic misunderstanding errors, calculation errors, and step-missing errors. We propose Deeply Understanding the Problems (DUP) to improve the LLMs' math problem-solving ability by addressing semantic misunderstanding errors.
arXiv Detail & Related papers (2024-04-23T12:16:05Z) - Look Before You Leap: Problem Elaboration Prompting Improves Mathematical Reasoning in Large Language Models [15.65204261844768]
We propose a new approach named Problem Elaboration Prompting (PEP) to enhance the mathematical capacities of large language models (LLMs).
PEP decomposes and elucidates the problem context before reasoning, thereby enhancing context modeling and parsing efficiency.
arXiv Detail & Related papers (2024-02-24T08:40:30Z) - How Do Humans Write Code? Large Models Do It the Same Way Too [14.954886191356342]
Program-of-Thought (PoT) has replaced natural language-based Chain-of-Thought (CoT) as the most popular method in Large Language Models.
Using PoT introduces more reasoning errors, such as incorrect formulas or flawed logic, compared to CoT.
We propose Human-Think Language (HTL), which leverages a suite of strategies that help integrate PoT and CoT.
arXiv Detail & Related papers (2024-02-24T05:40:01Z) - Resprompt: Residual Connection Prompting Advances Multi-Step Reasoning in Large Language Models [73.4425450752596]
Chain-of-thought (CoT) prompting has impressively unlocked the reasoning potential of large language models (LLMs).
Yet, the standard CoT is less effective in problems demanding multiple reasoning steps.
We propose RESPROMPT, a new prompting strategy that advances multi-step reasoning in LLMs.
arXiv Detail & Related papers (2023-10-07T08:56:28Z)