Token-by-Token Regeneration and Domain Biases: A Benchmark of LLMs on Advanced Mathematical Problem-Solving
- URL: http://arxiv.org/abs/2501.17084v1
- Date: Tue, 28 Jan 2025 17:11:36 GMT
- Title: Token-by-Token Regeneration and Domain Biases: A Benchmark of LLMs on Advanced Mathematical Problem-Solving
- Authors: Evgenii Evstafev
- Abstract summary: This study evaluates 10 large language models (LLMs) with 7 to 8 billion parameters using the MATH dataset. The focus is on their ability to generate executable Python code as a step in their reasoning process, involving over 9,450 code executions.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) excel in many natural language tasks, yet they struggle with complex mathematical problem-solving, particularly in symbolic reasoning and maintaining consistent output. This study evaluates 10 LLMs with 7 to 8 billion parameters using 945 competition-level problems from the MATH dataset. The focus is on their ability to generate executable Python code as a step in their reasoning process, involving over 9,450 code executions. The research introduces an evaluation framework using mistral-large-2411 to rate answers on a 5-point scale, which helps address inconsistencies in mathematical notation. It also examines the impact of regenerating output token-by-token on refining results. The findings reveal a significant 34.5% performance gap between the top commercial model (gpt-4o-mini, scoring 83.7%) and the least effective open-source model (open-codestral-mamba:v0.1, scoring 49.2%). This disparity is especially noticeable in complex areas like Number Theory. While token-by-token regeneration slightly improved accuracy (+0.8%) for the model llama3.1:8b, it also reduced code execution time by 36.7%, highlighting a trade-off between efficiency and precision. The study also noted a consistent trend where harder problems correlated with lower accuracy across all models. Despite using controlled execution environments, less than 1% of the generated code was unsafe, and 3.17% of problems remained unsolved after 10 attempts, suggesting that hybrid reasoning methods may be beneficial.
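As a rough illustration of the pipeline the abstract describes, the sketch below runs model-generated Python in a subprocess and asks a judge model to grade the printed answer on a 1-to-5 scale. The `model` and `judge` callables, the prompts, and the parsing of the rating are placeholders and are not taken from the paper.

```python
import subprocess
import sys
import tempfile

def run_candidate_code(code: str, timeout_s: int = 30) -> str:
    """Execute model-generated Python in a separate process and capture stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout_s
        )
        return result.stdout.strip()
    except subprocess.TimeoutExpired:
        return ""

def evaluate_problem(problem: str, reference: str, model, judge) -> int:
    """One benchmark step: code generation -> execution -> 1-to-5 judge rating.

    `model` and `judge` are placeholder callables returning text; `judge` plays
    the role of the mistral-large-2411 grader and is asked to reply with a digit.
    """
    code = model(f"Write Python code that prints the final answer to:\n{problem}")
    answer = run_candidate_code(code)
    rating = judge(
        f"Problem: {problem}\nReference answer: {reference}\n"
        f"Candidate answer: {answer}\nRate their equivalence from 1 to 5; reply with a single digit."
    )
    return int(rating.strip()[0])
```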
Related papers
- Learning Adaptive Parallel Reasoning with Language Models [70.1745752819628]
We propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end.
APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations.
A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures.
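A minimal sketch of the spawn()/join() control flow mentioned above, using Python threads; the `reason` function stands in for a language-model call and none of this is the authors' implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def reason(prompt: str) -> str:
    """Placeholder for a language-model call returning a partial solution."""
    return f"partial result for: {prompt}"

def spawn(executor: ThreadPoolExecutor, subtasks: list[str]):
    """Launch child inference threads for independent subtasks."""
    return [executor.submit(reason, task) for task in subtasks]

def join(futures) -> list[str]:
    """Block until all child threads finish and collect their outputs."""
    return [f.result() for f in futures]

with ThreadPoolExecutor(max_workers=4) as pool:
    children = spawn(pool, ["case x > 0", "case x = 0", "case x < 0"])
    merged = join(children)
    final = reason("combine: " + " | ".join(merged))  # parent thread continues serially
```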
arXiv Detail & Related papers (2025-04-21T22:29:02Z) - Code Generation with Small Language Models: A Deep Evaluation on Codeforces [2.314213846671956]
Small Language Models (SLMs) offer faster inference, lower deployment overhead, and better adaptability to domain-specific tasks.
We benchmark five open SLMs across 280 Codeforces problems spanning Elo ratings from 800 to 2100.
PHI-4 14B achieved the best performance among SLMs, with a pass@3 of 63.6%.
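The pass@3 figure presumably follows the standard unbiased pass@k estimator; a small helper for that estimator is shown below (whether this paper uses exactly this formula is an assumption).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples
    drawn from n generations (c of them correct) solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 generations per problem, 4 of them correct -> pass@3
print(round(pass_at_k(10, 4, 3), 3))
```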
arXiv Detail & Related papers (2025-04-09T23:57:44Z) - Fully Autonomous Programming using Iterative Multi-Agent Debugging with Large Language Models [8.70160958177614]
Program synthesis with Large Language Models (LLMs) suffers from a "near-miss syndrome".
We address this with a multi-agent framework called Synthesize, Execute, Instruct, Debug, and Repair (SEIDR).
We empirically explore these trade-offs by comparing replace-focused, repair-focused, and hybrid debug strategies.
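A schematic version of such a synthesize-execute-instruct-debug-repair loop might look like the following; `synthesize`, `instruct`, and `repair` are placeholder LLM agents, and the control flow is a sketch rather than the SEIDR implementation.

```python
def seidr_loop(spec: str, tests, synthesize, instruct, repair, max_rounds: int = 5) -> str:
    """Schematic SEIDR-style loop: synthesize a program, execute it against
    tests, turn failures into instructions, and repair or replace it."""
    program = synthesize(spec)
    for _ in range(max_rounds):
        failures = [t for t in tests if not t(program)]  # execute against test cases
        if not failures:
            return program                               # all tests pass
        feedback = instruct(spec, program, failures)     # describe what went wrong
        program = repair(program, feedback)              # debug/repair step
    return program
```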
arXiv Detail & Related papers (2025-03-10T16:56:51Z) - Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [53.51560766150442]
Critical tokens are elements within reasoning trajectories that significantly influence incorrect outcomes. We present a novel framework for identifying these tokens through rollout sampling. We show that identifying and replacing critical tokens significantly improves model accuracy.
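One way to realize rollout sampling for critical-token identification is sketched below: truncate the trajectory after each token, sample several continuations, and look for sharp drops in success rate. The `continue_fn` and `is_correct` hooks are hypothetical, not the paper's interface.

```python
def critical_token_scores(trajectory: list[str], continue_fn, is_correct, n_rollouts: int = 8):
    """Estimate, for each prefix length, the success rate of sampled continuations.
    A large drop after adding a token marks that token as critical."""
    scores = []
    for i in range(1, len(trajectory) + 1):
        prefix = trajectory[:i]
        wins = sum(is_correct(continue_fn(prefix)) for _ in range(n_rollouts))
        scores.append(wins / n_rollouts)
    # drop in success rate attributable to each added token
    return [scores[i - 1] - scores[i] for i in range(1, len(scores))]
```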
arXiv Detail & Related papers (2024-11-29T18:58:22Z) - Preference Optimization for Reasoning with Pseudo Feedback [100.62603571434167]
We introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions as an evaluation against associated test cases.
We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks.
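A minimal sketch of pseudo feedback from test cases used to build preference pairs follows; the 0.5 score margin and the pairing rule are illustrative choices, not details from the paper.

```python
def pseudo_label(candidate, tests) -> float:
    """Pseudo feedback: fraction of associated test cases the candidate passes."""
    return sum(1 for t in tests if t(candidate)) / len(tests)

def build_preference_pairs(candidates, tests, margin: float = 0.5):
    """Form (preferred, rejected) pairs whenever the pseudo-feedback scores
    differ by at least `margin`."""
    scored = sorted(
        ((pseudo_label(c, tests), c) for c in candidates),
        key=lambda x: x[0],
        reverse=True,
    )
    return [
        (hi, lo)
        for s_hi, hi in scored
        for s_lo, lo in scored
        if s_hi - s_lo >= margin
    ]
```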
arXiv Detail & Related papers (2024-11-25T12:44:02Z) - Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning [24.386388107656334]
We introduce Prove, a framework that leverages translated programs derived from natural language solutions as a verification mechanism. Unlike vanilla majority voting, our approach filters out solutions whose corresponding program output is inconsistent with the generated solution, aggregating only those that pass verification. Our results show that Prove consistently outperforms vanilla majority voting for solving mathematical reasoning tasks across all model sizes and datasets.
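A compact sketch of this program-as-verifier filtering followed by majority voting; `run_program` is a placeholder for executing the translated program, and the fallback to plain voting when nothing verifies is an assumption.

```python
from collections import Counter

def prove_style_vote(answers, programs, run_program):
    """Keep only answers whose translated program reproduces the same result,
    then majority-vote over the survivors."""
    verified = [
        answer
        for answer, program in zip(answers, programs)
        if run_program(program) == answer
    ]
    if not verified:            # nothing passed verification: fall back to plain voting
        verified = answers
    return Counter(verified).most_common(1)[0][0]
```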
arXiv Detail & Related papers (2024-10-16T14:24:55Z) - Subtle Errors Matter: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE).
RISE injects predefined subtle errors into pivotal tokens in reasoning steps to construct hard pairs for error mitigation.
Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples.
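The pair-construction idea can be illustrated with a toy error injector that perturbs a single number in a correct solution; RISE itself targets pivotal tokens with predefined edits, so this is only a schematic example of how a hard (chosen, rejected) pair is built.

```python
import re

def inject_subtle_error(correct_solution: str) -> str:
    """Toy error injection: bump the first number in the solution by one.
    This stands in for RISE's targeted edits and is purely illustrative."""
    def bump(match: re.Match) -> str:
        return str(int(match.group()) + 1)
    return re.sub(r"\d+", bump, correct_solution, count=1)

chosen = "The train travels 3 hours at 40 km/h, so it covers 120 km."
rejected = inject_subtle_error(chosen)  # "... 4 hours at 40 km/h, so it covers 120 km."
preference_pair = (chosen, rejected)
```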
arXiv Detail & Related papers (2024-10-09T07:43:38Z) - Uncovering Weaknesses in Neural Code Generation [21.552898575210534]
We assess the quality of generated code using match-based and execution-based metrics, then conduct thematic analysis to develop a taxonomy of nine types of weaknesses.
In the CoNaLa dataset, inaccurate prompts are a notable problem, causing all large models to fail in 26.84% of cases.
Missing pivotal semantics is a pervasive issue across benchmarks, with one or more large models omitting key semantics in 65.78% of CoNaLa tasks.
All models struggle with proper API usage, a challenge amplified by vague or complex prompts.
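For context, the two metric families mentioned above can be sketched as follows; the test-case format here is a hypothetical (expression, expected value) pair rather than the benchmarks' actual harness.

```python
def exact_match(generated: str, reference: str) -> bool:
    """Match-based metric: whitespace-normalized string equality of code."""
    return " ".join(generated.split()) == " ".join(reference.split())

def execution_match(generated: str, test_cases) -> bool:
    """Execution-based metric: the generated code must satisfy every test case."""
    env: dict = {}
    try:
        exec(generated, env)   # define functions/variables from the generation
        return all(eval(expr, env) == expected for expr, expected in test_cases)
    except Exception:
        return False
```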
arXiv Detail & Related papers (2024-07-13T07:31:43Z) - LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback [71.95402654982095]
We propose Math-Minos, a natural language feedback-enhanced verifier.
Our experiments reveal that a small set of natural language feedback can significantly boost the performance of the verifier.
arXiv Detail & Related papers (2024-06-20T06:42:27Z) - Common 7B Language Models Already Possess Strong Math Capabilities [61.61442513067561]
This paper shows that the LLaMA-2 7B model with common pre-training already exhibits strong mathematical abilities.
The potential for extensive scaling is constrained by the scarcity of publicly available math questions.
arXiv Detail & Related papers (2024-03-07T18:00:40Z) - Orca-Math: Unlocking the potential of SLMs in Grade School Math [10.206509967833664]
A recent study hypothesized that the smallest model size needed to achieve over 80% accuracy on the GSM8K benchmark is 34 billion parameters.
To reach this level of performance with smaller models, researchers often train SLMs to generate Python code or use tools to help avoid calculation errors.
Our approach has the following key elements: a high-quality synthetic dataset of 200K math problems created using a multi-agent setup in which agents collaborate to create the data.
arXiv Detail & Related papers (2024-02-16T23:44:38Z) - Cumulative Reasoning with Large Language Models [12.267474250936123]
Cumulative Reasoning (CR) is a novel approach that utilizes language models cumulatively and iteratively.
We demonstrate CR's superiority through several complex reasoning tasks.
CR sets new state-of-the-art on the MATH dataset.
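A loose sketch of a cumulative, iterative loop in the spirit of CR: propose an intermediate proposition, keep it only if it is verified, and accumulate the verified context. The `propose`, `verify`, and `answer` roles are placeholder callables and the control flow is simplified relative to the paper.

```python
def cumulative_reasoning(question: str, propose, verify, answer, max_steps: int = 10):
    """Schematic CR-style loop over placeholder LLM roles."""
    context: list[str] = []
    for _ in range(max_steps):
        step = propose(question, context)
        if verify(question, context, step):
            context.append(step)          # cumulatively grow the verified premises
        final = answer(question, context)  # returns None until an answer is reached
        if final is not None:
            return final
    return answer(question, context)
```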
arXiv Detail & Related papers (2023-08-08T16:18:20Z) - PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PAL), which read natural language problems and generate programs as intermediate reasoning steps.
PAL offloads the solution step to a programmatic runtime such as a Python interpreter.
We set new state-of-the-art results in all 12 benchmarks.
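The program-aided pattern can be sketched in a few lines: the model writes Python that assigns its result to `answer`, and the interpreter, not the model, performs the computation. The prompt wording and the `llm` callable are placeholders, not the paper's exact setup.

```python
def pal_answer(question: str, llm) -> object:
    """Program-aided pattern: generate code, then let Python do the arithmetic."""
    prompt = (
        "Write Python code that solves the problem and assigns the final "
        f"result to a variable named answer.\n\nProblem: {question}"
    )
    code = llm(prompt)
    scope: dict = {}
    exec(code, scope)          # offload the solution step to the Python runtime
    return scope["answer"]
```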
arXiv Detail & Related papers (2022-11-18T18:56:13Z)