Improving Large Language Model Fine-tuning for Solving Math Problems
- URL: http://arxiv.org/abs/2310.10047v1
- Date: Mon, 16 Oct 2023 04:11:19 GMT
- Title: Improving Large Language Model Fine-tuning for Solving Math Problems
- Authors: Yixin Liu, Avi Singh, C. Daniel Freeman, John D. Co-Reyes, Peter J. Liu
- Abstract summary: A large gap exists between large language models' pass-at-one and pass-at-N performance in solving math problems.
Using the challenging MATH dataset, we investigate three fine-tuning strategies.
We design a fine-tuning recipe that yields approximately 58.8% accuracy on the MATH dataset with fine-tuned PaLM 2-L models.
- Score: 20.417053742869403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite their success in many natural language tasks, solving math problems
remains a significant challenge for large language models (LLMs). A large gap
exists between LLMs' pass-at-one and pass-at-N performance in solving math
problems, suggesting LLMs might be close to finding correct solutions,
motivating our exploration of fine-tuning methods to unlock LLMs' performance.
Using the challenging MATH dataset, we investigate three fine-tuning
strategies: (1) solution fine-tuning, where we fine-tune to generate a detailed
solution for a given math problem; (2) solution-cluster re-ranking, where the
LLM is fine-tuned as a solution verifier/evaluator to choose among generated
candidate solution clusters; (3) multi-task sequential fine-tuning, which
integrates both solution generation and evaluation tasks together efficiently
to enhance the LLM performance. With these methods, we present a thorough
empirical study on a series of PaLM 2 models and find: (1) The quality and
style of the step-by-step solutions used for fine-tuning can make a significant
impact on the model performance; (2) While solution re-ranking and majority
voting are both effective for improving the model performance when used
separately, they can also be used together for an even greater performance
boost; (3) Multi-task fine-tuning that sequentially separates the solution
generation and evaluation tasks can offer improved performance compared with
the solution fine-tuning baseline. Guided by these insights, we design a
fine-tuning recipe that yields approximately 58.8% accuracy on the MATH dataset
with fine-tuned PaLM 2-L models, an 11.2% accuracy improvement over the
few-shot performance of the pre-trained PaLM 2-L model with majority voting.
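As a rough illustration of finding (2) in the abstract, the sketch below contrasts plain majority voting with a score-weighted vote over answer clusters, and shows how pass-at-one and pass-at-N can differ on the same samples. The function names, the toy answers, and the evaluator scores are all hypothetical, and summing evaluator scores per cluster is an assumed combination rule, not necessarily the one used in the paper.

```python
from collections import defaultdict


def majority_vote(answers):
    """Return the final answer that appears most often among the samples."""
    counts = defaultdict(int)
    for answer in answers:
        counts[answer] += 1
    # Ties resolve to the answer that was seen first.
    return max(counts, key=counts.get)


def weighted_cluster_vote(answers, scores):
    """Group samples by final answer and pick the cluster with the largest
    summed evaluator score (an assumed way to combine re-ranking and voting)."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)


def pass_at_one_estimate(answers, gold):
    """Fraction of individual samples whose final answer is correct."""
    return sum(answer == gold for answer in answers) / len(answers)


def pass_at_n(answers, gold):
    """True if at least one of the N samples has the correct final answer."""
    return any(answer == gold for answer in answers)


# Hypothetical data: five sampled solutions to one problem, reduced to their
# final answers, plus made-up evaluator scores in [0, 1].
sample_answers = ["42", "42", "41", "41", "41"]
sample_scores = [0.9, 0.8, 0.2, 0.1, 0.3]
gold_answer = "42"

print("majority vote:      ", majority_vote(sample_answers))
print("weighted cluster:   ", weighted_cluster_vote(sample_answers, sample_scores))
print("pass-at-one (est.): ", pass_at_one_estimate(sample_answers, gold_answer))
print("pass-at-N:          ", pass_at_n(sample_answers, gold_answer))
```

In this toy case the most frequent answer is wrong, so plain majority voting fails, while the evaluator scores shift the vote to the correct cluster. The correct answer is present among the samples (pass-at-N succeeds) even though most individual samples miss it, which is the kind of gap the abstract describes.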
Related papers
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
- Autoformulation of Mathematical Optimization Models Using LLMs [50.030647274271516]
We develop an automated approach to creating optimization models from natural language descriptions for commercial solvers.
We identify the three core challenges of autoformulation: (1) defining the vast, problem-dependent hypothesis space, (2) efficiently searching this space under uncertainty, and (3) evaluating formulation correctness.
arXiv Detail & Related papers (2024-11-03T20:41:38Z)
- Solving General Natural-Language-Description Optimization Problems with Large Language Models [34.50671063271608]
We propose a novel framework called OptLLM that augments LLMs with external solvers.
OptLLM accepts user queries in natural language, converts them into mathematical formulations and programming code, and calls the solvers to calculate the results.
Some features of the OptLLM framework have been available for trial since June 2023.
arXiv Detail & Related papers (2024-07-09T07:11:10Z)
- Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems [59.72548591120689]
We introduce a new benchmark, SearchBench, containing 11 unique search problem types.
We show that even the most advanced LLMs fail to solve these problems end-to-end in text.
Instructing LLMs to generate code that solves the problem helps, but only slightly, e.g., GPT4's performance rises to 11.7%.
arXiv Detail & Related papers (2024-06-18T00:44:58Z)
- Assessing the Emergent Symbolic Reasoning Abilities of Llama Large Language Models [47.129504708849446]
Large Language Models (LLMs) achieve impressive performance in a wide range of tasks.
LLMs show emergent abilities in mathematical reasoning benchmarks.
We evaluate three models of the Llama 2 family on different symbolic reasoning tasks.
arXiv Detail & Related papers (2024-06-05T12:22:43Z)
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
- Thought of Search: Planning with Language Models Through The Lens of Efficiency [22.47015814897628]
We argue that recent trends abandon both soundness and completeness for the sake of inefficiency.
We show that by using LLMs to produce the code for the search components we can solve the entire datasets with 100% accuracy.
arXiv Detail & Related papers (2024-04-18T01:27:29Z)
- V-STaR: Training Verifiers for Self-Taught Reasoners [71.53113558733227]
V-STaR trains a verifier using DPO that judges correctness of model-generated solutions.
Running V-STaR for multiple iterations results in progressively better reasoners and verifiers.
arXiv Detail & Related papers (2024-02-09T15:02:56Z)
- Adaptive-Solver Framework for Dynamic Strategy Selection in Large Language Model Reasoning [34.568072559937455]
Large Language Models (LLMs) are showcasing impressive ability in handling complex reasoning tasks.
Most methodologies that leverage LLMs tend to adopt a uniform approach.
Their inflexibility can bring unnecessary computational overhead or sub-optimal performance.
We introduce an Adaptive-Solver framework that strategically modulates solving strategies based on the difficulties of the problems.
arXiv Detail & Related papers (2023-10-01T12:28:36Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.