Neuro-Symbolic Data Generation for Math Reasoning
- URL: http://arxiv.org/abs/2412.04857v1
- Date: Fri, 06 Dec 2024 08:49:49 GMT
- Title: Neuro-Symbolic Data Generation for Math Reasoning
- Authors: Zenan Li, Zhi Zhou, Yuan Yao, Yu-Feng Li, Chun Cao, Fan Yang, Xian Zhang, Xiaoxing Ma
- Abstract summary: We develop an automated method for generating high-quality, supervised mathematical datasets. The method carefully mutates existing math problems, ensuring both diversity and validity of the newly generated problems. Empirical experiments demonstrate the high quality of data generated by the proposed method, and that LLMs, specifically LLaMA-2 and Mistral, surpass their state-of-the-art counterparts when realigned with the generated data.
- Score: 47.00099724151703
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: A critical question about Large Language Models (LLMs) is whether their apparent deficiency in mathematical reasoning is inherent, or merely a result of insufficient exposure to high-quality mathematical data. To explore this, we developed an automated method for generating high-quality, supervised mathematical datasets. The method carefully mutates existing math problems, ensuring both diversity and validity of the newly generated problems. This is achieved by a neuro-symbolic data generation framework that combines the intuitive informalization strengths of LLMs with the precise symbolic reasoning of math solvers, along with projected Markov chain Monte Carlo sampling in the highly irregular symbolic space. Empirical experiments demonstrate the high quality of data generated by the proposed method, and that the LLMs, specifically LLaMA-2 and Mistral, when realigned with the generated data, surpass their state-of-the-art counterparts.
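As a rough illustration of the mutate-and-validate loop the abstract describes, the sketch below random-walks over a toy equation space and keeps only solver-validated states. Everything here (the coefficient-nudging proposal, the sympy-based validity check, the function names) is an illustrative assumption, not the paper's actual pipeline or symbolic space.

```python
# A minimal sketch of projected random-walk sampling over math problems:
# mutate a symbolic equation, then "project" by rejecting proposals the
# solver cannot validate. Purely illustrative.
import random
import sympy as sp

x = sp.symbols("x")

def mutate(eq):
    """Propose a neighboring problem by nudging one integer constant."""
    ints = [a for a in eq.atoms(sp.Integer) if a != 0]
    if not ints:
        return eq
    old = random.choice(ints)
    return eq.subs(old, old + random.choice([-2, -1, 1, 2]))

def is_valid(eq):
    """Validity check: keep only equations the symbolic solver can solve."""
    try:
        return len(sp.solve(eq, x)) > 0
    except Exception:
        return False

def sample_problems(seed, steps=200):
    """Random walk over the symbolic space; invalid states are projected away."""
    current, out = seed, []
    for _ in range(steps):
        proposal = mutate(current)
        if is_valid(proposal):
            current = proposal
            out.append(current)
    return out

new_problems = sample_problems(sp.Eq(2*x**2 - 3*x + 1, 0))
```

Each accepted state is a new, solver-validated variant of the seed problem; the diversity comes from the random walk, the validity from the rejection step.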
Related papers
- RV-Syn: Rational and Verifiable Mathematical Reasoning Data Synthesis based on Structured Function Library [58.404895570822184]
RV-Syn is a novel mathematical synthesis approach.
It generates graphs as solutions by combining Python-formatted functions from its structured function library.
Based on the constructed graph, we achieve solution-guided logic-aware problem generation.
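A minimal sketch of the solution-graph idea, assuming a toy function library; RV-Syn's actual library, graph construction, and problem verbalization are far richer, and every name below is a placeholder.

```python
# Compose library functions into a small computation graph, then execute the
# graph to obtain a verifiable ground-truth answer. Illustrative only.
import random

LIBRARY = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "square": lambda a: a * a,
}

def build_graph(depth=2):
    """Sample a computation graph as nested (op, args) tuples."""
    if depth == 0:
        return random.randint(1, 9)
    op = random.choice(list(LIBRARY))
    arity = 1 if op == "square" else 2
    return (op, [build_graph(depth - 1) for _ in range(arity)])

def evaluate(node):
    """Execute the graph; re-execution later verifies the answer."""
    if isinstance(node, int):
        return node
    op, args = node
    return LIBRARY[op](*(evaluate(a) for a in args))

graph = build_graph()
answer = evaluate(graph)   # a problem is then verbalized from `graph`
```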
arXiv Detail & Related papers (2025-04-29T04:42:02Z) - Lemmanaid: Neuro-Symbolic Lemma Conjecturing [4.583367875081881]
We present the first steps towards a practical neuro-symbolic lemma conjecturing tool, Lemmanaid.
We train an LLM to generate lemma templates that describe the shape of a lemma, and use symbolic methods to fill in the details.
We compare Lemmanaid against an LLM trained to generate complete lemma statements as well as previous fully symbolic conjecturing methods.
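A toy rendering of the template-then-fill division of labor: an LLM proposes the lemma's shape (hard-coded here), and a symbolic step fills the holes. The real tool targets proof-assistant terms, not Python strings; the template syntax and symbol table below are assumptions.

```python
# Fill lemma-template holes by symbolic enumeration. Illustrative only.
from itertools import product

template = "forall x y. ?f(x, ?g(y)) = ?f(?g(y), x)"   # shape an LLM might propose
symbols = {"?f": ["add", "mul"], "?g": ["neg", "succ"]}

def instantiate(tmpl, bindings):
    for hole, sym in bindings.items():
        tmpl = tmpl.replace(hole, sym)
    return tmpl

# Symbolic step: enumerate all hole assignments; a real system would filter
# candidates against the theory's definitions and a counterexample checker.
candidates = [
    instantiate(template, dict(zip(symbols, combo)))
    for combo in product(*symbols.values())
]
```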
arXiv Detail & Related papers (2025-04-07T11:30:36Z) - Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning [27.562284768743694]
Large language models (LLMs) can prove mathematical theorems formally by generating proof steps within a proof system.
We introduce a neuro-symbolic tactic generator that synergizes the mathematical intuition learned by LLMs with domain-specific insights encoded by symbolic methods.
We evaluate our framework on 161 challenging inequalities from multiple mathematics competitions.
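The following sketch caricatures the neural-proposes, symbolic-filters split for inequalities: a candidate rewrite (which an LLM would normally rank among many) is admitted only if a cheap symbolic-side check finds no numeric counterexample. The tactic, the filter, and the sampling check are all assumptions, not the paper's prover integration.

```python
# Filter candidate inequality rewrites by counterexample search. Illustrative.
import random
import sympy as sp

a, b = sp.symbols("a b", positive=True)
goal = a**2 + b**2 >= 2*a*b            # target inequality

def numerically_sound(ineq, trials=200):
    """Reject any rewrite with a numeric counterexample before proving."""
    for _ in range(trials):
        vals = {a: random.uniform(0.1, 10.0), b: random.uniform(0.1, 10.0)}
        if not bool(ineq.subs(vals)):
            return False
    return True

# One hard-coded candidate step (an AM-GM-style rewrite of `goal`):
candidate = sp.expand((a - b)**2) >= 0
steps = [candidate] if numerically_sound(candidate) else []
```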
arXiv Detail & Related papers (2025-02-19T15:54:21Z) - MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task [49.355810887265925]
We introduce MathFimer, a novel framework for mathematical reasoning step expansion.
We train a specialized model, MathFimer-7B, on our carefully curated NuminaMath-FIM dataset.
We then apply these models to enhance existing mathematical reasoning datasets by inserting detailed intermediate steps into their solution chains.
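A schematic of the fill-in-the-middle setup the summary describes: split a solution chain into prefix and suffix, and ask a model to supply the missing intermediate step. The prompt tags and example chain are placeholders, not the paper's format.

```python
# Build one fill-in-the-middle training example from a solution chain.
steps = [
    "Let the number be x.",
    "Then 2x + 3 = 11, so 2x = 8.",
    "Therefore x = 4.",
]

def make_fim_example(chain, hole_idx):
    prefix = " ".join(chain[:hole_idx])
    suffix = " ".join(chain[hole_idx + 1:])
    target = chain[hole_idx]           # the step the model must reconstruct
    prompt = f"<prefix>{prefix}</prefix><suffix>{suffix}</suffix><middle>"
    return prompt, target

prompt, target = make_fim_example(steps, hole_idx=1)
# At expansion time, the trained model's completion is inserted back into the
# chain, yielding a more detailed solution.
```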
arXiv Detail & Related papers (2025-02-17T11:22:24Z) - Advancing Mathematical Reasoning in Language Models: The Impact of Problem-Solving Data, Data Synthesis Methods, and Training Stages [13.377908992869814]
Problem-solving data significantly enhances the model's mathematical capabilities compared to general mathematical corpora.
We identify effective data synthesis methods, demonstrating that the tutorship amplification synthesis method achieves the best performance.
arXiv Detail & Related papers (2025-01-23T12:14:57Z) - An Evolutionary Large Language Model for Hallucination Mitigation [0.0]
We propose EvoLLMs, which automates the generation of high-quality question-answering datasets while minimizing hallucinations. EvoLLMs consistently outperforms human-generated datasets in key metrics such as Depth, Relevance, and Coverage. These results highlight EvoLLMs as a robust and efficient solution for QA dataset generation, significantly reducing the time and resources required for manual curation.
arXiv Detail & Related papers (2024-12-03T19:40:13Z) - Improving Mathematical Reasoning Capabilities of Small Language Models via Feedback-Driven Distillation [15.542737858152053]
Large Language Models (LLMs) demonstrate exceptional reasoning capabilities, often achieving state-of-the-art performance in various tasks.
A promising solution is knowledge distillation, where LLMs transfer reasoning capabilities to Small Language Models (SLMs), enabling wider deployment on low-resource devices.
We propose a Feedback-Driven Distillation (FDD) framework to enhance SLMs' mathematical reasoning capabilities.
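A highly schematic sketch of one feedback-driven distillation round; `teacher`, `student`, and `finetune` are hypothetical stand-ins, not the paper's actual API.

```python
# One round of distillation with a feedback loop. Illustrative only.
def distillation_round(teacher, student, finetune, problems):
    # Teacher LLM writes a rationale for every problem.
    dataset = [(p, teacher.solve(p)) for p in problems]
    finetune(student, dataset)         # distill rationales into the SLM

    # Feedback step: keep the problems the student still gets wrong, so the
    # next round of teacher data targets the student's weaknesses.
    return [p for p in problems if not student.is_correct(p)]

# Iterated use (pseudocode):
#   while problems:
#       problems = distillation_round(teacher, student, finetune, problems)
```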
arXiv Detail & Related papers (2024-11-22T03:12:39Z) - SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
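A sketch of code-as-critic filtering: execute a model-written program and keep the (question, code) pair only if it reproduces the reference answer. The real paradigm uses a trained critic model; the exec-based check and the `answer` variable convention below are simplifying assumptions.

```python
# Admit a question-code pair only if the code runs and yields the answer.
def passes_critic(code: str, expected: str) -> bool:
    scope = {}
    try:
        exec(code, scope)              # run the candidate solution program
    except Exception:
        return False
    return str(scope.get("answer")) == expected

sample = {"question": "What is 3 * (4 + 5)?",
          "code": "answer = 3 * (4 + 5)",
          "expected": "27"}
if passes_critic(sample["code"], sample["expected"]):
    pass  # admit the pair into the training set
```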
arXiv Detail & Related papers (2024-08-28T06:33:03Z) - Discovering physical laws with parallel combinatorial tree search [57.05912962368898]
Symbolic regression plays a crucial role in scientific research thanks to its capability of discovering concise and interpretable mathematical expressions from data.
Existing algorithms have faced a critical bottleneck in accuracy and efficiency for over a decade.
We introduce a parallel combinatorial tree search (PCTS) model to efficiently distill generic mathematical expressions from limited data.
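A serial, brute-force stand-in for combinatorial search in symbolic regression: enumerate a tiny slice of expression trees and keep the best fit. The paper's PCTS adds parallelism and a far more efficient search policy; the operator set and tree shape here are assumptions.

```python
# Enumerate expression trees of shape op1(op2(l1, l2), l3) and score by MSE.
import itertools

X = [1.0, 2.0, 3.0, 4.0]
Y = [2.0, 5.0, 10.0, 17.0]             # hidden law: y = x**2 + 1

OPS = {"add": lambda p, q: p + q,
       "sub": lambda p, q: p - q,
       "mul": lambda p, q: p * q}
LEAVES = {"x": lambda v: v, "one": lambda v: 1.0}

def mse(f):
    return sum((f(v) - y) ** 2 for v, y in zip(X, Y)) / len(X)

best, best_err = None, float("inf")
for (o1, o2), (l1, l2, l3) in itertools.product(
        itertools.product(OPS, repeat=2),
        itertools.product(LEAVES, repeat=3)):
    f = lambda v, o1=o1, o2=o2, l1=l1, l2=l2, l3=l3: \
        OPS[o1](OPS[o2](LEAVES[l1](v), LEAVES[l2](v)), LEAVES[l3](v))
    err = mse(f)
    if err < best_err:
        best, best_err = f, err
# The search recovers add(mul(x, x), one), i.e. y = x**2 + 1, with zero error.
```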
arXiv Detail & Related papers (2024-07-05T10:41:15Z) - Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning [110.80663974060624]
Key-Point-Driven Data Synthesis (KPDDS) is a novel framework that synthesizes question-answer pairs.
KPDDS ensures the generation of novel questions with rigorous quality control and substantial scalability.
We present KPMath, an extensive synthetic dataset tailored for mathematical reasoning, comprising over 800K question-answer pairs.
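An illustrative skeleton of key-point-driven synthesis: condition generation on sampled topic/skill "key points". The key-point table, prompt wording, and `llm` callable are all hypothetical placeholders, not KPDDS internals.

```python
# Sample key points, then prompt a generator conditioned on them.
import random

KEY_POINTS = {
    "ratios": ["set up a proportion", "cross-multiply"],
    "right triangles": ["apply the Pythagorean theorem"],
}

def synthesize(llm):
    topic = random.choice(list(KEY_POINTS))
    skills = "; ".join(KEY_POINTS[topic])
    prompt = (f"Write a new math problem about {topic} that requires: "
              f"{skills}. Then give a step-by-step solution.")
    return llm(prompt)        # downstream quality control filters the output

demo = synthesize(llm=lambda prompt: prompt)   # stub LLM just echoes the prompt
```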
arXiv Detail & Related papers (2024-03-04T18:58:30Z) - Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions [47.83142414018448]
We focus on two popular reasoning tasks: arithmetic reasoning and code generation.
We introduce (i) a general ontology of perturbations for math and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets.
We show a significant performance drop across all models on perturbed questions.
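A toy perturbation in the spirit of such an ontology: numeric substitution for arithmetic questions. The paper's ontology covers many more categories and a semi-automatic application pipeline; this single operator is only an example.

```python
# Perturb a question by replacing each integer with a nearby value,
# preserving the surface form while changing the answer.
import random
import re

def perturb_numbers(question: str) -> str:
    return re.sub(r"\d+",
                  lambda m: str(int(m.group()) + random.randint(1, 5)),
                  question)

original = "A shop sells 12 apples for 4 dollars. What is the price of 3 apples?"
print(perturb_numbers(original))   # same template, new numbers, new answer
```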
arXiv Detail & Related papers (2024-01-17T18:13:07Z) - Generating Mathematical Derivations with Large Language Models [2.363388546004777]
We leverage a symbolic engine to generate derivations of equations at scale.
We investigate the capabilities of Large Language Models when deriving goal equations from premises.
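A minimal sympy sketch of engine-generated derivations: apply recorded symbolic operations to a premise, yielding (premise, operations, goal) examples. The operation set and example format are assumptions, not the paper's engine.

```python
# Derive a goal equation from a premise by a logged symbolic operation.
import sympy as sp

x = sp.symbols("x")
y = sp.Function("y")(x)

premise = sp.Eq(y, x**2 + 1)
step = sp.Eq(sp.diff(premise.lhs, x), sp.diff(premise.rhs, x))   # y'(x) = 2*x

trace = [("premise", premise), ("differentiate both sides", step)]
# An LLM is then asked to rederive `step` given `premise` and the operation.
```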
arXiv Detail & Related papers (2023-07-19T14:13:02Z) - Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics [44.97217246897902]
We address the challenge of using energy-based models to produce high-quality, label-specific data in structured datasets.
Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing.
We use a novel training algorithm that exploits non-equilibrium effects.
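A conceptual sketch of the out-of-equilibrium idea on a toy one-dimensional energy: instead of waiting for MCMC to mix, run short chains of a fixed length from a fixed start, both during training and generation, so the model is trained to be sampled that way. The energy function and chain settings are invented for illustration.

```python
# Short-run, deliberately non-equilibrated Metropolis sampling of a toy EBM.
import math
import random

def energy(v, theta):
    return theta * (v - 2.0) ** 2      # toy parametric energy

def short_run_chain(theta, k=10, step=0.5):
    """k Metropolis steps from a fixed start: non-equilibrium by design."""
    v = 0.0                            # same fixed initialization as training
    for _ in range(k):
        prop = v + random.uniform(-step, step)
        if math.exp(energy(v, theta) - energy(prop, theta)) > random.random():
            v = prop
    return v

samples = [short_run_chain(theta=1.0) for _ in range(1000)]
```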
arXiv Detail & Related papers (2023-07-13T15:08:44Z) - Learning Mixtures of Low-Rank Models [89.39877968115833]
We study the problem of learning mixtures of low-rank models.
We develop an algorithm that is guaranteed to recover the unknown matrices with near-optimal sample complexity.
In addition, the proposed algorithm is provably stable against random noise.
arXiv Detail & Related papers (2020-09-23T17:53:48Z)