RV-Syn: Rational and Verifiable Mathematical Reasoning Data Synthesis based on Structured Function Library
- URL: http://arxiv.org/abs/2504.20426v1
- Date: Tue, 29 Apr 2025 04:42:02 GMT
- Title: RV-Syn: Rational and Verifiable Mathematical Reasoning Data Synthesis based on Structured Function Library
- Authors: Jiapeng Wang, Jinhao Jiang, Zhiqiang Zhang, Jun Zhou, Wayne Xin Zhao
- Abstract summary: RV-Syn is a novel Rational and Verifiable mathematical Synthesis approach. It generates computational graphs as solutions by combining Python-formatted functions from a structured function library. Based on the constructed graph, it achieves solution-guided, logic-aware problem generation.
- Score: 58.404895570822184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advancement of reasoning capabilities in Large Language Models (LLMs) requires substantial amounts of high-quality reasoning data, particularly in mathematics. Existing data synthesis methods, such as data augmentation from annotated training sets or direct question generation based on relevant knowledge points and documents, have expanded datasets but face challenges in mastering the inner logic of the problem during generation and in ensuring the verifiability of the solutions. To address these issues, we propose RV-Syn, a novel Rational and Verifiable mathematical Synthesis approach. RV-Syn constructs a structured mathematical operation function library based on initial seed problems and generates computational graphs as solutions by combining Python-formatted functions from this library. These graphs are then back-translated into complex problems. Based on the constructed computational graph, we achieve solution-guided, logic-aware problem generation. Furthermore, the executability of the computational graph ensures the verifiability of the solving process. Experimental results show that RV-Syn surpasses existing synthesis methods, including those involving human-generated problems, achieving more efficient data scaling. This approach provides a scalable framework for generating high-quality reasoning datasets.
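To make the pipeline concrete, the following is a minimal sketch of the idea under stated assumptions: a toy function library, a computational graph composed from it, execution of the graph for verification, and a fixed template standing in for the paper's LLM-driven back-translation step. All names (`LIBRARY`, `Node`, `execute`) are illustrative, not the authors' implementation.

```python
# Minimal sketch of the RV-Syn idea (hypothetical names throughout; the
# paper's actual library construction and back-translation are LLM-driven).

# Step 1: a tiny "structured function library" of Python-formatted
# operations, of the kind that would be distilled from seed problems.
def add(a: float, b: float) -> float:
    return a + b

def subtract(a: float, b: float) -> float:
    return a - b

def percent_of(x: float, p: float) -> float:
    return x * p / 100.0

LIBRARY = {"add": add, "subtract": subtract, "percent_of": percent_of}

# Step 2: a computational graph (a DAG of library calls) serving as a solution.
class Node:
    def __init__(self, op, inputs):
        self.op = op          # key into LIBRARY
        self.inputs = inputs  # constants or other Nodes

def execute(node):
    """Recursively run the graph; executability is what makes it verifiable."""
    args = [execute(i) if isinstance(i, Node) else i for i in node.inputs]
    return LIBRARY[node.op](*args)

# Step 3: back-translate the graph into a problem (a template stands in here).
price, discount_pct, tax_pct = 80.0, 25.0, 8.0
discount   = Node("percent_of", [price, discount_pct])
discounted = Node("subtract",   [price, discount])
tax        = Node("percent_of", [discounted, tax_pct])
total      = Node("add",        [discounted, tax])

question = (f"A jacket costs ${price:.0f}. It is discounted by "
            f"{discount_pct:.0f}%, and {tax_pct:.0f}% tax is added to the "
            f"discounted price. What is the final price?")
print(question)
print("Verified answer:", execute(total))  # 64.8, obtained by running the graph
```

Because the answer comes from executing the graph rather than from free-form model text, a synthesized problem whose graph fails to execute, or whose result disagrees with the intended answer, can be filtered out; this is the verifiability property the abstract emphasizes.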
Related papers
- Advancing Mathematical Reasoning in Language Models: The Impact of Problem-Solving Data, Data Synthesis Methods, and Training Stages [13.377908992869814]
Problem-solving data significantly enhances the model's mathematical capabilities compared to general mathematical corpora. We identify effective data synthesis methods, demonstrating that the tutorship amplification synthesis method achieves the best performance.
arXiv Detail & Related papers (2025-01-23T12:14:57Z)
- Bridging Visualization and Optimization: Multimodal Large Language Models on Graph-Structured Combinatorial Optimization [56.17811386955609]
Graph-structured challenges are inherently difficult due to their nonlinear and intricate nature. In this study, we propose transforming graphs into images to preserve their higher-order structural features accurately. By combining the innovative paradigm powered by multimodal large language models with simple search techniques, we aim to develop a novel and effective framework.
arXiv Detail & Related papers (2025-01-21T08:28:10Z)
- Data Augmentation with Variational Autoencoder for Imbalanced Dataset [1.2289361708127877]
Learning from an imbalanced distribution presents a major challenge in predictive modeling.
We develop a novel approach for generating data, combining VAE with a smoothed bootstrap, specifically designed to address the challenges of imbalanced regression (IR).
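As a minimal illustration of the resampling half of that method (the VAE coupling is not reproduced here, and the bandwidth rule is an assumption), a smoothed bootstrap resamples the data with replacement and perturbs each draw with kernel noise:

```python
import numpy as np

# Smoothed bootstrap sketch: resample with replacement, then add Gaussian
# kernel noise so synthetic points fall between observed ones.
rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=200)  # a skewed sample

def smoothed_bootstrap(sample, n_new, rng, bandwidth=None):
    if bandwidth is None:
        # Silverman's rule of thumb for a Gaussian kernel
        bandwidth = 1.06 * sample.std() * len(sample) ** (-1 / 5)
    draws = rng.choice(sample, size=n_new, replace=True)
    return draws + rng.normal(scale=bandwidth, size=n_new)

synthetic = smoothed_bootstrap(data, n_new=500, rng=rng)
print(synthetic[:5])
```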
arXiv Detail & Related papers (2024-12-09T22:59:03Z)
- Discovering physical laws with parallel combinatorial tree search [57.05912962368898]
Symbolic regression plays a crucial role in scientific research thanks to its capability of discovering concise and interpretable mathematical expressions from data. Existing algorithms have faced a critical bottleneck of accuracy and efficiency over the past decade. We introduce a parallel combinatorial tree search (PCTS) model to efficiently distill generic mathematical expressions from limited data.
arXiv Detail & Related papers (2024-07-05T10:41:15Z)
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- EPIC: Effective Prompting for Imbalanced-Class Data Synthesis in Tabular Data Classification via Large Language Models [39.347666307218006]
Large language models (LLMs) have demonstrated remarkable in-context learning capabilities across diverse applications. We introduce EPIC, a novel approach that leverages balanced, grouped data samples and consistent formatting with unique variable mapping to guide LLMs in generating accurate synthetic data across all classes, even for imbalanced datasets.
arXiv Detail & Related papers (2024-04-15T17:49:16Z)
- Amortized Inference for Causal Structure Learning [72.84105256353801]
Learning causal structure poses a search problem that typically involves evaluating structures using a score or independence test.
We train a variational inference model to predict the causal structure from observational/interventional data.
Our models exhibit robust generalization capabilities under substantial distribution shift.
arXiv Detail & Related papers (2022-05-25T17:37:08Z)
- A Differentiable Approach to Combinatorial Optimization using Dataless Neural Networks [20.170140039052455]
We propose a radically different approach in that no data is required for training the neural networks that produce the solution.
In particular, we reduce the optimization problem to a neural network and employ a dataless training scheme to refine the parameters of the network such that those parameters yield the structure of interest.
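A hedged toy version of that idea (not the paper's construction): encode a small maximum-independent-set instance directly in a differentiable loss over trainable node variables, so that gradient descent on the graph itself, with no training data, recovers the structure of interest.

```python
import torch

# Dataless-network sketch: the graph alone defines the loss; no data is used.
edges = [(0, 1), (0, 2), (0, 3), (0, 4)]  # star graph: center 0, leaves 1..4
n, gamma = 5, 2.0                         # gamma > 1 penalizes adjacent picks
theta = torch.zeros(n, requires_grad=True)
opt = torch.optim.Adam([theta], lr=0.1)

for _ in range(300):
    x = torch.sigmoid(theta)                      # relaxed selection variables
    penalty = sum(x[i] * x[j] for i, j in edges)  # discourage picking neighbors
    loss = -x.sum() + gamma * penalty             # reward set size
    opt.zero_grad(); loss.backward(); opt.step()

# The recovered structure: the four leaves, the maximum independent set.
print((torch.sigmoid(theta) > 0.5).nonzero().flatten().tolist())  # [1, 2, 3, 4]
```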
arXiv Detail & Related papers (2022-03-15T19:21:31Z)
- MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms [82.90843777097606]
We propose a causally-aware imputation algorithm (MIRACLE) for missing data.
MIRACLE iteratively refines the imputation of a baseline by simultaneously modeling the missingness generating mechanism.
We conduct extensive experiments on synthetic and a variety of publicly available datasets to show that MIRACLE is able to consistently improve imputation.
arXiv Detail & Related papers (2021-11-04T22:38:18Z)
- Learning Time-Varying Graphs from Online Data [39.21234914444073]
This work proposes an algorithmic framework to learn time-varying graphs from online data.
The framework is model-independent, i.e., it can be theoretically analyzed in its abstract formulation. We specialize the framework to three well-known graph learning models, namely the Gaussian graphical model (GGM), the structural equation model (SEM), and the smoothness-based model (SBM).
arXiv Detail & Related papers (2021-10-21T09:46:44Z)
- A Reinforcement Learning Environment for Mathematical Reasoning via Program Synthesis [1.1602089225841632]
We convert the DeepMind Mathematics dataset into a reinforcement learning environment.
Each action taken in the environment adds an operator or an input into a discrete compute graph.
Graphs which compute correct answers yield positive reward, enabling the optimization of a policy to construct compute graphs conditioned on problem statements.
arXiv Detail & Related papers (2021-07-15T14:55:34Z)
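A toy version of such an environment, to make the action space concrete (hypothetical names; not the authors' actual environment):

```python
import operator

# Actions append an input or an operator to a postfix-built compute graph;
# a terminal "submit" action yields reward 1.0 if the graph's value matches
# the reference answer, else 0.0.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

class ComputeGraphEnv:
    def __init__(self, problem_inputs, target):
        self.inputs = list(problem_inputs)  # numbers parsed from the problem
        self.target = target                # reference answer for verification
        self.stack = []                     # graph under construction

    def step(self, action):
        kind, arg = action
        if kind == "push":    # add an input node
            self.stack.append(self.inputs[arg])
        elif kind == "op":    # add an operator node over the top two values
            b, a = self.stack.pop(), self.stack.pop()
            self.stack.append(OPS[arg](a, b))
        done = kind == "submit"
        reward = float(done and len(self.stack) == 1
                       and self.stack[0] == self.target)
        return (self.stack[-1] if self.stack else None), reward, done

# "What is 3 * (4 + 5)?" -> inputs [3, 4, 5], answer 27
env = ComputeGraphEnv([3, 4, 5], target=27)
for a in [("push", 1), ("push", 2), ("op", "add"),
          ("push", 0), ("op", "mul"), ("submit", None)]:
    obs, reward, done = env.step(a)
print(reward)  # 1.0: the constructed graph computes the correct answer
```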