Related papers: E-Gen: Leveraging E-Graphs to Improve Continuous Representations of Symbolic Expressions

E-Gen: Leveraging E-Graphs to Improve Continuous Representations of Symbolic Expressions

URL: http://arxiv.org/abs/2501.14951v2
Date: Sun, 09 Mar 2025 20:31:19 GMT
Title: E-Gen: Leveraging E-Graphs to Improve Continuous Representations of Symbolic Expressions
Authors: Hongbo Zheng, Suyuan Wang, Neeraj Gangwar, Nickvash Kani,
Abstract summary: We introduce E-Gen, a novel e-graph-based dataset generation scheme that synthesizes large and diverse mathematical expression datasets.<n>We train embedding models using two strategies: generating mathematically equivalent expressions, and contrastive learning to explicitly group equivalent expressions.<n>We demonstrate that our embedding-based approach outperforms state-of-the-art large language models on several tasks.
Score: 0.33748750222488655
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vector representations have been pivotal in advancing natural language processing (NLP), with prior research focusing on embedding techniques for mathematical expressions using mathematically equivalent formulations. While effective, these approaches are constrained by the size and diversity of training data. In this work, we address these limitations by introducing E-Gen, a novel e-graph-based dataset generation scheme that synthesizes large and diverse mathematical expression datasets, surpassing prior methods in size and operator variety. Leveraging this dataset, we train embedding models using two strategies: (1) generating mathematically equivalent expressions, and (2) contrastive learning to explicitly group equivalent expressions. We evaluate these embeddings on both in-distribution and out-of-distribution mathematical language processing tasks, comparing them against prior methods. Finally, we demonstrate that our embedding-based approach outperforms state-of-the-art large language models (LLMs) on several tasks, underscoring the necessity of optimizing embedding methods for the mathematical data modality. The source code and datasets are available at https://github.com/MLPgroup/E-Gen.

Related papers

Bridging the Domain Gap in Equation Distillation with Reinforcement Feedback [37.06543502352577]
We propose a reinforcement learning-based finetuning framework to enhance the domain adaptability of foundation models for Data2Eqn tasks.<n>Our method allows the model to adapt to specific and complex data distributions and generate mathematically meaningful equations.
arXiv Detail & Related papers (2025-05-21T14:25:41Z)
Relation-Aware Graph Foundation Model [21.86954503656643]
A graph foundation model (GFMs) has emerged as a promising direction in graph learning.<n>Unlike language models that rely on explicit token representations, graphs lack a well-defined unit for generalization.<n>We propose REEF, a novel framework that leverages relation tokens as the basic units for GFMs.
arXiv Detail & Related papers (2025-05-17T14:34:41Z)
MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task [49.355810887265925]
We introduce MathFimer, a novel framework for mathematical reasoning step expansion. We develop a specialized model, MathFimer-7B, on our carefully curated NuminaMath-FIM dataset. We then apply these models to enhance existing mathematical reasoning datasets by inserting detailed intermediate steps into their solution chains.
arXiv Detail & Related papers (2025-02-17T11:22:24Z)
Tensor Network Estimation of Distribution Algorithms [0.0]
Methods integrating tensor networks into evolutionary optimization algorithms have appeared in the recent literature.<n>We find that optimization performance of these methods is not related to the power of the generative model in a straightforward way.<n>In light of this we find that adding an explicit mutation operator to the output of the generative model often improves optimization performance.
arXiv Detail & Related papers (2024-12-27T18:22:47Z)
Interpetable Target-Feature Aggregation for Multi-Task Learning based on Bias-Variance Analysis [53.38518232934096]
Multi-task learning (MTL) is a powerful machine learning paradigm designed to leverage shared knowledge across tasks to improve generalization and performance. We propose an MTL approach at the intersection between task clustering and feature transformation based on a two-phase iterative aggregation of targets and features. In both phases, a key aspect is to preserve the interpretability of the reduced targets and features through the aggregation with the mean, which is motivated by applications to Earth science.
arXiv Detail & Related papers (2024-06-12T08:30:16Z)
Mathematical Language Models: A Survey [29.419915295762692]
This paper conducts a comprehensive survey of mathematical LMs, systematically categorizing pivotal research endeavors from two distinct perspectives: tasks and methodologies. The survey entails the compilation of over 60 mathematical datasets, including training datasets, benchmark datasets, and augmented datasets.
arXiv Detail & Related papers (2023-12-12T01:39:16Z)
Investigating Masking-based Data Generation in Language Models [0.0]
A feature of BERT and models with similar architecture is the objective of masked language modeling. Data augmentation is a data-driven technique widely used in machine learning. Recent studies have utilized masked language model to generate artificially augmented data for NLP downstream tasks.
arXiv Detail & Related papers (2023-06-16T16:48:27Z)
BERT is not The Count: Learning to Match Mathematical Statements with Proofs [34.61792250254876]
The task fits well within current research on Mathematical Information Retrieval and, more generally, mathematical article analysis. We present a dataset consisting of over 180k statement-proof pairs extracted from modern mathematical research articles. We propose a bilinear similarity model and two decoding methods to match statements to proofs effectively.
arXiv Detail & Related papers (2023-02-18T14:48:20Z)
ProjB: An Improved Bilinear Biased ProjE model for Knowledge Graph Completion [1.5576879053213302]
This work improves on ProjE KGE due to low computational complexity and high potential for model improvement. Experimental results on benchmark Knowledge Graphs (KGs) such as FB15K and WN18 show that the proposed approach outperforms the state-of-the-art models in entity prediction task.
arXiv Detail & Related papers (2022-08-15T18:18:05Z)
Understanding High Dimensional Spaces through Visual Means Employing Multidimensional Projections [0.0]
Two of the relevant algorithms in the data visualisation field are t-distributed neighbourhood embedding (t-SNE) and Least-Square Projection (LSP) These algorithms can be used to understand several ranges of mathematical functions including their impact on datasets. We illustrate ways of employing the visual results of multidimensional projection algorithms to understand and fine-tune the parameters of their mathematical framework.
arXiv Detail & Related papers (2022-07-12T20:30:33Z)
Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR) Specifically, we propose to inject the standard Gaussian noise and regularize hidden representations of the fine-tuned model. We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
arXiv Detail & Related papers (2022-06-12T04:42:49Z)
Syntax-Aware Network for Handwritten Mathematical Expression Recognition [53.130826547287626]
Handwritten mathematical expression recognition (HMER) is a challenging task that has many potential applications. Recent methods for HMER have achieved outstanding performance with an encoder-decoder architecture. We propose a simple and efficient method for HMER, which is the first to incorporate syntax information into an encoder-decoder network.
arXiv Detail & Related papers (2022-03-03T09:57:19Z)
Improving Non-autoregressive Generation with Mixup Training [51.61038444990301]
We present a non-autoregressive generation model based on pre-trained transformer models. We propose a simple and effective iterative training method called MIx Source and pseudo Target. Our experiments on three generation benchmarks including question generation, summarization and paraphrase generation, show that the proposed framework achieves the new state-of-the-art results.
arXiv Detail & Related papers (2021-10-21T13:04:21Z)
Improving Compositional Generalization with Self-Training for Data-to-Text Generation [36.973617793800315]
We study the compositional generalization of current generation models in data-to-text tasks. By simulating structural shifts in the compositional Weather dataset, we show that T5 models fail to generalize to unseen structures. We propose an approach based on self-training using finetuned BLEURT for pseudo-response selection.
arXiv Detail & Related papers (2021-10-16T04:26:56Z)
Dual Optimization for Kolmogorov Model Learning Using Enhanced Gradient Descent [8.714458129632158]
Kolmogorov model (KM) is an interpretable and predictable representation approach to learning the underlying probabilistic structure of a set of random variables. We propose a computationally scalable KM learning algorithm, based on the regularized dual optimization combined with enhanced gradient descent (GD) method. It is shown that the accuracy of logical relation mining for interpretability by using the proposed KM learning algorithm exceeds $80%$.
arXiv Detail & Related papers (2021-07-11T10:33:02Z)
Learning to Synthesize Data for Semantic Parsing [57.190817162674875]
We propose a generative model which models the composition of programs and maps a program to an utterance. Due to the simplicity of PCFG and pre-trained BART, our generative model can be efficiently learned from existing data at hand. We evaluate our method in both in-domain and out-of-domain settings of text-to-Query parsing on the standard benchmarks of GeoQuery and Spider.
arXiv Detail & Related papers (2021-04-12T21:24:02Z)
SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation. Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
Multiple Word Embeddings for Increased Diversity of Representation [15.279850826041066]
We show a technique that substantially and consistently improves performance over a strong baseline with negligible increase in run time. We analyze aspects of pre-trained embedding similarity and vocabulary coverage and find that the representational diversity is the driving force of why this technique works.
arXiv Detail & Related papers (2020-09-30T02:33:09Z)
Stochastic Flows and Geometric Optimization on the Orthogonal Group [52.50121190744979]
We present a new class of geometrically-driven optimization algorithms on the orthogonal group $O(d)$. We show that our methods can be applied in various fields of machine learning including deep, convolutional and recurrent neural networks, reinforcement learning, flows and metric learning.
arXiv Detail & Related papers (2020-03-30T15:37:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.