Lean-ing on Quality: How High-Quality Data Beats Diverse Multilingual Data in AutoFormalization
- URL: http://arxiv.org/abs/2502.15795v1
- Date: Tue, 18 Feb 2025 19:16:54 GMT
- Title: Lean-ing on Quality: How High-Quality Data Beats Diverse Multilingual Data in AutoFormalization
- Authors: Willy Chan, Michael Souliman, Jakob Nordhagen, Brando Miranda, Elyas Obbad, Kai Fronsdal, Sanmi Koyejo
- Abstract summary: We introduce a novel methodology that leverages back-translation with hand-curated prompts to enhance the mathematical capabilities of language models. We show that our approaches surpass the performance of fine-tuning with extensive multilingual datasets. Taken together, our methods show a promising new approach to significantly reduce the resources required to formalize proofs, thereby accelerating AI for math.
- Score: 1.204553980682492
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoformalization, the process of transforming informal mathematical language into formal specifications and proofs, remains a difficult task for state-of-the-art (large) language models. Existing works point to competing explanations for the performance gap. To this end, we introduce a novel methodology that leverages back-translation with hand-curated prompts to enhance the mathematical capabilities of language models, particularly addressing the challenge posed by the scarcity of labeled data. Specifically, we evaluate three primary variations of this strategy: (1) on-the-fly (online) back-translation, (2) distilled (offline) back-translation with few-shot amplification, and (3) line-by-line proof analysis integrated with proof state information. Each variant is designed to optimize data quality over quantity, focusing on the high fidelity of generated proofs rather than sheer data scale. Our findings provide evidence that employing our proposed approaches to generate synthetic data, which prioritizes quality over volume, improves the autoformalization performance of LLMs as measured by standard benchmarks such as ProofNet. Crucially, our approach outperforms pretrained models using a minimal number of tokens. We also show, through strategic prompting and back-translation, that our approaches surpass the performance of fine-tuning with extensive multilingual datasets such as MMA on ProofNet with only 1/150th of the tokens. Taken together, our methods show a promising new approach to significantly reduce the resources required to formalize proofs, thereby accelerating AI for math.
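For concreteness, below is a minimal sketch of what the distilled (offline) back-translation variant could look like as a data-generation step: formal Lean statements are informalized once by a teacher LLM steered with a few hand-curated exemplars, and the resulting pairs are reversed into (informal → formal) fine-tuning examples. This is an illustration under stated assumptions, not the authors' implementation; the function names, the exemplar, and the `llm_generate` callback are placeholders.

```python
# Sketch of distilled (offline) back-translation with few-shot amplification.
# Assumption: the caller supplies llm_generate, a text-in/text-out LLM call.

from typing import Callable, List, Tuple

# Hand-curated (formal Lean, informal English) exemplars used to steer the teacher model.
FEW_SHOT_EXAMPLES: List[Tuple[str, str]] = [
    ("theorem add_comm (a b : ℕ) : a + b = b + a",
     "For natural numbers a and b, a + b equals b + a."),
]

def build_prompt(formal: str) -> str:
    """Assemble a few-shot prompt asking the teacher LLM to informalize a Lean statement."""
    shots = "\n\n".join(f"Lean:\n{f}\nEnglish:\n{i}" for f, i in FEW_SHOT_EXAMPLES)
    return f"{shots}\n\nLean:\n{formal}\nEnglish:\n"

def distill_backtranslations(
    formal_statements: List[str],
    llm_generate: Callable[[str], str],
) -> List[dict]:
    """Back-translate formal statements into informal English once, offline,
    then reverse the direction to obtain (informal -> formal) training pairs."""
    pairs = []
    for formal in formal_statements:
        informal = llm_generate(build_prompt(formal)).strip()
        # The student model is fine-tuned to map informal text back to formal Lean.
        pairs.append({"prompt": informal, "completion": formal})
    return pairs
```

On this reading, the on-the-fly (online) variant would generate such pairs during training rather than as a fixed offline dataset, and the third variant would additionally attach proof state information to each line of a proof.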
Related papers
- Distillation and Refinement of Reasoning in Small Language Models for Document Re-ranking [21.23826888841565]
We present a novel approach for training small language models for reasoning-intensive document ranking.
We use web data and a teacher LLM to automatically generate high-quality training examples with relevance explanations.
Our model ranks third on the leaderboard while using substantially fewer parameters than other approaches.
arXiv Detail & Related papers (2025-04-04T21:27:48Z)
- Simplifying Formal Proof-Generating Models with ChatGPT and Basic Searching Techniques [0.0]
This paper explores the integration of ChatGPT and basic searching techniques to simplify generating formal proofs. We demonstrate how combining a large language model like ChatGPT with a formal language such as Lean, which has the added advantage of being verifiable, enhances the efficiency and accessibility of formal proof generation.
arXiv Detail & Related papers (2025-02-05T16:21:10Z)
- Unsupervised Pre-training with Language-Vision Prompts for Low-Data Instance Segmentation [105.23631749213729]
We propose a novel method for unsupervised pre-training in low-data regimes.
Inspired by the recently successful prompting technique, we introduce a new method, Unsupervised Pre-training with Language-Vision Prompts.
We show that our method can converge faster and perform better than CNN-based models in low-data regimes.
arXiv Detail & Related papers (2024-05-22T06:48:43Z)
- Improving Attributed Text Generation of Large Language Models via Preference Learning [28.09715554543885]
We model the attribution task as preference learning and introduce an Automatic Preference Optimization framework.
APO achieves state-of-the-art citation F1 with higher answer quality.
arXiv Detail & Related papers (2024-03-27T09:19:13Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Efficient Entity Candidate Generation for Low-Resource Languages [13.789451365205665]
Candidate generation is a crucial module in entity linking.
It plays a key role in multiple NLP tasks that have been proven to beneficially leverage knowledge bases.
This paper constitutes an in-depth analysis of the candidate generation problem in the context of cross-lingual entity linking.
arXiv Detail & Related papers (2022-06-30T09:49:53Z)
- Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework named Layerwise Noise Stability Regularization (LNSR).
Specifically, we propose to inject standard Gaussian noise and regularize the hidden representations of the fine-tuned model.
We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
arXiv Detail & Related papers (2022-06-12T04:42:49Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Dynamic Data Selection and Weighting for Iterative Back-Translation [116.14378571769045]
We propose a curriculum learning strategy for iterative back-translation models.
We evaluate our models on domain adaptation, low-resource, and high-resource MT settings.
Experimental results demonstrate that our methods achieve improvements of up to 1.8 BLEU points over competitive baselines.
arXiv Detail & Related papers (2020-04-07T19:49:58Z)