Multilingual Mathematical Autoformalization
- URL: http://arxiv.org/abs/2311.03755v2
- Date: Thu, 9 Nov 2023 09:58:15 GMT
- Title: Multilingual Mathematical Autoformalization
- Authors: Albert Q. Jiang, Wenda Li, Mateja Jamnik
- Abstract summary: Autoformalization is the task of translating natural language materials into machine-verifiable formalisations.
Existing methods tend to circumvent the scarcity of aligned informal-formal data by manually curating small corpora.
In this work, we create $\texttt{MMA}$, a large, flexible, multilingual, and multi-domain dataset of informal-formal pairs.
- Score: 14.433478397963123
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoformalization is the task of translating natural language materials into
machine-verifiable formalisations. Progress in autoformalization research is
hindered by the lack of a sizeable dataset consisting of informal-formal pairs
expressing the same essence. Existing methods tend to circumvent this challenge
by manually curating small corpora or using few-shot learning with large
language models. But these methods suffer from data scarcity and formal
language acquisition difficulty. In this work, we create $\texttt{MMA}$, a
large, flexible, multilingual, and multi-domain dataset of informal-formal
pairs, by using a language model to translate in the reverse direction, that
is, from formal mathematical statements into corresponding informal ones.
Experiments show that language models fine-tuned on $\texttt{MMA}$ produce
$16-18\%$ of statements acceptable with minimal corrections on the
$\texttt{miniF2F}$ and $\texttt{ProofNet}$ benchmarks, up from $0\%$ with the
base model. We demonstrate that fine-tuning on multilingual formal data results
in more capable autoformalization models even when deployed on monolingual
tasks.
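The data-generation step at the heart of $\texttt{MMA}$ is straightforward to sketch: take formal statements (e.g., Isabelle or Lean) and have a language model translate them back into natural language, yielding aligned informal-formal pairs that are then used for forward-direction fine-tuning. The sketch below is a minimal illustration of that recipe; the `complete` function and the prompt wording are placeholders, not the paper's actual pipeline.

```python
# Minimal sketch of reverse-direction data generation (informalization).
# `complete` is a placeholder for any instruction-following LLM API call;
# the prompt wording here is illustrative, not the one used for MMA.

def complete(prompt: str) -> str:
    """Stand-in for an LLM call (e.g., a chat-completion client)."""
    raise NotImplementedError("plug in your model client here")

PROMPT_TEMPLATE = (
    "Translate the following {language} statement into a concise "
    "natural-language mathematical statement:\n\n{formal}\n"
)

def informalize(formal_statement: str, language: str = "Isabelle") -> dict:
    """Produce one informal-formal training pair from a formal statement."""
    informal = complete(PROMPT_TEMPLATE.format(language=language,
                                               formal=formal_statement))
    # The pair is stored in the *forward* direction used for fine-tuning:
    # informal statement in, formal statement out.
    return {"input": informal, "output": formal_statement}

# Example (would run once `complete` is implemented):
# pair = informalize('lemma "(a::nat) + b = b + a"', language="Isabelle")
```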
Related papers
- Large Language Bayes [22.372504018202154]
This paper takes an informal problem description as input, and combines a large language model and a probabilistic programming language.
A posterior over latent variables follows by conditioning on observed data and integrating over formal models.
We show that this produces sensible predictions without the need to specify a formal model.
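The "integrating over formal models" step can be pictured as ordinary Bayesian model averaging. The toy sketch below averages the posterior predictions of a few hand-written Beta-Bernoulli models, weighted by their marginal likelihoods; in the paper, the candidate models would instead be probabilistic programs proposed by a large language model from the informal problem description.

```python
# Toy illustration of "integrating over formal models": Bayesian model
# averaging over a few hand-written Beta-Bernoulli models. The fixed priors
# below are illustrative stand-ins for LLM-proposed probabilistic programs.
from math import lgamma, exp

def log_beta(a: float, b: float) -> float:
    return lgamma(a) + lgamma(b) - lgamma(a + b)

# Candidate "formal models": Beta(a, b) priors on a coin's heads probability.
models = {"uniform": (1.0, 1.0), "fair-ish": (20.0, 20.0), "biased": (1.0, 5.0)}

heads, tails = 7, 3  # observed data

weights, predictive = {}, {}
for name, (a, b) in models.items():
    # Marginal likelihood of the observed sequence under this model.
    log_ml = log_beta(a + heads, b + tails) - log_beta(a, b)
    weights[name] = exp(log_ml)
    # Posterior predictive probability that the next flip is heads.
    predictive[name] = (a + heads) / (a + b + heads + tails)

z = sum(weights.values())
p_next_heads = sum(weights[m] / z * predictive[m] for m in models)
print(f"model-averaged P(next flip = heads) = {p_next_heads:.3f}")
```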
arXiv Detail & Related papers (2025-04-18T18:30:29Z)
- Type-Constrained Code Generation with Language Models [51.03439021895432]
Large language models (LLMs) produce uncompilable output because their next-token inference procedure does not model formal aspects of code.
We introduce a type-constrained decoding approach that leverages type systems to guide code generation.
Our approach reduces compilation errors by more than half and increases functional correctness in code synthesis, translation, and repair tasks.
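The underlying decoding recipe is general: at every step, mask out candidate tokens that cannot lead to a well-formed, well-typed completion, then choose among the remaining ones. The toy sketch below applies this to a tiny arithmetic-expression grammar with a stub scorer standing in for LLM logits; the actual method enforces a real type system during LLM decoding.

```python
# Toy sketch of type/grammar-constrained decoding: at each step, candidate
# tokens that cannot keep the output well-formed are masked out before the
# highest-scoring remaining token is chosen. A real system would filter an
# LLM's logits against a full type system; the scorer here is a stub.
VOCAB = ["1", "2", "+", "*", "(", ")", "<eos>"]

def allowed(token, expect, depth):
    """Return (ok, new_expect, new_depth) for appending `token`."""
    if token in "12":
        return (expect == "operand", "operator", depth)
    if token == "(":
        return (expect == "operand", "operand", depth + 1)
    if token == ")":
        return (expect == "operator" and depth > 0, "operator", depth - 1)
    if token in "+*":
        return (expect == "operator", "operand", depth)
    # <eos>: only when the expression is complete and all parens are closed.
    return (expect == "operator" and depth == 0, expect, depth)

def stub_score(token, prefix):
    """Stand-in for model logits; favours <eos> as the output grows."""
    base = {"1": 1.2, "2": 0.9, "+": 1.0, "*": 0.5, "(": 0.8, ")": 1.1}
    return 0.35 * len(prefix) if token == "<eos>" else base[token]

def constrained_decode(max_steps=20):
    prefix, expect, depth = [], "operand", 0
    for _ in range(max_steps):
        candidates = [(stub_score(t, prefix), t) for t in VOCAB
                      if allowed(t, expect, depth)[0]]
        _, best = max(candidates)  # greedy choice among legal tokens only
        if best == "<eos>":
            break
        prefix.append(best)
        _, expect, depth = allowed(best, expect, depth)
    return " ".join(prefix)

print(constrained_decode())  # always a well-formed arithmetic expression
```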
arXiv Detail & Related papers (2025-04-12T15:03:00Z)
- Large corpora and large language models: a replicable method for automating grammatical annotation [0.0]
We introduce a methodological pipeline applied to the case study of formal variation in the English evaluative verb construction 'consider X (as) (to be) Y'.
We reach a model accuracy of over 90% on our held-out test samples with only a small amount of training data.
We discuss the generalisability of our results for a wider range of case studies of grammatical constructions and grammatical variation and change.
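A minimal version of such a pipeline is a classification prompt plus an accuracy check against a small gold-annotated sample. The sketch below only assumes the general shape of that workflow; the prompt wording, label set, and the `classify` stub are illustrative, not the authors' setup.

```python
# Minimal sketch of LLM-assisted grammatical annotation: prompt a model to
# label corpus hits of 'consider X (as) (to be) Y' by variant, then check
# agreement with a small hand-annotated gold sample. Prompt and labels are
# illustrative assumptions, not the paper's pipeline.
LABELS = ["consider X Y", "consider X as Y", "consider X to be Y"]

PROMPT = (
    "Which variant of the evaluative 'consider' construction does this "
    "sentence use? Answer with exactly one of: {labels}\n\nSentence: {sentence}"
)

def classify(sentence: str) -> str:
    """Stand-in for an LLM call returning one label from LABELS."""
    raise NotImplementedError("plug in a model client here")

def accuracy(gold: list[tuple[str, str]]) -> float:
    """Fraction of gold (sentence, label) pairs the model labels correctly."""
    hits = sum(classify(sentence) == label for sentence, label in gold)
    return hits / len(gold)
```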
arXiv Detail & Related papers (2024-11-18T03:29:48Z)
- Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation [82.5217996570387]
We adapt a pre-trained language model for auto-regressive text-to-image generation.
We find that pre-trained language models offer limited help.
arXiv Detail & Related papers (2023-11-27T07:19:26Z)
- Machine Translation to Control Formality Features in the Target Language [0.9208007322096532]
This research explores how machine learning methods can be used to translate from English into languages with grammatical formality distinctions.
It was done by training a bilingual model in a formality-controlled setting and comparing its performance with a pre-trained multilingual model.
We evaluate the official formality accuracy (ACC) by comparing the predicted masked tokens with the ground truth.
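The metric itself reduces to the fraction of masked positions where the predicted token matches the reference token; a minimal sketch (tokenisation and masking details omitted):

```python
# Matched-token formality accuracy (ACC): fraction of masked positions where
# the predicted token equals the ground-truth token.
def formality_acc(predicted: list[str], reference: list[str]) -> float:
    assert len(predicted) == len(reference)
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)

print(formality_acc(["V-form", "V-form"], ["V-form", "T-form"]))  # 0.5
```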
arXiv Detail & Related papers (2023-11-22T15:42:51Z)
- In What Languages are Generative Language Models the Most Formal? Analyzing Formality Distribution across Languages [2.457872341625575]
In this work, we focus on one language property highly influenced by culture: formality.
We analyze the formality distributions of XGLM and BLOOM's predictions, two popular generative multilingual language models, in 5 languages.
We classify 1,200 generations per language as formal, informal, or incohesive and measure the impact of the prompt formality on the predictions.
arXiv Detail & Related papers (2023-02-23T19:39:52Z)
- Controlling Translation Formality Using Pre-trained Multilingual Language Models [19.465727478912072]
This paper describes the University of Maryland's submission to the Special Task on Formality Control for Spoken Language Translation at IWSLT.
We investigate to what extent this problem can be addressed with a single multilingual model.
Results show that this strategy can approach the translation quality and formality control achieved by dedicated translation models.
arXiv Detail & Related papers (2022-05-13T13:47:28Z)
- Constrained Language Models Yield Few-Shot Semantic Parsers [73.50960967598654]
We explore the use of large pretrained language models as few-shot semantic parsers.
The goal in semantic parsing is to generate a structured meaning representation given a natural language input.
We use language models to paraphrase inputs into a controlled sublanguage resembling English that can be automatically mapped to a target meaning representation.
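The recipe can be sketched in two steps: constrain the language model to emit a canonical, controlled-English utterance, then map that utterance deterministically to the target meaning representation. The sketch below stubs the constrained-decoding step and uses an invented toy grammar for the second step; it illustrates the idea rather than the paper's grammar.

```python
# Sketch of the two-step recipe: (1) an LM, decoding under constraints,
# paraphrases the input into a canonical controlled-English utterance;
# (2) a deterministic grammar maps the canonical form to a logical form.
# Step (1) is stubbed; the tiny grammar below is an invented illustration.
import re

def paraphrase_to_canonical(utterance: str) -> str:
    """Stand-in for constrained LM decoding into the controlled sublanguage."""
    canned = {"wake me at 7": "create alarm with time 7:00"}
    return canned[utterance]

# Canonical-utterance -> meaning-representation rules (paired templates).
RULES = [
    (re.compile(r"create alarm with time (\S+)"), r"CreateAlarm(time='\1')"),
]

def to_logical_form(canonical: str) -> str:
    for pattern, template in RULES:
        m = pattern.fullmatch(canonical)
        if m:
            return m.expand(template)
    raise ValueError(f"not in the controlled sublanguage: {canonical}")

print(to_logical_form(paraphrase_to_canonical("wake me at 7")))
# CreateAlarm(time='7:00')
```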
arXiv Detail & Related papers (2021-04-18T08:13:06Z)
- Multilingual Answer Sentence Reranking via Automatically Translated Data [97.98885151955467]
We present a study on the design of multilingual Answer Sentence Selection (AS2) models, which are a core component of modern Question Answering (QA) systems.
The main idea is to transfer data, created from one resource rich language, e.g., English, to other languages, less rich in terms of resources.
arXiv Detail & Related papers (2021-02-20T03:52:08Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work presents a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
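One way to picture the character-language-model side: generate candidate corrections (e.g., from a knowledge base) and rank them with a character n-gram model estimated from whatever target-language text has been collected so far. The bigram model, smoothing, and example strings below are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of character-LM reranking for spelling correction: score each
# candidate from a small knowledge base with an add-one-smoothed character
# bigram model trained on whatever target-language text is available.
from collections import Counter
from math import log

def train_bigram(corpus: list[str]):
    bigrams, unigrams = Counter(), Counter()
    for word in corpus:
        chars = "^" + word + "$"               # word-boundary markers
        unigrams.update(chars[:-1])
        bigrams.update(zip(chars, chars[1:]))
    return bigrams, unigrams

def log_prob(word: str, bigrams, unigrams, alpha=1.0, vocab=30) -> float:
    chars = "^" + word + "$"
    return sum(log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * vocab))
               for a, b in zip(chars, chars[1:]))

# Tiny "collected data" and knowledge-base candidates (illustrative only).
seen = ["kitab", "kitablar", "kitob"]
candidates = ["kitob", "ktab", "kitaab"]
bi, uni = train_bigram(seen)
print(max(candidates, key=lambda w: log_prob(w, bi, uni)))
```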
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Data Annealing for Informal Language Understanding Tasks [66.2988222278475]
We propose a data annealing transfer learning procedure to bridge the performance gap on informal language tasks.
It successfully utilizes a pre-trained model such as BERT in informal language.
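The annealing itself is a sampling schedule: early batches come mostly from formal (source) data, and the share of informal (target) data grows as training proceeds. A minimal sketch of such a schedule, assuming a linear ramp (the exact schedule and mixing used in the paper may differ):

```python
# Minimal sketch of a data-annealing schedule: the probability of drawing a
# batch from the informal target-task data rises linearly over training, so
# early steps lean on formal (source) data and later steps on informal
# (target) data. The linear ramp is an illustrative assumption.
import random

def target_fraction(step: int, total_steps: int,
                    start: float = 0.1, end: float = 0.9) -> float:
    """Share of informal target data used at a given training step."""
    return start + (end - start) * step / max(total_steps - 1, 1)

def sample_batch(step, total_steps, formal_batches, informal_batches):
    p_informal = target_fraction(step, total_steps)
    pool = informal_batches if random.random() < p_informal else formal_batches
    return random.choice(pool)

# Usage sketch: for step in range(total_steps):
#     batch = sample_batch(step, total_steps, formal_batches, informal_batches)
#     ... one fine-tuning update on `batch` (e.g., with a BERT-style model) ...
```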
arXiv Detail & Related papers (2020-04-24T09:27:09Z)
- A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.