An In-Context Learning Agent for Formal Theorem-Proving
- URL: http://arxiv.org/abs/2310.04353v4
- Date: Thu, 8 Feb 2024 00:32:00 GMT
- Title: An In-Context Learning Agent for Formal Theorem-Proving
- Authors: Amitayush Thakur, George Tsoukalas, Yeming Wen, Jimmy Xin, Swarat
Chaudhuri
- Abstract summary: We present an in-proving learning agent for formal theorem-context in environments like Lean and Coq.
COPRA repeatedly asks a large language model to propose tactic applications from within a stateful backtracking search.
We evaluate our implementation of COPRA on the miniF2F benchmark for Lean and a set of Coq tasks from the CompCert project.
- Score: 11.433505167967983
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present an in-context learning agent for formal theorem-proving in
environments like Lean and Coq. Current state-of-the-art models for the problem
are finetuned on environment-specific proof data. By contrast, our approach,
called COPRA, repeatedly asks a high-capacity, general-purpose large language
model (GPT-4) to propose tactic applications from within a stateful
backtracking search. Proposed tactics are executed in the underlying proof
environment. Feedback from the execution is used to build the prompt for the
next model query, along with selected information from the search history and
lemmas retrieved from an external database. We evaluate our implementation of
COPRA on the miniF2F benchmark for Lean and a set of Coq tasks from the
CompCert project. On these benchmarks, COPRA significantly outperforms few-shot
invocations of GPT-4. It also compares favorably against finetuning-based
approaches, outperforming REPROVER, a state-of-the-art finetuned approach for
Lean, in terms of the pass@1 metric. Our code and data are available at
https://github.com/trishullab/copra.
Related papers
- BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval [54.54576644403115]
Many complex real-world queries require in-depth reasoning to identify relevant documents.
We introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents.
brightbenchmark is constructed from the 1,398 real-world queries collected from diverse domains.
arXiv Detail & Related papers (2024-07-16T17:58:27Z) - Reinforcement Replaces Supervision: Query focused Summarization using
Deep Reinforcement Learning [43.123290672073814]
We deal with systems that generate summaries from document(s) based on a query.
Motivated by the insight that Reinforcement Learning (RL) provides a generalization to Supervised Learning (SL) for Natural Language Generation, we use an RL-based approach for this task.
We develop multiple Policy Gradient networks, trained on various reward signals: ROUGE, BLEU, and Semantic Similarity.
arXiv Detail & Related papers (2023-11-29T10:38:16Z) - Investigating Data Contamination in Modern Benchmarks for Large Language Models [27.479260572913724]
Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs.
We study data contamination by proposing two methods tailored for both open-source and proprietary LLMs.
We find that certain commercial LLMs could surprisingly guess the missing option in various test sets.
arXiv Detail & Related papers (2023-11-16T11:03:04Z) - LlamaRec: Two-Stage Recommendation using Large Language Models for
Ranking [10.671747198171136]
We propose a two-stage framework using large language models for ranking-based recommendation (LlamaRec)
In particular, we use small-scale sequential recommenders to retrieve candidates based on the user interaction history.
LlamaRec consistently achieves datasets superior performance in both recommendation performance and efficiency.
arXiv Detail & Related papers (2023-10-25T06:23:48Z) - DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for
In-Context Learning [66.85379279041128]
In this study, we introduce a framework that leverages Dual Queries and Low-rank approximation Re-ranking to automatically select exemplars for in-context learning.
DQ-LoRe significantly outperforms prior state-of-the-art methods in the automatic selection of exemplars for GPT-4, enhancing performance from 92.5% to 94.2%.
arXiv Detail & Related papers (2023-10-04T16:44:37Z) - LeanDojo: Theorem Proving with Retrieval-Augmented Language Models [72.54339382005732]
Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean.
Existing methods are difficult to reproduce or build on, due to private code, data, and compute requirements.
This paper introduces LeanDojo: an open-source Lean toolkit consisting of toolkits, data, models.
We develop ReProver: an LLM-based prover augmented with retrieval for selecting premises from a vast math library.
arXiv Detail & Related papers (2023-06-27T17:05:32Z) - Allies: Prompting Large Language Model with Beam Search [107.38790111856761]
In this work, we propose a novel method called ALLIES.
Given an input query, ALLIES leverages LLMs to iteratively generate new queries related to the original query.
By iteratively refining and expanding the scope of the original query, ALLIES captures and utilizes hidden knowledge that may not be directly through retrieval.
arXiv Detail & Related papers (2023-05-24T06:16:44Z) - Improving Code Example Recommendations on Informal Documentation Using
BERT and Query-Aware LSH: A Comparative Study [0.0]
The focus of our study is Stack Overflow, a commonly used resource for coding discussions and solutions.
We applied BERT, a powerful Large Language Model (LLM), to transform code examples into numerical vectors.
Once these numerical representations are prepared, we identify Approximate Nearest Neighbors (ANN) using Locality-Sensitive Hashing (LSH)
Our study revealed that the Query-Aware (QA) approach showed superior performance over the Random Hyperplane-based (RH) method.
arXiv Detail & Related papers (2023-05-04T17:43:19Z) - UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval is to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z) - Pre-training Is (Almost) All You Need: An Application to Commonsense
Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.