Advancing Natural Language Formalization to First Order Logic with Fine-tuned LLMs
- URL: http://arxiv.org/abs/2509.22338v1
- Date: Fri, 26 Sep 2025 13:30:50 GMT
- Title: Advancing Natural Language Formalization to First Order Logic with Fine-tuned LLMs
- Authors: Felix Vossel, Till Mossakowski, Björn Gehrke,
- Abstract summary: predicate availability boosts performance by 15-20%.<n>Models generalize to unseen logical arguments without specific training.<n>While structural logic translation proves robust, predicate extraction emerges as the main bottleneck.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automating the translation of natural language to first-order logic (FOL) is crucial for knowledge representation and formal methods, yet remains challenging. We present a systematic evaluation of fine-tuned LLMs for this task, comparing architectures (encoder-decoder vs. decoder-only) and training strategies. Using the MALLS and Willow datasets, we explore techniques like vocabulary extension, predicate conditioning, and multilingual training, introducing metrics for exact match, logical equivalence, and predicate alignment. Our fine-tuned Flan-T5-XXL achieves 70% accuracy with predicate lists, outperforming GPT-4o and even the DeepSeek-R1-0528 model with CoT reasoning ability as well as symbolic systems like ccg2lambda. Key findings show: (1) predicate availability boosts performance by 15-20%, (2) T5 models surpass larger decoder-only LLMs, and (3) models generalize to unseen logical arguments (FOLIO dataset) without specific training. While structural logic translation proves robust, predicate extraction emerges as the main bottleneck.
Related papers
- Training LLMs with LogicReward for Faithful and Rigorous Reasoning [75.30425553246177]
We propose LogicReward, a reward system that guides model training by enforcing step-level logical correctness with a theorem prover.<n>An 8B model trained on data constructed with LogicReward surpasses GPT-4o and o4-mini by 11.6% and 2% on natural language inference and logical reasoning tasks.
arXiv Detail & Related papers (2025-12-20T03:43:02Z) - BRIDGE: Building Representations In Domain Guided Program Verification [67.36686119518441]
BRIDGE decomposes verification into three interconnected domains: Code, Specifications, and Proofs.<n>We show that this approach substantially improves both accuracy and efficiency beyond standard error feedback methods.
arXiv Detail & Related papers (2025-11-26T06:39:19Z) - Rosetta-PL: Propositional Logic as a Benchmark for Large Language Model Reasoning [4.071220436730322]
Large Language Models (LLMs) are primarily trained on high-resource natural languages.<n>This research introduces Rosetta-PL, a benchmark designed to evaluate LLMs' logical reasoning and generalization capabilities.
arXiv Detail & Related papers (2025-03-25T21:12:29Z) - DeepRTL: Bridging Verilog Understanding and Generation with a Unified Representation Model [13.532046953850902]
We present DeepRTL, a unified representation model that excels in both Verilog understanding and generation.<n>Based on CodeT5+, DeepRTL is fine-tuned on a comprehensive dataset that aligns Verilog code with rich, multi-level natural language descriptions.<n>We introduce the first benchmark for Verilog understanding and take the initiative to apply embedding similarity and GPT Score to evaluate the models' understanding capabilities.
arXiv Detail & Related papers (2025-02-20T11:07:55Z) - Transformer-based Language Models for Reasoning in the Description Logic ALCQ [2.8210912543324658]
We construct the natural language dataset, DELTA$_D$, using the expressive description logic language $mathcalALCQ$.
We investigate the logical reasoning capabilities of a supervised fine-tuned DeBERTa-based model and two large language models.
We show that the DeBERTa-based model fine-tuned on our dataset can master the entailment checking task.
arXiv Detail & Related papers (2024-10-12T18:25:34Z) - Learning Semantic Structure through First-Order-Logic Translation [4.005483185111992]
We show that language models sometimes confuse which predicates apply to which objects.
We finetune several large language models on synthetic datasets designed to gauge their generalization abilities.
The results show that FOL translation for LLMs is better suited to learn predicate argument structure.
arXiv Detail & Related papers (2024-10-04T07:39:34Z) - Reliable Reasoning Beyond Natural Language [0.047888359248129786]
Large Language models (LLMs) often exhibit limitations in their ability to reason reliably and flexibly.
We propose a neurosymbolic approach that prompts LLMs to extract and encode all relevant information from a problem statement as logical code statements.
We then use a logic programming language (Prolog) to conduct the iterative computations of explicit deductive reasoning.
arXiv Detail & Related papers (2024-07-16T04:34:18Z) - Improving Complex Reasoning over Knowledge Graph with Logic-Aware Curriculum Tuning [89.89857766491475]
We propose a curriculum-based logical-aware instruction tuning framework, named LACT.<n>Specifically, we augment the arbitrary first-order logical queries via binary tree decomposition.<n> Experiments across widely used datasets demonstrate that LACT has substantial improvements(brings an average +5.5% MRR score) over advanced methods, achieving the new state-of-the-art.
arXiv Detail & Related papers (2024-05-02T18:12:08Z) - LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models [63.14196038655506]
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs)
Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models.
We leverage these findings to construct targeted demonstration examples and fine-tune data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z) - LINC: A Neurosymbolic Approach for Logical Reasoning by Combining
Language Models with First-Order Logic Provers [60.009969929857704]
Logical reasoning is an important task for artificial intelligence with potential impacts on science, mathematics, and society.
In this work, we reformulating such tasks as modular neurosymbolic programming, which we call LINC.
We observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate.
arXiv Detail & Related papers (2023-10-23T17:58:40Z) - APOLLO: A Simple Approach for Adaptive Pretraining of Language Models
for Logical Reasoning [73.3035118224719]
We propose APOLLO, an adaptively pretrained language model that has improved logical reasoning abilities.
APOLLO performs comparably on ReClor and outperforms baselines on LogiQA.
arXiv Detail & Related papers (2022-12-19T07:40:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.