Related papers: Do LLMs Really Struggle at NL-FOL Translation? Revealing their Strengths via a Novel Benchmarking Strategy

Do LLMs Really Struggle at NL-FOL Translation? Revealing their Strengths via a Novel Benchmarking Strategy

URL: http://arxiv.org/abs/2511.11816v1
Date: Fri, 14 Nov 2025 19:11:41 GMT
Title: Do LLMs Really Struggle at NL-FOL Translation? Revealing their Strengths via a Novel Benchmarking Strategy
Authors: Andrea Brunello, Luca Geatti, Michele Mignani, Angelo Montanari, Nicola Saccomanno,
Abstract summary: First-Order Logic (FOL) is a powerful formalism for representing concepts expressed in natural language (NL)<n> converting NL to FOL (NL-FOL translation) has remained a longstanding challenge, for both humans and machines.
Score: 8.915674937865676
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Due to its expressiveness and unambiguous nature, First-Order Logic (FOL) is a powerful formalism for representing concepts expressed in natural language (NL). This is useful, e.g., for specifying and verifying desired system properties. While translating FOL into human-readable English is relatively straightforward, the inverse problem, converting NL to FOL (NL-FOL translation), has remained a longstanding challenge, for both humans and machines. Although the emergence of Large Language Models (LLMs) promised a breakthrough, recent literature provides contrasting results on their ability to perform NL-FOL translation. In this work, we provide a threefold contribution. First, we critically examine existing datasets and protocols for evaluating NL-FOL translation performance, revealing key limitations that may cause a misrepresentation of LLMs' actual capabilities. Second, to overcome these shortcomings, we propose a novel evaluation protocol explicitly designed to distinguish genuine semantic-level logical understanding from superficial pattern recognition, memorization, and dataset contamination. Third, using this new approach, we show that state-of-the-art, dialogue-oriented LLMs demonstrate strong NL-FOL translation skills and a genuine grasp of sentence-level logic, whereas embedding-centric models perform markedly worse.

Related papers

Towards Autoformalization of LLM-generated Outputs for Requirement Verification [0.6015898117103068]
Autoformalization, the process of translating informal statements into formal logic, has gained renewed interest with the emergence of powerful Large Language Models.<n>This paper explores the use of a simple LLM-based autoformalizer to verify LLM-generated outputs against a small set of natural language requirements.<n>Our findings, while limited, suggest that autoformalization holds significant potential for ensuring the fidelity and logical consistency of LLM-generated outputs.
arXiv Detail & Related papers (2025-11-14T19:45:17Z)
Refining Translations with LLMs: A Constraint-Aware Iterative Prompting Approach [7.5069214839655345]
Large language models (LLMs) have demonstrated remarkable proficiency in machine translation (MT) We propose a multi-step prompt chain that enhances translation faithfulness by prioritizing key terms crucial for semantic accuracy. Experiments using Llama and Qwen as base models on the FLORES-200 and WMT datasets demonstrate significant improvements over baselines.
arXiv Detail & Related papers (2024-11-13T05:40:24Z)
LLM-based Translation Inference with Iterative Bilingual Understanding [52.46978502902928]
We propose a novel Iterative Bilingual Understanding Translation method based on the cross-lingual capabilities of large language models (LLMs)<n>The cross-lingual capability of LLMs enables the generation of contextual understanding for both the source and target languages separately.<n>The proposed IBUT outperforms several strong comparison methods.
arXiv Detail & Related papers (2024-10-16T13:21:46Z)
Learning-From-Mistakes Prompting for Indigenous Language Translation [3.7790255156708397]
This paper presents techniques to improve extremely low-resourced indigenous language translations. Our approaches are grounded in the use of a datastore consisting of a limited number of parallel translation examples. We harness the potential of LLMs and in-context learning techniques in such a setting for using LLMs as universal translators.
arXiv Detail & Related papers (2024-07-18T09:41:20Z)
Self-Augmented In-Context Learning for Unsupervised Word Translation [23.495503962839337]
Large language models (LLMs) demonstrate strong word translation or bilingual lexicon induction (BLI) capabilities in few-shot setups. We propose self-augmented in-context learning (SAIL) for unsupervised BLI. Our method shows substantial gains over zero-shot prompting of LLMs on two established BLI benchmarks.
arXiv Detail & Related papers (2024-02-15T15:43:05Z)
Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention [53.896974148579346]
Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains. The enigmatic black-box'' nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications. We propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs.
arXiv Detail & Related papers (2023-12-22T19:55:58Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
Proto-lm: A Prototypical Network-Based Framework for Built-in Interpretability in Large Language Models [27.841725567976315]
Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP), but their lack of interpretability has been a major concern. In this work, we introduce proto-lm, a prototypical network-based white-box framework that allows LLMs to learn immediately interpretable embeddings. Our method's applicability and interpretability are demonstrated through experiments on a wide range of NLP tasks, and our results indicate a new possibility of creating interpretable models without sacrificing performance.
arXiv Detail & Related papers (2023-11-03T05:55:32Z)
Let Models Speak Ciphers: Multiagent Debate through Embeddings [84.20336971784495]
We introduce CIPHER (Communicative Inter-Model Protocol Through Embedding Representation) to address this issue. By deviating from natural language, CIPHER offers an advantage of encoding a broader spectrum of information without any modification to the model weights. This showcases the superiority and robustness of embeddings as an alternative "language" for communication among LLMs.
arXiv Detail & Related papers (2023-10-10T03:06:38Z)
Towards Effective Disambiguation for Machine Translation with Large Language Models [65.80775710657672]
We study the capabilities of large language models to translate "ambiguous sentences" Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
arXiv Detail & Related papers (2023-09-20T22:22:52Z)
Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools. Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions. Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.