Related papers: Assessing the Sensitivity and Alignment of FOL Closeness Metrics

Assessing the Sensitivity and Alignment of FOL Closeness Metrics

URL: http://arxiv.org/abs/2501.08613v3
Date: Fri, 05 Sep 2025 06:35:41 GMT
Title: Assessing the Sensitivity and Alignment of FOL Closeness Metrics
Authors: Ramya Keerthy Thatikonda, Wray Buntine, Ehsan Shareghi,
Abstract summary: We study the sensitivity of existing NL-, FOL-, and graph-based metrics to capture differences between a sampled FOL and its corresponding ground-truth.<n>We show that combining metrics enhances both robustness and sensitivity compared to using individual metrics.
Score: 10.795521518273214
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The recent successful paradigm of solving logical reasoning problems with tool-augmented large language models (LLMs) leverages translation of natural language (NL) statements into First-Order Logic~(FOL) and external theorem provers. However, the correctness of FOL statements, comprising operators and text, often go unverified due to the lack of a reliable evaluation metric for comparing generated and ground-truth FOLs. In this paper, we conduct a comprehensive study on the sensitivity of existing NL-, FOL-, and graph-based metrics to capture differences between a sampled FOL and its corresponding ground-truth. We then measure the alignment between a metric-based ranking of FOL outputs and a strong LLM as-a-judge. To do this, we first apply operator and text-based perturbations to ground-truth FOL statements to assess metric sensitivity. We then evaluate metric robustness by comparing the metrics against LLMs judgment. Our empirical findings highlight a clear oversensitivity in the n-gram metric BLEU for text perturbations. The operator perturbation affects the semantic graph metric Smatch++ for structural changes, and the FOL metric for specific operator changes. We observe a closer alignment between BertScore and LLM judgement, proving the importance of semantic evaluation. Additionally, we show that combining metrics enhances both robustness and sensitivity compared to using individual metrics.

Related papers

Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation [57.11989521509119]
We propose a novel agentic translation evaluation framework, centered by a reflective Core Agent that invokes specialized sub-agents.<n> Experimental results indicate the efficacy of RATE, achieving an improvement of at least 3.2 meta score compared with current metrics.
arXiv Detail & Related papers (2026-01-12T09:03:42Z)
A Critical Study of Automatic Evaluation in Sign Language Translation [17.083206782232185]
It remains unclear to what extent text-based metrics can reliably capture the quality of sign language translation (SLT) outputs.<n>We analyze six metrics, including BLEU, chrF, and ROUGE, as well as BLEURT on the one hand, and large language model (LLM)-based evaluators such as G-Eval and GEMBA zero-shot direct assessment on the other.
arXiv Detail & Related papers (2025-10-29T11:57:03Z)
All Claims Are Equal, but Some Claims Are More Equal Than Others: Importance-Sensitive Factuality Evaluation of LLM Generations [57.8036236269546]
Existing methods for evaluating the factuality of large language model (LLM) responses treat all claims as equally important.<n>This results in misleading evaluations when vital information is missing or incorrect as it receives the same weight as peripheral details.<n>We introduce VITAL, a set of metrics that provide greater sensitivity in measuring the factuality of responses by incorporating the relevance and importance of claims with respect to the query.
arXiv Detail & Related papers (2025-10-08T14:40:33Z)
Measuring how changes in code readability attributes affect code quality evaluation by Large Language Models [2.3204178451683264]
Code readability is one of the main aspects of code quality, influenced by various properties like identifier names, comments, code structure, and adherence to standards.<n>This paper explores using Large Language Models (LLMs) to evaluate code quality attributes related to its readability in a standardized, reproducible, and consistent manner.
arXiv Detail & Related papers (2025-07-05T11:08:03Z)
What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering [12.950770409452035]
We introduce two metrics for classification tasks, namely sensitivity and consistency.<n> sensitivity measures changes of predictions across rephrasings of the prompt.<n>Instead, consistency measures how predictions vary across rephrasings for elements of the same class.
arXiv Detail & Related papers (2024-06-18T06:59:24Z)
Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs. This dataset aims to discover whether metrics can identify 68 translation accuracy errors. We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks. Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations. We propose to utilize textitmultiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems. We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore. In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics. We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores. Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
DEMETR: Diagnosing Evaluation Metrics for Translation [21.25704103403547]
We release DEMETR, a diagnostic dataset with 31K English examples. We find that learned metrics perform substantially better than string-based metrics on DEMETR.
arXiv Detail & Related papers (2022-10-25T03:25:44Z)
Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation. We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings. SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
On the Intrinsic and Extrinsic Fairness Evaluation Metrics for Contextualized Language Representations [74.70957445600936]
Multiple metrics have been introduced to measure fairness in various natural language processing tasks. These metrics can be roughly categorized into two categories: 1) emphextrinsic metrics for evaluating fairness in downstream applications and 2) emphintrinsic metrics for estimating fairness in upstream language representation models.
arXiv Detail & Related papers (2022-03-25T22:17:43Z)
BLEU, METEOR, BERTScore: Evaluation of Metrics Performance in Assessing Critical Translation Errors in Sentiment-oriented Text [1.4213973379473654]
Machine Translation (MT) of the online content is commonly used to process posts written in several languages. In this paper, we assess the ability of automatic quality metrics to detect critical machine translation errors. We conclude that there is a need for fine-tuning of automatic metrics to make them more robust in detecting sentiment critical errors.
arXiv Detail & Related papers (2021-09-29T07:51:17Z)
GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics. Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation. It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity. This paper concerns ourselves with reference-free machine translation (MT) evaluation where we directly compare source texts to (sometimes low-quality) system translations. We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER. We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.