Related papers: WRDScore: New Metric for Evaluation of Natural Language Generation Models

WRDScore: New Metric for Evaluation of Natural Language Generation Models

URL: http://arxiv.org/abs/2405.19220v3
Date: Tue, 25 Jun 2024 10:41:43 GMT
Title: WRDScore: New Metric for Evaluation of Natural Language Generation Models
Authors: Ravil Mussabayev,
Abstract summary: We propose a new metric that measures precision and recall without resorting to any assumptions. Measuring the direct overlap between the predicted and reference sequences will not be able to capture these subtleties.
Score: 0.0
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The problem of natural language generation, and, more specifically, method name prediction, faces significant difficulties when proposed models need to be evaluated on test data. Such a metric would need to consider the versatility with which a single method can be named, with respect to both semantics and syntax. Measuring the direct overlap between the predicted and reference (true) sequences will not be able to capture these subtleties. Other existing embedding based metrics either do not measure precision and recall or impose strict unrealistic assumptions on both sequences. To address these issues, we propose a new metric that, on the one hand, is very simple and lightweight, and, on the other hand, is able to calculate precision and recall without resorting to any assumptions while obtaining good performance with respect to the human judgement.

Related papers

AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking [25.459771464139855]
Listwise reranking with large language models (LLMs) enhances top-ranked results in retrieval-based applications.<n>We propose AcuRank, an adaptive reranking framework that dynamically adjusts both the amount and target of computation based on uncertainty estimates over document relevance.<n>Results on the TREC-DL and BEIR benchmarks show that our method consistently achieves a superior accuracy-efficiency trade-off and scales better with compute than fixed-computation baselines.
arXiv Detail & Related papers (2025-05-24T05:15:49Z)
On the Robust Approximation of ASR Metrics [30.524282767961463]
We propose a novel approach for approximating ASR performance metrics, eliminating the need for ground truth labels. Our method utilizes multimodal embeddings in a unified space for speech and transcription representations, combined with a high-quality proxy model to compute proxy metrics. Our results show that we approximate the metrics within a single-digit absolute difference across all experimental configurations, outperforming the most recent baseline by more than 50%.
arXiv Detail & Related papers (2025-02-18T01:10:17Z)
Semiparametric conformal prediction [79.6147286161434]
Risk-sensitive applications require well-calibrated prediction sets over multiple, potentially correlated target variables. We treat the scores as random vectors and aim to construct the prediction set accounting for their joint correlation structure. We report desired coverage and competitive efficiency on a range of real-world regression problems.
arXiv Detail & Related papers (2024-11-04T14:29:02Z)
Predicting generalization performance with correctness discriminators [64.00420578048855]
We present a novel model that establishes upper and lower bounds on the accuracy, without requiring gold labels for the unseen data. We show across a variety of tagging, parsing, and semantic parsing tasks that the gold accuracy is reliably between the predicted upper and lower bounds.
arXiv Detail & Related papers (2023-11-15T22:43:42Z)
Robust Ordinal Regression for Subsets Comparisons with Interactions [2.6151761714896122]
This paper is dedicated to a robust ordinal method for learning the preferences of a decision maker between subsets. The decision model, derived from Fishburn and LaValle, is general enough to be compatible with any strict weak order on subsets. A predicted preference is considered reliable if all the simplest models (Occam's razor) explaining the preference data agree on it.
arXiv Detail & Related papers (2023-08-07T07:54:33Z)
Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks. Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations. We propose to utilize textitmultiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
Stability-Adjusted Cross-Validation for Sparse Linear Regression [5.156484100374059]
Cross-validation techniques like k-fold cross-validation substantially increase the computational cost of sparse regression. We propose selecting hyper parameters that minimize a weighted sum of a cross-validation metric and a model's output stability. Our confidence adjustment procedure reduces test set error by 2%, on average, on 13 real-world datasets.
arXiv Detail & Related papers (2023-06-26T17:02:45Z)
Joint Metrics Matter: A Better Standard for Trajectory Forecasting [67.1375677218281]
Multi-modal trajectory forecasting methods evaluate using single-agent metrics (marginal metrics) Only focusing on marginal metrics can lead to unnatural predictions, such as colliding trajectories or diverging trajectories for people who are clearly walking together as a group. We present the first comprehensive evaluation of state-of-the-art trajectory forecasting methods with respect to multi-agent metrics (joint metrics): JADE, JFDE, and collision rate.
arXiv Detail & Related papers (2023-05-10T16:27:55Z)
Partial sequence labeling with structured Gaussian Processes [8.239028141030621]
We propose structured Gaussian Processes for partial sequence labeling. It encodes uncertainty in the prediction and does not need extra effort for model selection and hyper parameter learning. It is evaluated on several sequence labeling tasks and the experimental results show the effectiveness of the proposed model.
arXiv Detail & Related papers (2022-09-20T00:56:49Z)
SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations. We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences. Our results show that system-level correlations of our proposed metric with a model-based matching function outperforms all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
Local Evaluation of Time Series Anomaly Detection Algorithms [9.717823994163277]
We show that an adversary algorithm can reach high precision and recall on almost any dataset under weak assumption. We propose a theoretically grounded, robust, parameter-free and interpretable extension to precision/recall metrics.
arXiv Detail & Related papers (2022-06-27T10:18:41Z)
TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks. Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations. Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
Tail-to-Tail Non-Autoregressive Sequence Prediction for Chinese Grammatical Error Correction [49.25830718574892]
We present a new framework named Tail-to-Tail (textbfTtT) non-autoregressive sequence prediction. Considering that most tokens are correct and can be conveyed directly from source to target, and the error positions can be estimated and corrected. Experimental results on standard datasets, especially on the variable-length datasets, demonstrate the effectiveness of TtT in terms of sentence-level Accuracy, Precision, Recall, and F1-Measure.
arXiv Detail & Related papers (2021-06-03T05:56:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.