Early-Exit and Instant Confidence Translation Quality Estimation
- URL: http://arxiv.org/abs/2502.14429v2
- Date: Mon, 07 Jul 2025 22:42:38 GMT
- Title: Early-Exit and Instant Confidence Translation Quality Estimation
- Authors: Vilém Zouhar, Maike Züfle, Beni Egressy, Julius Cheng, Mrinmaya Sachan, Jan Niehues
- Abstract summary: We tackle two connected challenges: (1) reducing the cost of quality estimation at scale, and (2) developing an inexpensive uncertainty estimation method for quality estimation. To address the latter, we introduce Instant Confidence COMET, an uncertainty-aware quality estimation model that matches the performance of previous approaches at a fraction of their costs. We extend this to Early-Exit COMET, a quality estimation model that can compute quality scores and associated confidences already at early model layers, allowing us to early-exit computations and reduce evaluation costs.
- Score: 46.13074343863971
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Quality estimation is omnipresent in machine translation, for both evaluation and generation. Unfortunately, quality estimation models are often opaque and computationally expensive, making them impractical to be part of large-scale pipelines. In this work, we tackle two connected challenges: (1) reducing the cost of quality estimation at scale, and (2) developing an inexpensive uncertainty estimation method for quality estimation. To address the latter, we introduce Instant Confidence COMET, an uncertainty-aware quality estimation model that matches the performance of previous approaches at a fraction of their costs. We extend this to Early-Exit COMET, a quality estimation model that can compute quality scores and associated confidences already at early model layers, allowing us to early-exit computations and reduce evaluation costs. We also apply our model to machine translation reranking. We combine Early-Exit COMET with an upper confidence bound bandit algorithm to find the best candidate from a large pool without having to run the full evaluation model on all candidates. In both cases (evaluation and reranking) our methods reduce the required compute by 50% with very little degradation in performance. Finally, we show how Instant Confidence COMET can be used to decide which translations a human evaluator should score rather than relying on the COMET score.
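The reranking idea in the abstract, pairing per-layer quality scores and confidences with an upper confidence bound bandit, can be sketched as follows. This is a generic UCB construction under stated assumptions, not the paper's exact algorithm: `score_layer` and `confidence` are hypothetical stand-ins for an early-exit QE model that returns a partial score and a confidence-interval width at each layer.

```python
def ucb_rerank(candidates, score_layer, confidence, n_layers, budget):
    """Pick the best translation candidate with a UCB bandit.

    score_layer(c, l) -> partial quality score of candidate c at layer l
    confidence(c, l)  -> confidence-interval width at layer l
    Candidates whose optimistic bound falls below the leader's pessimistic
    bound are pruned, so the full model never runs on every candidate.
    """
    alive = {c: 1 for c in candidates}  # next layer to evaluate, per candidate
    spent = 0
    while spent < budget and len(alive) > 1:
        # push the candidate with the highest optimistic bound one layer deeper
        best = max(alive, key=lambda c: score_layer(c, alive[c]) + confidence(c, alive[c]))
        alive[best] += 1
        spent += 1
        if alive[best] >= n_layers:
            break  # fully evaluated: its score is now exact
        # prune candidates whose upper bound is below the leader's lower bound
        lo = score_layer(best, alive[best]) - confidence(best, alive[best])
        alive = {c: l for c, l in alive.items()
                 if c == best or score_layer(c, l) + confidence(c, l) >= lo}
    return max(alive, key=lambda c: score_layer(c, alive[c]))
```

With a confidence width that shrinks as layers accumulate (e.g. 1/l), weak candidates are discarded after a few cheap layers while the eventual winner is evaluated more deeply.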
Related papers
- K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge [51.93484138861584]
The rapid development of visual generative models raises the need for more scalable and human-aligned evaluation methods. We propose K-Sort Eval, a reliable and efficient VLM-based evaluation framework that integrates posterior correction and dynamic matching. Experiments show that K-Sort Eval delivers evaluation results consistent with K-Sort Arena, typically requiring fewer than 90 model runs.
arXiv Detail & Related papers (2026-02-10T05:07:46Z) - Fault-Tolerant Evaluation for Sample-Efficient Model Performance Estimators [13.227055178509524]
We propose a fault-tolerant evaluation framework that integrates bias and variance considerations within an adjustable tolerance level. We show that proper calibration of $\varepsilon$ ensures reliable evaluation across different variance regimes. Experiments on real-world datasets demonstrate that our framework provides comprehensive and actionable insights into estimator behavior.
arXiv Detail & Related papers (2026-02-06T22:14:46Z) - Design and Evaluation of Cost-Aware PoQ for Decentralized LLM Inference [4.254924788681319]
This paper introduces a cost-aware Proof of Quality (PoQ) framework for decentralized large language model (LLM) inference. The design combines ground-truth token-level F1, lightweight learned evaluators, and GPT-based judgments within a unified evaluation pipeline. Monte Carlo simulations over 5,000 PoQ rounds demonstrate that the cost-aware reward scheme consistently assigns higher average rewards to high-quality, low-cost inference models.
arXiv Detail & Related papers (2025-12-18T08:57:17Z) - Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas [31.16720541398267]
We propose a doubly-robust estimation framework designed to address evaluation sampling bias. Key to our approach is the use of "persona" ratings produced by prompting an evaluator to behave as a human rater. We show that our approach yields valid system quality estimates when either (i) a model trained to predict human ratings using persona ratings and source data observed under sampling bias, or (ii) a reweighting model that corrects for sampling bias is of sufficient quality.
arXiv Detail & Related papers (2025-09-26T21:42:51Z) - Cost-Optimal Active AI Model Evaluation [71.2069549142394]
Development of generative AI systems requires continual evaluation, data acquisition, and annotation. We develop novel, cost-aware methods for actively balancing the use of a cheap, but often inaccurate, weak rater against a more expensive, accurate strong rater. We derive a family of cost-optimal policies for allocating a given annotation budget between weak and strong raters.
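The weak/strong-rater idea admits a simple generic instance, a difference estimator: cheap weak scores on every item, expensive strong scores on a small subset, with the subset used to correct the weak rater's bias. This is an illustrative construction, not necessarily the paper's cost-optimal policy.

```python
def weak_strong_estimate(weak_all, strong_subset, weak_subset):
    """Estimate the mean strong-rater score for a dataset.

    weak_all:      weak-rater scores for every item (cheap)
    strong_subset: strong-rater scores for a sampled subset (expensive)
    weak_subset:   weak-rater scores for the same subset
    The subset estimates the weak rater's bias, which corrects
    the full-dataset weak mean.
    """
    bias = sum(s - w for s, w in zip(strong_subset, weak_subset)) / len(strong_subset)
    return sum(weak_all) / len(weak_all) + bias
```

If the weak rater systematically under-scores by one point, two strong annotations are enough to recover the corrected mean over the full set.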
arXiv Detail & Related papers (2025-06-09T17:14:41Z) - T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation [60.620408007636016]
We propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores. Our approach integrates Group Relative Policy Optimization into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains.
arXiv Detail & Related papers (2025-05-23T13:44:59Z) - Predicting Bad Goods Risk Scores with ARIMA Time Series: A Novel Risk Assessment Approach [0.0]
This research presents a novel framework that integrates time-series ARIMA models with a proprietary formula that calculates bad-goods risk scores from the time-series forecasts.
Experimental results, validated on a dataset spanning 2022-2024 for Organic Beer-G 1 Liter, demonstrate that the proposed method outperforms traditional statistical models.
arXiv Detail & Related papers (2025-02-23T09:52:11Z) - A Probabilistic Perspective on Unlearning and Alignment for Large Language Models [48.96686419141881]
We introduce the first formal probabilistic evaluation framework for Large Language Models (LLMs). Namely, we propose novel metrics with high-probability guarantees concerning the output distribution of a model. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment.
arXiv Detail & Related papers (2024-10-04T15:44:23Z) - Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation [5.653106385738822]
Polyrating is an expressive and flexible rating system based on maximum a posteriori estimation. It can detect and quantify biases affecting human preferences, ensuring fairer model comparisons. It can reduce the cost of human evaluations by up to 41% for new models and up to 77% for new tasks.
arXiv Detail & Related papers (2024-09-01T11:24:54Z) - Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition [70.60872754129832]
The first NeurIPS competition on unlearning sought to stimulate the development of novel algorithms.
Nearly 1,200 teams from across the world participated.
We analyze top solutions and delve into discussions on benchmarking unlearning.
arXiv Detail & Related papers (2024-06-13T12:58:00Z) - Quality Estimation with $k$-nearest Neighbors and Automatic Evaluation for Model-specific Quality Estimation [14.405862891194344]
We propose a model-specific, unsupervised QE approach, termed $k$NN-QE, that extracts information from the MT model's training data using $k$-nearest neighbors.
Measuring the performance of model-specific QE is not straightforward, since they provide quality scores on their own MT output.
We propose an automatic evaluation method that uses quality scores from reference-based metrics as gold standard instead of human-generated ones.
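A minimal sketch of the $k$NN idea, under the assumption that per-token decoder states are available: hypothesis tokens whose states lie far from states drawn from the MT model's own training data are treated as a low-quality signal. Function and array names are illustrative, not the paper's API.

```python
import numpy as np

def knn_qe_score(query_states, datastore, k=4):
    """Hypothetical kNN-based quality signal.

    query_states: (T, d) decoder states for the hypothesis tokens
    datastore:    (N, d) states collected from the model's training data
    Returns the negative mean distance to each token's k nearest
    neighbors, so higher means closer to training data (better).
    """
    # pairwise Euclidean distances between hypothesis states and datastore
    dists = np.linalg.norm(query_states[:, None, :] - datastore[None, :, :], axis=-1)
    knn = np.sort(dists, axis=1)[:, :k]  # k smallest distances per token
    return -knn.mean()
```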
arXiv Detail & Related papers (2024-04-27T23:52:51Z) - Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model [75.66013048128302]
In this work, we investigate the potential of employing the QE model as the reward model to predict human preferences for feedback training.
We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines.
To address the problem, we adopt a simple yet effective method that uses rules to detect incorrect translations and assigns a penalty term to their reward scores.
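The rule-plus-penalty idea can be sketched generically; the rules below (empty output, runaway length, heavy repetition) are illustrative stand-ins, not the paper's actual detectors.

```python
def penalized_reward(source, hypothesis, qe_score, penalty=1.0):
    """Subtract a penalty from the QE reward when a hypothesis trips
    any rule, so the policy cannot exploit reward overestimation on
    obviously broken translations. Rules here are illustrative."""
    src_len = max(1, len(source.split()))
    hyp_tokens = hypothesis.split()
    rules = [
        len(hypothesis.strip()) == 0,                 # empty output
        len(hyp_tokens) > 4 * src_len,                # runaway length
        len(set(hyp_tokens)) <= len(hyp_tokens) // 3, # heavy repetition
    ]
    return qe_score - penalty if any(rules) else qe_score
```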
arXiv Detail & Related papers (2024-01-23T16:07:43Z) - Uncertainty-aware No-Reference Point Cloud Quality Assessment [25.543217625958462]
This work presents the first probabilistic architecture for no-reference point cloud quality assessment (PCQA).
The proposed method can model the stochasticity of subjects' quality judgments through a tailored conditional variational autoencoder (CVAE).
Experiments indicate that our approach outperforms previous cutting-edge methods by a large margin and exhibits robustness in cross-dataset experiments.
arXiv Detail & Related papers (2024-01-17T02:25:42Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z) - Quality-Aware Translation Models: Efficient Generation and Quality Estimation in a Single Model [77.19693792957614]
We propose to make neural machine translation (NMT) models quality-aware by training them to estimate the quality of their own output.
We obtain quality gains similar or even superior to quality reranking approaches, but with the efficiency of single pass decoding.
arXiv Detail & Related papers (2023-10-10T15:33:51Z) - Improving Out-of-Distribution Detection via Epistemic Uncertainty Adversarial Training [29.4569172720654]
We develop a simple adversarial training scheme that incorporates an attack of the uncertainty predicted by the dropout ensemble.
We demonstrate this method improves OOD detection performance on standard data (i.e., not adversarially crafted), and improves the standardized partial AUC from near-random guessing performance to $\geq 0.75$.
arXiv Detail & Related papers (2022-09-05T14:32:19Z) - Balancing Cost and Quality: An Exploration of Human-in-the-loop Frameworks for Automated Short Answer Scoring [36.58449231222223]
Short answer scoring (SAS) is the task of grading short text written by a learner.
We present the first study exploring the use of a human-in-the-loop framework for minimizing the grading cost.
We find that our human-in-the-loop framework allows automatic scoring models and human graders to achieve the target scoring quality.
arXiv Detail & Related papers (2022-06-16T16:43:18Z) - RobustBench: a standardized adversarial robustness benchmark [84.50044645539305]
A key challenge in benchmarking robustness is that its evaluation is often error-prone, leading to overestimation of robustness.
We evaluate adversarial robustness with AutoAttack, an ensemble of white- and black-box attacks.
We analyze the impact of robustness on the performance on distribution shifts, calibration, out-of-distribution detection, fairness, privacy leakage, smoothness, and transferability.
arXiv Detail & Related papers (2020-10-19T17:06:18Z) - Inducing Predictive Uncertainty Estimation for Face Recognition [102.58180557181643]
We propose a method for generating image quality training data automatically from 'mated pairs' of face images.
We use the generated data to train a lightweight Predictive Confidence Network, termed as PCNet, for estimating the confidence score of a face image.
arXiv Detail & Related papers (2020-09-01T17:52:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.