Hacking Neural Evaluation Metrics with Single Hub Text
- URL: http://arxiv.org/abs/2512.16323v1
- Date: Thu, 18 Dec 2025 09:06:24 GMT
- Title: Hacking Neural Evaluation Metrics with Single Hub Text
- Authors: Hiroyuki Deguchi, Katsuki Chousa, Yusuke Sakai,
- Abstract summary: We propose a method for finding a single adversarial text in the discrete space that is consistently evaluated as high-quality. The method achieves 79.1 COMET% and 67.8 COMET% in the WMT'24 English-to-Japanese (En--Ja) and English-to-German (En--De) translation tasks, respectively. We also confirmed that the hub text found with our method generalizes across multiple language pairs such as Ja--En and De--En.
- Score: 6.572810068286891
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Strongly human-correlated evaluation metrics serve as an essential compass for the development and improvement of generation models and must be highly reliable and robust. Recent embedding-based neural text evaluation metrics, such as COMET for translation tasks, are widely used in both research and development fields. However, there is no guarantee that they yield reliable evaluation results due to the black-box nature of neural networks. To raise concerns about the reliability and safety of such metrics, we propose a method for finding a single adversarial text in the discrete space that is consistently evaluated as high-quality, regardless of the test cases, to identify the vulnerabilities in evaluation metrics. The single hub text found with our method achieved 79.1 COMET% and 67.8 COMET% in the WMT'24 English-to-Japanese (En--Ja) and English-to-German (En--De) translation tasks, respectively, outperforming translations generated individually for each source sentence by using M2M100, a general translation model. Furthermore, we also confirmed that the hub text found with our method generalizes across multiple language pairs such as Ja--En and De--En.
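To make the threat model concrete, here is a minimal sketch of this kind of hub-text search: a single candidate text is scored with COMET across every test case and greedily edited in discrete space to raise its average score. It assumes the open-source `unbabel-comet` package; the random single-token insertion proposals and the `greedy_hub_search` routine are illustrative simplifications, not the authors' actual optimization procedure.

```python
# Minimal hub-text search sketch against a COMET-style metric.
# Requires: pip install unbabel-comet
import random

from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

def avg_score(hub: str, sources: list[str], references: list[str]) -> float:
    """Average COMET score of one fixed candidate across all test cases."""
    data = [{"src": s, "mt": hub, "ref": r} for s, r in zip(sources, references)]
    return model.predict(data, batch_size=16, gpus=0).system_score

def greedy_hub_search(sources, references, vocab, n_iters=200, seed=0):
    """Hill-climb a single text that scores well regardless of the test case."""
    rng = random.Random(seed)
    hub = rng.choice(vocab)                      # start from a random token
    best = avg_score(hub, sources, references)
    for _ in range(n_iters):
        # Propose a random single-token insertion in discrete space.
        tokens = hub.split()
        pos = rng.randrange(len(tokens) + 1)
        cand = " ".join(tokens[:pos] + [rng.choice(vocab)] + tokens[pos:])
        score = avg_score(cand, sources, references)
        if score > best:                         # keep edits that raise the mean
            hub, best = cand, score
    return hub, best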
Related papers
- Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation [57.11989521509119]
We propose RATE, a novel agentic translation evaluation framework centered on a reflective Core Agent that invokes specialized sub-agents. Experimental results indicate the efficacy of RATE, achieving an improvement of at least 3.2 in meta score over current metrics.
arXiv Detail & Related papers (2026-01-12T09:03:42Z)
- How to Evaluate Speech Translation with Source-Aware Neural MT Metrics [32.41110835446445]
In machine translation, neural metrics incorporating the source text achieve stronger correlation with human judgments. In this work, we conduct the first systematic study of source-aware metrics for speech-to-text translation. We introduce a novel two-step cross-lingual re-segmentation algorithm to address the alignment mismatch between synthetic sources and reference translations.
arXiv Detail & Related papers (2025-11-05T08:49:22Z)
- Revisiting Metric Reliability for Fine-grained Evaluation of Machine Translation and Summarization in Indian Languages [13.098470937627871]
ITEM systematically evaluates the alignment of 26 automatic metrics with human judgments across six major Indian languages. Findings offer critical guidance for advancing metric design and evaluation in Indian languages.
arXiv Detail & Related papers (2025-10-08T14:27:02Z)
- Languages Still Left Behind: Toward a Better Multilingual Machine Translation Benchmark [11.068031181100276]
We study data in four languages (Asante Twi, Japanese, Jinghpaw, and South Azerbaijani). We uncover critical shortcomings in the benchmark's suitability for truly multilingual evaluation. We advocate for multilingual MT benchmarks that use domain-general and culturally neutral source texts.
arXiv Detail & Related papers (2025-08-28T07:52:42Z)
- BiVert: Bidirectional Vocabulary Evaluation using Relations for Machine Translation [4.651581292181871]
We propose a bidirectional semantic-based evaluation method designed to assess the sense distance of the translation from the source text.
This approach employs the comprehensive multilingual encyclopedic dictionary BabelNet.
Factual analysis shows a strong correlation between the average evaluation scores generated by our method and human assessments across various machine translation systems for the English-German language pair.
arXiv Detail & Related papers (2024-03-06T08:02:21Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm (a sketch of the standard MRT objective follows this entry).
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
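For context, the minimum risk training setup that such analyses build on optimizes the expected metric loss over a sampled candidate set; a universal adversarial translation is one that keeps this risk low regardless of the input. Below is the standard MRT objective following Shen et al. (2016), in our own notation; when a learned metric such as BLEURT supplies the reward, the loss can be taken as Δ(y, y^ref) = 1 − metric(y, y^ref).

```latex
% Standard MRT risk (Shen et al., 2016). S(x) is a sampled candidate set,
% Q a sharpened model distribution with temperature \alpha, and
% \Delta(y, y^{ref}) the metric-derived loss, e.g. 1 - BLEURT(y, y^{ref}).
\mathcal{R}(\theta) = \sum_{(x,\, y^{\mathrm{ref}})}
  \sum_{y \in S(x)} Q(y \mid x; \theta, \alpha)\, \Delta(y, y^{\mathrm{ref}}),
\qquad
Q(y \mid x; \theta, \alpha) =
  \frac{P(y \mid x; \theta)^{\alpha}}
       {\sum_{y' \in S(x)} P(y' \mid x; \theta)^{\alpha}}
```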
- Competency-Aware Neural Machine Translation: Can Machine Translation Know its Own Translation Quality? [61.866103154161884]
Neural machine translation (NMT) is often criticized for failures that occur without the model being aware of them.
We propose a novel competency-aware NMT by extending conventional NMT with a self-estimator.
We show that the proposed method delivers outstanding performance on quality estimation.
arXiv Detail & Related papers (2022-11-25T02:39:41Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- Understanding and Mitigating the Uncertainty in Zero-Shot Translation [92.25357943169601]
We aim to understand and alleviate the off-target issues from the perspective of uncertainty in zero-shot translation.
We propose two lightweight and complementary approaches to denoise the training data for model training.
Our approaches significantly improve the performance of zero-shot translation over strong MNMT baselines.
arXiv Detail & Related papers (2022-05-20T10:29:46Z)
- NMTScore: A Multilingual Analysis of Translation-based Text Similarity Measures [42.46681912294797]
We analyze translation-based similarity measures in the common framework of multilingual NMT.
Compared to baselines such as sentence embeddings, translation-based measures prove competitive in paraphrase identification.
The measures show a relatively high correlation with human judgments.
arXiv Detail & Related papers (2022-04-28T17:57:17Z)
- Measuring Uncertainty in Translation Quality Evaluation (TQE) [62.997667081978825]
This work estimates confidence intervals (Brown et al., 2001) for translation quality evaluation as a function of the sample size of the translated text. The methodology applied is Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA); a minimal Monte Carlo sketch follows this entry.
arXiv Detail & Related papers (2021-11-15T12:09:08Z)
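As a rough illustration of the Bernoulli-plus-Monte-Carlo idea, the sketch below treats each evaluated segment as a Bernoulli pass/fail trial and resamples to show how the confidence interval tightens with sample size. The function name and the 0.8 pass rate are illustrative; this is not the paper's exact BSDM/MCSA procedure.

```python
# Sketch: confidence interval width for a Bernoulli quality rate as a
# function of sample size, via Monte Carlo resampling. Illustrative only.
import numpy as np

def mc_interval(p: float, n: int, n_sims: int = 10_000, level: float = 0.95,
                seed: int = 0) -> tuple[float, float]:
    """Monte Carlo interval for the observed pass rate of n Bernoulli(p) trials."""
    rng = np.random.default_rng(seed)
    rates = rng.binomial(n, p, size=n_sims) / n   # simulated sample pass rates
    lo, hi = np.quantile(rates, [(1 - level) / 2, 1 - (1 - level) / 2])
    return float(lo), float(hi)

# The interval tightens as the evaluated sample grows.
for n in (30, 100, 1000):
    lo, hi = mc_interval(p=0.8, n=n)
    print(f"n={n:4d}: [{lo:.3f}, {hi:.3f}]")
```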
- TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing [73.16475763422446]
We propose TextFlint, a multilingual robustness evaluation platform for NLP tasks.
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address shortcomings in a model's robustness (a generic transformation sketch follows this list).
arXiv Detail & Related papers (2021-03-21T17:20:38Z)
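To illustrate the kind of universal text transformation such robustness toolkits apply, here is a generic sketch (deliberately not TextFlint's actual API): perturb every input with a simple surface transformation, then compare task accuracy before and after. All names here are illustrative.

```python
# Generic robustness-check sketch: apply a universal surface transformation
# to each input and compare a model's accuracy before and after.
import random
from typing import Callable, Iterable

def swap_adjacent_chars(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Universal transformation: randomly swap adjacent letters inside words."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def accuracy(predict: Callable[[str], str],
             pairs: Iterable[tuple[str, str]]) -> float:
    pairs = list(pairs)
    return sum(predict(x) == y for x, y in pairs) / len(pairs)

def robustness_report(predict, pairs):
    """Compare clean vs. perturbed accuracy; the drop measures fragility."""
    clean = accuracy(predict, pairs)
    perturbed = accuracy(predict, [(swap_adjacent_chars(x), y) for x, y in pairs])
    return {"clean": clean, "perturbed": perturbed, "drop": clean - perturbed}
```

A real toolkit layers many such transformations, task-specific variants, and adversarial attacks; the point here is only the perturb-and-re-evaluate loop.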