Rethinking the Evaluation of Neural Machine Translation
- URL: http://arxiv.org/abs/2106.15217v1
- Date: Tue, 29 Jun 2021 09:59:50 GMT
- Title: Rethinking the Evaluation of Neural Machine Translation
- Authors: Jianhao Yan, Chenming Wu, Fandong Meng, Jie Zhou
- Abstract summary: We propose a novel evaluation protocol that avoids the effect of search errors and provides a system-level evaluation from the perspective of model ranking.
Our method is based on our newly proposed exact top-$k$ decoding instead of beam search.
- Score: 25.036685025571927
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The evaluation of neural machine translation systems is usually built upon
translations generated by a particular decoding method (e.g., beam search) and scored
by evaluation metrics over those translations (e.g., BLEU). However, this evaluation
framework suffers from the high search errors introduced by heuristic search algorithms
and is limited by its nature of evaluating only a single best candidate. In this paper,
we propose a novel evaluation protocol that not only avoids the effect of search errors
but also provides a system-level evaluation from the perspective of model ranking. In
particular, our method is based on our newly proposed exact top-$k$ decoding instead of
beam search. Our approach evaluates model errors by the distance between the candidate
spaces as scored by the references and by the model, respectively. Extensive experiments
on WMT'14 English-German demonstrate that poor ranking ability is connected to the
well-known beam search curse, and that state-of-the-art Transformer models face serious
ranking errors. By evaluating various model architectures and techniques, we provide
several interesting findings. Finally, to effectively approximate the exact search
algorithm at the same time cost as the original beam search, we present a
minimum-heap-augmented beam search algorithm.
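The core of the proposed protocol is comparing how the model ranks a shared candidate set against how a reference-based metric ranks it. Below is a minimal sketch of that ranking-distance idea in Python, using Kendall's tau as the rank distance; the paper's exact top-$k$ decoder and exact distance measure are not reproduced here, and model_score and metric_score are illustrative placeholders.

```python
from scipy.stats import kendalltau

def ranking_error(candidates, model_score, metric_score, reference):
    """Compare the model's ranking of a candidate set against a
    reference-based ranking; returns 1 - Kendall's tau, so 0.0 means
    perfect rank agreement and larger values mean worse ranking."""
    model_scores = [model_score(c) for c in candidates]               # model log-probabilities
    metric_scores = [metric_score(c, reference) for c in candidates]  # e.g., sentence BLEU
    tau, _ = kendalltau(model_scores, metric_scores)
    return 1.0 - tau
```

In the paper's setting the candidate set comes from exact top-$k$ decoding, so the comparison reflects model errors alone; with beam search, search errors would contaminate the measurement.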
Related papers
- xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection [21.116517555282314]
xCOMET is an open-source learned metric designed to bridge the gap between sentence-level scoring and error-span detection approaches to machine translation evaluation.
It integrates both sentence-level evaluation and error span detection capabilities, exhibiting state-of-the-art performance across all types of evaluation.
We also provide a robustness analysis with stress tests, and show that xCOMET is largely capable of identifying localized critical errors and hallucinations.
arXiv Detail & Related papers (2023-10-16T15:03:14Z)
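For context, learned metrics such as xCOMET are typically driven through the unbabel-comet Python package. A hedged usage sketch follows; the checkpoint name and predict call mirror the project's documented API, but versions change, so verify against the current docs (the checkpoint is gated and may require accepting a license on Hugging Face).

```python
from comet import download_model, load_from_checkpoint

# Fetch and load the xCOMET checkpoint (gated; needs Hugging Face access).
model_path = download_model("Unbabel/XCOMET-XL")
model = load_from_checkpoint(model_path)

data = [{"src": "Der Hund bellt.", "mt": "The dog barks.", "ref": "The dog is barking."}]
output = model.predict(data, batch_size=8, gpus=1)  # set gpus=0 for CPU
print(output.scores)  # segment-level quality scores; error spans are reported alongside
```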
- Rank-DETR for High Quality Object Detection [52.82810762221516]
A highly performant object detector requires accurate ranking of its bounding box predictions.
In this work, we introduce a simple and highly performant DETR-based object detector by proposing a series of rank-oriented designs.
arXiv Detail & Related papers (2023-10-13T04:48:32Z)
- MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types [68.76742370525234]
We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by the recent efforts in several NLP tasks for fine-grained evaluation, we introduce a set of 13 mismatch error types.
We show that mismatch errors between sentence pairs on held-out datasets from the 7 NLP tasks align well with human evaluation.
arXiv Detail & Related papers (2023-06-18T01:38:53Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
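The stress-test recipe above is easy to reproduce in miniature: inject a synthetic error into a hypothesis and check whether the metric's score drops commensurately. A sketch using sentence-level BLEU from sacrebleu as a stand-in metric (the paper probes model-based metrics, and the perturbation below is illustrative):

```python
from sacrebleu.metrics import BLEU

bleu = BLEU(effective_order=True)  # avoids zero scores on short sentences

ref = "The committee approved the budget on Tuesday."
hyp = "The committee approved the budget on Tuesday."
corrupted = "The committee rejected the budget on Tuesday."  # synthetic meaning-flipping error

clean = bleu.sentence_score(hyp, [ref]).score
broken = bleu.sentence_score(corrupted, [ref]).score
# A robust metric should penalize the corruption; a negligible drop signals a blind spot.
print(f"clean={clean:.1f}  corrupted={broken:.1f}  drop={clean - broken:.1f}")
```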
- Quality-Aware Decoding for Neural Machine Translation [64.24934199944875]
We propose quality-aware decoding for neural machine translation (NMT).
We leverage recent breakthroughs in reference-free and reference-based MT evaluation through various inference methods.
We find that quality-aware decoding consistently outperforms MAP-based decoding according to both state-of-the-art automatic metrics and human assessments.
arXiv Detail & Related papers (2022-05-02T15:26:28Z)
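One common instantiation of quality-aware decoding is N-best reranking: generate candidates with the usual decoder, then return the one a quality model prefers instead of the most probable one. A minimal sketch, where qe_score is a hypothetical reference-free quality-estimation function; the paper's full pipeline (which also covers MBR-style variants) is not reproduced:

```python
def quality_aware_rerank(source, candidates, qe_score):
    """Return the candidate preferred by a reference-free QE model,
    rather than the MAP (highest model probability) candidate."""
    return max(candidates, key=lambda hyp: qe_score(source, hyp))
```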
- Enabling arbitrary translation objectives with Adaptive Tree Search [23.40984370716434]
We introduce an adaptive tree search algorithm that can find high-scoring outputs under translation models while making no assumptions about the form or structure of the search objective.
Our algorithm has different biases from beam search, which enables a new analysis of the role of decoding bias in autoregressive models.
arXiv Detail & Related papers (2022-02-23T11:48:26Z)
- Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered here are self-normalized, so there is no need to conduct a further correction step.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
arXiv Detail & Related papers (2021-11-11T16:57:53Z)
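As background, a self-normalized importance-sampling estimator reweights samples drawn from a proposal q to estimate an expectation under a target p, normalizing by the sum of the weights so that only an unnormalized density of p is needed. A generic numpy sketch (illustrative background, not the paper's training criterion):

```python
import numpy as np

def snis_estimate(f, p_unnorm, q_pdf, q_sample, n=10_000, seed=0):
    """Estimate E_p[f(x)] from proposal samples using self-normalized
    weights w_i = p_unnorm(x_i) / q_pdf(x_i); p's normalizing constant
    cancels in the ratio, so it never has to be computed."""
    rng = np.random.default_rng(seed)
    xs = q_sample(rng, n)
    w = p_unnorm(xs) / q_pdf(xs)
    return np.sum(w * f(xs)) / np.sum(w)

# Example: the mean of N(1, 1), given only its unnormalized density,
# estimated with samples from a wider N(0, 2) proposal (true value: 1.0).
est = snis_estimate(
    f=lambda x: x,
    p_unnorm=lambda x: np.exp(-0.5 * (x - 1.0) ** 2),
    q_pdf=lambda x: np.exp(-0.125 * x ** 2) / (2.0 * np.sqrt(2.0 * np.pi)),
    q_sample=lambda rng, n: rng.normal(0.0, 2.0, size=n),
)
```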
- Sampling-Based Minimum Bayes Risk Decoding for Neural Machine Translation [20.76001576262768]
We show that a sampling-based approximation to minimum Bayes risk (MBR) decoding has no equivalent to the beam search curse.
We also show that it can be beneficial to make use of strategies like beam search and nucleus sampling to construct hypothesis spaces efficiently.
arXiv Detail & Related papers (2021-08-10T14:35:24Z)
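Sampling-based MBR replaces the argmax over model probability with an argmax over expected utility, using the model's own samples as pseudo-references. A compact sketch (utility is a generic similarity function such as sentence-level BLEU or ChrF; the paper's exact configuration is not reproduced):

```python
def mbr_decode(candidates, utility):
    """Return the candidate with the highest average utility against the
    other sampled candidates, which act as pseudo-references."""
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i):
        hyp = candidates[i]
        refs = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(hyp, ref) for ref in refs) / len(refs)

    return candidates[max(range(len(candidates)), key=expected_utility)]
```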
- Machine Translation Decoding beyond Beam Search [43.27883368285612]
Beam search is the go-to method for decoding auto-regressive machine translation models.
Our aim is to establish whether beam search can be replaced by a more powerful metric-driven search technique.
We introduce a Monte-Carlo Tree Search (MCTS) based method and showcase its competitiveness.
arXiv Detail & Related papers (2021-04-12T10:28:17Z)
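MCTS-based decoding grows a search tree over partial translations and chooses which branch to explore with an upper-confidence rule. A sketch of the standard UCB1 selection score used in the selection phase of MCTS (the exploration constant, and the exact variant used in the paper, may differ):

```python
import math

def ucb1(value_sum, visits, parent_visits, c=1.41):
    """Standard UCB1: average value plus an exploration bonus that
    shrinks as a node accumulates visits."""
    if visits == 0:
        return float("inf")  # always expand unvisited children first
    return value_sum / visits + c * math.sqrt(math.log(parent_visits) / visits)
```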
- AP-Loss for Accurate One-Stage Object Detection [49.13608882885456]
One-stage object detectors are trained by optimizing a classification loss and a localization loss simultaneously.
The former suffers greatly from the extreme foreground-background imbalance caused by the large number of anchors.
This paper proposes a novel framework to replace the classification task in one-stage detectors with a ranking task.
arXiv Detail & Related papers (2020-08-17T13:22:01Z)
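For reference, average precision over a ranked list of detections is the quantity AP-Loss targets, and it is simple to compute from scores and labels; a sketch using scikit-learn (the labels and scores below are illustrative, and the paper's contribution, an error-driven update scheme for optimizing this non-differentiable objective, is not reproduced here):

```python
from sklearn.metrics import average_precision_score

# 1 = anchor matched to a ground-truth box (foreground), 0 = background.
labels = [1, 0, 1, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # detector confidences

# AP summarizes the precision-recall curve of the ranking induced by scores;
# AP-Loss trains the detector to improve this ranking directly.
print(f"AP = {average_precision_score(labels, scores):.3f}")
```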