Rethinking the Evaluation of Neural Machine Translation
- URL: http://arxiv.org/abs/2106.15217v1
- Date: Tue, 29 Jun 2021 09:59:50 GMT
- Title: Rethinking the Evaluation of Neural Machine Translation
- Authors: Jianhao Yan, Chenming Wu, Fandong Meng, Jie Zhou
- Abstract summary: We propose a novel evaluation protocol that avoids the effect of search errors and provides a system-level evaluation from the perspective of model ranking.
Our method is based on our newly proposed exact top-$k$ decoding instead of beam search.
- Score: 25.036685025571927
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The evaluation of neural machine translation systems is usually built upon
translations generated by a particular decoding method (e.g., beam search) and scored
by evaluation metrics over those translations (e.g., BLEU). However, this evaluation
framework suffers from the high search errors introduced by heuristic search algorithms
and is limited by its nature of evaluating only a single best candidate. In this paper,
we propose a novel evaluation protocol that not only avoids the effect of search errors
but also provides a system-level evaluation from the perspective of model ranking. In
particular, our method is based on our newly proposed exact top-$k$ decoding instead of
beam search. Our approach evaluates model errors by the distance between the candidate
spaces as scored by the references and by the model, respectively. Extensive experiments
on WMT'14 English-German demonstrate that poor ranking ability is connected to the
well-known beam search curse, and that state-of-the-art Transformer models face serious
ranking errors. By evaluating various model architectures and techniques, we provide
several interesting findings. Finally, to effectively approximate the exact search
algorithm at the same time cost as the original beam search, we present a
minimum-heap-augmented beam search algorithm.
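The core of the proposed protocol is comparing how the model ranks a shared candidate set against how a reference-based metric ranks it. Below is a minimal sketch of that ranking-distance idea in Python, using Kendall's tau as the rank distance; the paper's exact top-$k$ decoder and exact distance measure are not reproduced here, and model_score and metric_score are illustrative placeholders.

```python
from scipy.stats import kendalltau

def ranking_error(candidates, model_score, metric_score, reference):
    """Compare the model's ranking of a candidate set against a
    reference-based ranking; returns 1 - Kendall's tau, so 0.0 means
    perfect rank agreement and larger values mean worse ranking."""
    model_scores = [model_score(c) for c in candidates]               # model log-probabilities
    metric_scores = [metric_score(c, reference) for c in candidates]  # e.g., sentence BLEU
    tau, _ = kendalltau(model_scores, metric_scores)
    return 1.0 - tau
```

In the paper's setting the candidate set comes from exact top-$k$ decoding, so the comparison reflects model errors alone; with beam search, search errors would contaminate the measurement.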
Related papers
- xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection [21.116517555282314]
xCOMET is an open-source learned metric designed to bridge the gap between sentence-level scoring and error-span detection approaches to machine translation evaluation.
It integrates both sentence-level evaluation and error span detection capabilities, exhibiting state-of-the-art performance across all types of evaluation.
We also provide a robustness analysis with stress tests, and show that xCOMET is largely capable of identifying localized critical errors and hallucinations.
arXiv Detail & Related papers (2023-10-16T15:03:14Z)
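For context, learned metrics such as xCOMET are typically driven through the unbabel-comet Python package. A hedged usage sketch follows; the checkpoint name and predict call mirror the project's documented API, but versions change, so verify against the current docs (the checkpoint is gated and may require accepting a license on Hugging Face).

```python
from comet import download_model, load_from_checkpoint

# Fetch and load the xCOMET checkpoint (gated; needs Hugging Face access).
model_path = download_model("Unbabel/XCOMET-XL")
model = load_from_checkpoint(model_path)

data = [{"src": "Der Hund bellt.", "mt": "The dog barks.", "ref": "The dog is barking."}]
output = model.predict(data, batch_size=8, gpus=1)  # set gpus=0 for CPU
print(output.scores)  # segment-level quality scores; error spans are reported alongside
```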
- Rank-DETR for High Quality Object Detection [52.82810762221516]
A highly performant object detector requires accurate ranking of its bounding box predictions.
In this work, we introduce a simple and highly performant DETR-based object detector by proposing a series of rank-oriented designs.
arXiv Detail & Related papers (2023-10-13T04:48:32Z)
- MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types [68.76742370525234]
We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by the recent efforts in several NLP tasks for fine-grained evaluation, we introduce a set of 13 mismatch error types.
We show that mismatch errors between sentence pairs on held-out datasets from the 7 NLP tasks align well with human evaluation.
arXiv Detail & Related papers (2023-06-18T01:38:53Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
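The stress-test recipe above is easy to reproduce in miniature: inject a synthetic error into a hypothesis and check whether the metric's score drops commensurately. A sketch using sentence-level BLEU from sacrebleu as a stand-in metric (the paper probes model-based metrics, and the perturbation below is illustrative):

```python
from sacrebleu.metrics import BLEU

bleu = BLEU(effective_order=True)  # avoids zero scores on short sentences

ref = "The committee approved the budget on Tuesday."
hyp = "The committee approved the budget on Tuesday."
corrupted = "The committee rejected the budget on Tuesday."  # synthetic meaning-flipping error

clean = bleu.sentence_score(hyp, [ref]).score
broken = bleu.sentence_score(corrupted, [ref]).score
# A robust metric should penalize the corruption; a negligible drop signals a blind spot.
print(f"clean={clean:.1f}  corrupted={broken:.1f}  drop={clean - broken:.1f}")
```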
- Quality-Aware Decoding for Neural Machine Translation [64.24934199944875]
We propose quality-aware decoding for neural machine translation (NMT).
We leverage recent breakthroughs in reference-free and reference-based MT evaluation through various inference methods.
We find that quality-aware decoding consistently outperforms MAP-based decoding according to both state-of-the-art automatic metrics and human assessments.
arXiv Detail & Related papers (2022-05-02T15:26:28Z)
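One common instantiation of quality-aware decoding is N-best reranking: generate candidates with the usual decoder, then return the one a quality model prefers instead of the most probable one. A minimal sketch, where qe_score is a hypothetical reference-free quality-estimation function; the paper's full pipeline (which also covers MBR-style variants) is not reproduced:

```python
def quality_aware_rerank(source, candidates, qe_score):
    """Return the candidate preferred by a reference-free QE model,
    rather than the MAP (highest model probability) candidate."""
    return max(candidates, key=lambda hyp: qe_score(source, hyp))
```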
- Enabling arbitrary translation objectives with Adaptive Tree Search [23.40984370716434]
We introduce an adaptive tree search algorithm that can find high-scoring outputs under translation models while making no assumptions about the form or structure of the search objective.
Our algorithm has different biases from beam search, which enables a new analysis of the role of decoding bias in autoregressive models.
arXiv Detail & Related papers (2022-02-23T11:48:26Z)
- Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered here are self-normalized, so there is no need to conduct a further correction step.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
arXiv Detail & Related papers (2021-11-11T16:57:53Z)
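As background, a self-normalized importance-sampling estimator reweights samples drawn from a proposal q to estimate an expectation under a target p, normalizing by the sum of the weights so that only an unnormalized density of p is needed. A generic numpy sketch (illustrative background, not the paper's training criterion):

```python
import numpy as np

def snis_estimate(f, p_unnorm, q_pdf, q_sample, n=10_000, seed=0):
    """Estimate E_p[f(x)] from proposal samples using self-normalized
    weights w_i = p_unnorm(x_i) / q_pdf(x_i); p's normalizing constant
    cancels in the ratio, so it never has to be computed."""
    rng = np.random.default_rng(seed)
    xs = q_sample(rng, n)
    w = p_unnorm(xs) / q_pdf(xs)
    return np.sum(w * f(xs)) / np.sum(w)

# Example: the mean of N(1, 1), given only its unnormalized density,
# estimated with samples from a wider N(0, 2) proposal (true value: 1.0).
est = snis_estimate(
    f=lambda x: x,
    p_unnorm=lambda x: np.exp(-0.5 * (x - 1.0) ** 2),
    q_pdf=lambda x: np.exp(-0.125 * x ** 2) / (2.0 * np.sqrt(2.0 * np.pi)),
    q_sample=lambda rng, n: rng.normal(0.0, 2.0, size=n),
)
```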
- Sampling-Based Minimum Bayes Risk Decoding for Neural Machine Translation [20.76001576262768]
We show that a sampling-based approximation to minimum Bayes risk (MBR) decoding has no equivalent to the beam search curse.
We also show that it can be beneficial to make use of strategies like beam search and nucleus sampling to construct hypothesis spaces efficiently.
arXiv Detail & Related papers (2021-08-10T14:35:24Z)
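Sampling-based MBR replaces the argmax over model probability with an argmax over expected utility, using the model's own samples as pseudo-references. A compact sketch (utility is a generic similarity function such as sentence-level BLEU or ChrF; the paper's exact configuration is not reproduced):

```python
def mbr_decode(candidates, utility):
    """Return the candidate with the highest average utility against the
    other sampled candidates, which act as pseudo-references."""
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i):
        hyp = candidates[i]
        refs = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(hyp, ref) for ref in refs) / len(refs)

    return candidates[max(range(len(candidates)), key=expected_utility)]
```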
- Machine Translation Decoding beyond Beam Search [43.27883368285612]
Beam search is the go-to method for decoding auto-regressive machine translation models.
Our aim is to establish whether beam search can be replaced by a more powerful metric-driven search technique.
We introduce a Monte-Carlo Tree Search (MCTS) based method and showcase its competitiveness.
arXiv Detail & Related papers (2021-04-12T10:28:17Z)
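MCTS-based decoding grows a search tree over partial translations and chooses which branch to explore with an upper-confidence rule. A sketch of the standard UCB1 selection score used in the selection phase of MCTS (the exploration constant, and the exact variant used in the paper, may differ):

```python
import math

def ucb1(value_sum, visits, parent_visits, c=1.41):
    """Standard UCB1: average value plus an exploration bonus that
    shrinks as a node accumulates visits."""
    if visits == 0:
        return float("inf")  # always expand unvisited children first
    return value_sum / visits + c * math.sqrt(math.log(parent_visits) / visits)
```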
- AP-Loss for Accurate One-Stage Object Detection [49.13608882885456]
One-stage object detectors are trained by optimizing a classification loss and a localization loss simultaneously.
The former suffers greatly from the extreme foreground-background imbalance caused by the large number of anchors.
This paper proposes a novel framework to replace the classification task in one-stage detectors with a ranking task.
arXiv Detail & Related papers (2020-08-17T13:22:01Z)
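For reference, average precision over a ranked list of detections is the quantity AP-Loss targets, and it is simple to compute from scores and labels; a sketch using scikit-learn (the labels and scores below are illustrative, and the paper's contribution, an error-driven update scheme for optimizing this non-differentiable objective, is not reproduced here):

```python
from sklearn.metrics import average_precision_score

# 1 = anchor matched to a ground-truth box (foreground), 0 = background.
labels = [1, 0, 1, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # detector confidences

# AP summarizes the precision-recall curve of the ranking induced by scores;
# AP-Loss trains the detector to improve this ranking directly.
print(f"AP = {average_precision_score(labels, scores):.3f}")
```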