SLIDE: Reference-free Evaluation for Machine Translation using a Sliding Document Window
- URL: http://arxiv.org/abs/2309.08832v2
- Date: Tue, 2 Apr 2024 09:36:24 GMT
- Title: SLIDE: Reference-free Evaluation for Machine Translation using a Sliding Document Window
- Authors: Vikas Raunak, Tom Kocmi, Matt Post
- Abstract summary: We present a metric named SLIDE (SLIding Document Evaluator) which operates on blocks of sentences.
We find that SLIDE obtains significantly higher pairwise system accuracy than its sentence-level baseline.
This suggests that source context may provide the same information as a human reference in disambiguating source ambiguities.
- Score: 24.524282909076767
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reference-based metrics that operate at the sentence level typically outperform quality estimation metrics, which have access only to the source and system output. This is unsurprising, since references resolve ambiguities that may be present in the source. In this paper, we investigate whether additional source context can effectively substitute for a reference. We present a metric named SLIDE (SLIding Document Evaluator), which operates on blocks of sentences. SLIDE leverages a moving window that slides over each document in the test set, feeding each chunk of sentences into an unmodified, off-the-shelf quality estimation model. We find that SLIDE obtains significantly higher pairwise system accuracy than its sentence-level baseline, in some cases even eliminating the gap with reference-based metrics. This suggests that source context may provide the same information as a human reference in disambiguating source ambiguities. This finding is especially pertinent for reference-free document-level evaluation, wherein SLIDE could provide higher-quality pairwise system assessments while only requiring document boundary annotations.
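The sliding-window procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `sliding_windows`, `slide_score`, and `score_fn` are hypothetical, the default window and stride values are placeholders, and joining sentences with a space and averaging the chunk scores are assumptions made for the sketch. `score_fn` stands in for any off-the-shelf quality estimation model that maps a (source, hypothesis) pair to a number.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Document = Tuple[Sequence[str], Sequence[str]]  # (source sentences, system output sentences)


def sliding_windows(n_sentences: int, window: int, stride: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs for a window sliding over one document.

    Documents shorter than the window yield a single chunk covering everything.
    In this sketch, trailing sentences that no full window reaches are skipped.
    """
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive")
    if n_sentences <= window:
        if n_sentences > 0:
            yield (0, n_sentences)
        return
    for start in range(0, n_sentences - window + 1, stride):
        yield (start, start + window)


def slide_score(
    documents: Sequence[Document],
    score_fn: Callable[[str, str], float],  # hypothetical stand-in for a QE model
    window: int = 3,   # placeholder value, not taken from the paper
    stride: int = 1,   # placeholder value, not taken from the paper
) -> float:
    """Score every chunk of every document and return the mean chunk score.

    Windows never cross document boundaries, which is why only document
    boundary annotations are needed.
    """
    chunk_scores: List[float] = []
    for src_sents, hyp_sents in documents:
        if len(src_sents) != len(hyp_sents):
            raise ValueError("source and output must be sentence-aligned")
        for start, end in sliding_windows(len(src_sents), window, stride):
            # Assumption: a chunk is the plain space-joined concatenation of its sentences.
            src_chunk = " ".join(src_sents[start:end])
            hyp_chunk = " ".join(hyp_sents[start:end])
            chunk_scores.append(score_fn(src_chunk, hyp_chunk))
    if not chunk_scores:
        raise ValueError("no chunks to score")
    # Assumption: the system-level score is the unweighted mean over chunks.
    return sum(chunk_scores) / len(chunk_scores)
```

Because the quality estimation model is passed in as a plain callable, the same loop can wrap any sentence-level scorer without modifying it, which mirrors the abstract's point that the underlying model is used unmodified.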
Related papers
- Towards Multiple References Era -- Addressing Data Leakage and Limited
Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z) - Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by enriching the number of references.
We conduct experiments to empirically demonstrate that diversifying the expression of reference can significantly enhance the correlation between automatic evaluation and human evaluation.
arXiv Detail & Related papers (2023-05-24T11:53:29Z) - SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that system-level correlations of our proposed metric with a model-based matching function outperform all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z) - Revisiting the Evaluation Metrics of Paraphrase Generation [35.6803390044542]
Most existing paraphrase generation models use reference-based metrics to evaluate their generated paraphrase.
This paper proposes BBScore, a reference-free metric that can reflect the generated paraphrase's quality.
arXiv Detail & Related papers (2022-02-17T07:18:54Z) - REAM♯: An Enhancement Approach to Reference-based Evaluation
Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predicted results can be helpful to augment the reference set, and thus improve the reliability of the metric.
arXiv Detail & Related papers (2021-05-30T10:04:13Z) - A Comparison of Approaches to Document-level Machine Translation [34.2276281264886]
This paper presents a systematic comparison of selected approaches to document-level phenomena evaluation suites.
We find that a simple method based purely on back-translating monolingual document-level data performs as well as much more elaborate alternatives.
arXiv Detail & Related papers (2021-01-26T19:21:09Z) - Document-Level Definition Detection in Scholarly Documents: Existing
Models, Error Analyses, and Future Directions [40.64025648548128]
We develop a new definition detection system, HEDDEx, that utilizes syntactic features, transformer encoders, and filters, and evaluate it on a standard sentence-level benchmark.
HEDDEx outperforms the leading system on both the sentence-level and the document-level tasks, by 12.7 F1 points and 14.4 F1 points, respectively.
arXiv Detail & Related papers (2020-10-11T01:16:10Z) - Document-level Neural Machine Translation with Document Embeddings [82.4684444847092]
This work focuses on exploiting detailed document-level context in terms of multiple forms of document embeddings.
The proposed document-aware NMT is implemented to enhance the Transformer baseline by introducing both global and local document-level clues on the source end.
arXiv Detail & Related papers (2020-09-16T19:43:29Z) - BLEU might be Guilty but References are not Innocent [34.817010352734]
We study different methods to collect references and compare their value in automated evaluation.
Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task.
Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German, but also for Back-translation and APE augmented MT output.
arXiv Detail & Related papers (2020-04-13T16:49:09Z) - Towards Making the Most of Context in Neural Machine Translation [112.9845226123306]
We argue that previous research did not make clear use of the global context.
We propose a new document-level NMT framework that deliberately models the local context of each sentence.
arXiv Detail & Related papers (2020-02-19T03:30:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.