Improving Metrics for Speech Translation
- URL: http://arxiv.org/abs/2305.12918v1
- Date: Mon, 22 May 2023 11:01:38 GMT
- Title: Improving Metrics for Speech Translation
- Authors: Claudio Paonessa and Dominik Frefel and Manfred Vogel
- Abstract summary: We introduce Parallel Paraphrasing ($\text{Para}_\text{both}$), an augmentation method for translation metrics making use of automatic paraphrasing of both the reference and hypothesis.
We show that we are able to significantly improve the correlation with human quality perception if our method is applied to commonly used metrics.
- Score: 1.2891210250935146
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Parallel Paraphrasing ($\text{Para}_\text{both}$), an
augmentation method for translation metrics making use of automatic
paraphrasing of both the reference and hypothesis. This method counteracts the
typically misleading results of speech translation metrics such as WER, CER,
and BLEU if only a single reference is available. We introduce two new datasets
explicitly created to measure the quality of metrics intended to be applied to
Swiss German speech-to-text systems. Based on these datasets, we show that we
are able to significantly improve the correlation with human quality perception
if our method is applied to commonly used metrics.
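The idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how scores over paraphrases are combined, so the code below makes a loud assumption: it scores every pairing of (reference or one of its paraphrases) with (hypothesis or one of its paraphrases) and keeps the best, i.e. lowest, WER. The names `wer`, `para_both`, and `toy_paraphrase` are hypothetical and not taken from the paper; a real system would replace the lookup table with an automatic paraphrasing model.

```python
from itertools import product
from typing import Callable, List


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        # Convention chosen for this sketch only: empty reference.
        return 0.0 if not hyp else 1.0
    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def para_both(reference: str,
              hypothesis: str,
              paraphrase: Callable[[str], List[str]],
              metric: Callable[[str, str], float] = wer) -> float:
    """Score all reference/hypothesis paraphrase pairs and keep the lowest error.

    ASSUMPTION: taking the minimum over pairs is an illustrative choice for a
    lower-is-better metric; the paper's actual aggregation may differ.
    """
    refs = [reference] + list(paraphrase(reference))
    hyps = [hypothesis] + list(paraphrase(hypothesis))
    return min(metric(r, h) for r, h in product(refs, hyps))


# Hypothetical stand-in for an automatic paraphraser: a fixed lookup table.
_TOY_TABLE = {"I am going home": ["I'm going home"]}


def toy_paraphrase(sentence: str) -> List[str]:
    return _TOY_TABLE.get(sentence, [])
```

With the single reference "I am going home" and the hypothesis "I'm going home", plain `wer` reports 0.5 even though the meaning is identical, while `para_both(..., toy_paraphrase)` reports 0.0 because one paraphrase of the reference matches the hypothesis exactly. This is the single-reference problem the abstract describes.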
Related papers
- Evaluating Automatic Metrics with Incremental Machine Translation Systems [55.78547133890403]
We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions.
We assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations.
arXiv Detail & Related papers (2024-07-03T17:04:17Z)
- Training and Meta-Evaluating Machine Translation Evaluation Metrics at the Paragraph Level [23.47729750104952]
We propose a method for creating paragraph-level data for training and meta-evaluating metrics.
Experiments show that using sentence-level metrics to score entire paragraphs is as effective as using a metric designed to work at the paragraph level.
arXiv Detail & Related papers (2023-08-25T17:31:46Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Embarrassingly Easy Document-Level MT Metrics: How to Convert Any Pretrained Metric Into a Document-Level Metric [15.646714712131148]
We present a method for extending pretrained metrics to incorporate context at the document level.
We show that the extended metrics outperform their sentence-level counterparts in about 85% of the tested conditions.
Our experimental results support our initial hypothesis and show that a simple extension of the metrics permits them to take advantage of context.
arXiv Detail & Related papers (2022-09-27T19:42:22Z)
- InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation [27.129551973093008]
InfoLM is a family of untrained metrics that can be viewed as a string-based metric.
This family of metrics also makes use of information measures allowing the adaptation of InfoLM to various evaluation criteria.
arXiv Detail & Related papers (2021-12-02T20:09:29Z)
- REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predicted results can be helpful to augment the reference set, and thus improve the reliability of the metric.
arXiv Detail & Related papers (2021-05-30T10:04:13Z)
- Measuring and Increasing Context Usage in Context-Aware Machine Translation [64.5726087590283]
We introduce a new metric, conditional cross-mutual information, to quantify the usage of context by machine translation models.
We then introduce a new, simple training method, context-aware word dropout, to increase the usage of context by context-aware models.
arXiv Detail & Related papers (2021-05-07T19:55:35Z)
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
- BLEU might be Guilty but References are not Innocent [34.817010352734]
We study different methods to collect references and compare their value in automated evaluation.
Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task.
Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German, but also for Back-translation and APE augmented MT output.
arXiv Detail & Related papers (2020-04-13T16:49:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.