A Benchmark for Evaluating Machine Translation Metrics on Dialects
Without Standard Orthography
- URL: http://arxiv.org/abs/2311.16865v1
- Date: Tue, 28 Nov 2023 15:12:11 GMT
- Title: A Benchmark for Evaluating Machine Translation Metrics on Dialects
Without Standard Orthography
- Authors: Noëmi Aepli, Chantal Amrhein, Florian Schottmann, Rico Sennrich
- Abstract summary: We evaluate how robust metrics are to non-standardized dialects.
We collect a dataset of human translations and human judgments for automatic machine translations from English to two Swiss German dialects.
- Score: 40.04973667048665
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For sensible progress in natural language processing, it is important that we
are aware of the limitations of the evaluation metrics we use. In this work, we
evaluate how robust metrics are to non-standardized dialects, i.e. spelling
differences in language varieties that do not have a standard orthography. To
investigate this, we collect a dataset of human translations and human
judgments for automatic machine translations from English to two Swiss German
dialects. We further create a challenge set for dialect variation and benchmark
existing metrics' performances. Our results show that existing metrics cannot
reliably evaluate Swiss German text generation outputs, especially at the
segment level. We propose initial design adaptations that increase robustness in the
face of non-standardized dialects, although there remains much room for further
improvement. The dataset, code, and models are available here:
https://github.com/textshuttle/dialect_eval
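
As a rough illustration of the segment-level benchmarking described above, the sketch below scores hypothetical MT outputs against Swiss German references with chrF (via sacrebleu) and correlates the scores with invented human judgments using Kendall's tau. The choice of chrF and the correlation measure are assumptions for illustration only; the paper benchmarks a range of existing metrics, and none of the segments or numbers below come from the released dataset.

# Minimal sketch of segment-level meta-evaluation, assuming sacrebleu and scipy
# are installed; all segments and judgments are invented placeholders.
from sacrebleu.metrics import CHRF
from scipy.stats import kendalltau

# hypothetical MT outputs, Swiss German references, and human judgments (0-1 scale)
hypotheses = ["Ich gang hüt id Schuel.", "Er hät s Buech gläse.", "Si chunt morn."]
references = ["I gange hüt i d Schuel.", "Er het das Buech gläse.", "Sie chunnt morn."]
human_scores = [0.85, 0.90, 0.95]

chrf = CHRF()
metric_scores = [
    chrf.sentence_score(hyp, [ref]).score  # chrF for one hypothesis/reference pair
    for hyp, ref in zip(hypotheses, references)
]

# segment-level agreement between the metric and human judgments;
# non-standard spelling in the references tends to depress this correlation
tau, _ = kendalltau(metric_scores, human_scores)
print("chrF segment scores:", [round(s, 1) for s in metric_scores])
print(f"Kendall tau vs. human judgments: {tau:.3f}")

On real data, a low segment-level correlation of this kind is what the abstract above refers to when it states that existing metrics are unreliable at the segment level.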
Related papers
- DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores; a minimal sketch of such a perturbation check appears at the end of this page.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
- Multi-VALUE: A Framework for Cross-Dialectal English NLP [49.55176102659081]
Multi-VALUE is a controllable rule-based translation system spanning 50 English dialects.
Stress tests reveal significant performance disparities for leading models on non-standard dialects.
We partner with native speakers of Chicano and Indian English to release new gold-standard variants of the popular CoQA task.
arXiv Detail & Related papers (2022-12-15T18:17:01Z)
- Dialect-robust Evaluation of Generated Text [40.85375247260744]
We formalize dialect robustness and dialect awareness as goals for NLG evaluation metrics.
Applying the suite to current state-of-the-art metrics, we demonstrate that they are not dialect-robust.
arXiv Detail & Related papers (2022-11-02T07:12:23Z)
- SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German [22.30271453485001]
We introduce the first annotated parallel corpus of spoken Swiss German across 8 major dialects, plus a Standard German reference.
Our goal has been to create and to make available a basic dataset for employing data-driven NLP applications in Swiss German.
arXiv Detail & Related papers (2021-03-21T14:00:09Z)
- Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
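
The challenge set mentioned in the abstract above, like the stress tests described in the Blind Spots and Dialect-robust Evaluation papers, essentially asks how a metric reacts to harmless dialectal spelling variation compared with a genuine translation error. The sketch below makes that comparison with chrF as an assumed example metric; the Swiss German sentences and both "perturbations" are invented for illustration and are not taken from the released challenge set.

# Minimal sketch of a challenge-set style robustness check, assuming sacrebleu;
# the sentences below are invented and not from the released dialect_eval data.
from sacrebleu.metrics import CHRF

chrf = CHRF()
reference = "Er het das Buech scho gläse."

# (a) harmless dialectal spelling variant of the same sentence
spelling_variant = "Er hät das Buech scho glääse."
# (b) genuine meaning error (wrong content words)
meaning_error = "Er het das Auto scho verchauft."

score_variant = chrf.sentence_score(spelling_variant, [reference]).score
score_error = chrf.sentence_score(meaning_error, [reference]).score

# a dialect-robust metric should penalize (b) far more than (a);
# surface-overlap metrics often penalize both to a similar degree
print(f"chrF for spelling variant: {score_variant:.1f}")
print(f"chrF for meaning error:    {score_error:.1f}")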