Evaluating the Evaluation Metrics for Style Transfer: A Case Study in
Multilingual Formality Transfer
- URL: http://arxiv.org/abs/2110.10668v1
- Date: Wed, 20 Oct 2021 17:21:09 GMT
- Title: Evaluating the Evaluation Metrics for Style Transfer: A Case Study in
Multilingual Formality Transfer
- Authors: Eleftheria Briakou, Sweta Agrawal, Joel Tetreault and Marine Carpuat
- Abstract summary: This work is the first multilingual evaluation of metrics in style transfer (ST).
We evaluate leading ST automatic metrics on the oft-researched task of formality style transfer.
We identify several models that correlate well with human judgments and are robust across languages.
- Score: 11.259786293913606
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While the field of style transfer (ST) has been growing rapidly, it has been
hampered by a lack of standardized practices for automatic evaluation. In this
paper, we evaluate leading ST automatic metrics on the oft-researched task of
formality style transfer. Unlike previous evaluations, which focus solely on
English, we expand our focus to Brazilian-Portuguese, French, and Italian,
making this work the first multilingual evaluation of metrics in ST. We outline
best practices for automatic evaluation in (formality) style transfer and
identify several models that correlate well with human judgments and are robust
across languages. We hope that this work will help accelerate development in
ST, where human evaluation is often challenging to collect.
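As a rough illustration of the meta-evaluation described above, the sketch below computes segment-level correlation between an automatic metric's scores and human judgments of the same style-transfer outputs. The score lists are invented for illustration and the use of SciPy is an assumed choice of library; this is not the paper's released data or code.

```python
# Minimal sketch: correlating an automatic metric's segment-level scores with
# human judgments, the standard way ST metrics are meta-evaluated.
# The numbers below are made up for illustration only.
from scipy.stats import pearsonr, spearmanr

# Hypothetical scores for the same six formality-transfer outputs.
human_scores = [4.0, 2.5, 3.5, 5.0, 1.0, 4.5]        # e.g. human formality ratings (1-5)
metric_scores = [0.82, 0.40, 0.65, 0.91, 0.15, 0.77]  # e.g. a classifier's formality probability

rho, rho_p = spearmanr(metric_scores, human_scores)
r, r_p = pearsonr(metric_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
print(f"Pearson  r   = {r:.3f} (p = {r_p:.3f})")
```

A metric that ranks outputs the way humans do will show a high Spearman correlation; repeating the same comparison per language is one way to probe robustness across languages.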
Related papers
- Disce aut Deficere: Evaluating LLMs Proficiency on the INVALSI Italian Benchmark [12.729687989535359]
Evaluating Large Language Models (LLMs) in languages other than English is crucial for ensuring their linguistic versatility, cultural relevance, and applicability in diverse global contexts.
We tackle this challenge by introducing a structured benchmark using the INVALSI tests, a set of well-established assessments designed to measure educational competencies across Italy.
arXiv Detail & Related papers (2024-06-25T13:20:08Z) - Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation [50.60733773088296]
We conduct a comprehensive human evaluation of the results of several shared tasks from the last International Workshop on Spoken Language Translation (IWSLT 2023).
We propose an effective evaluation strategy based on automatic resegmentation and direct assessment with segment context.
Our analysis revealed that: 1) the proposed evaluation strategy is robust and its scores correlate well with other types of human judgements; 2) automatic metrics are usually, but not always, well correlated with direct assessment scores; and 3) COMET is a slightly stronger automatic metric than chrF.
arXiv Detail & Related papers (2024-06-06T09:18:42Z) - LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and instruction following to a non-English language.
We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer.
We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z) - Text Style Transfer Evaluation Using Large Language Models [24.64611983641699]
Large Language Models (LLMs) have shown their capacity to match and even exceed average human performance.
We compare the results of different LLMs in text style transfer (TST) using multiple input prompts.
Our findings highlight a strong correlation between (even zero-shot) prompting and human evaluation, showing that LLMs often outperform traditional automated metrics.
arXiv Detail & Related papers (2023-08-25T13:07:33Z) - BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual
Transfer [81.5984433881309]
We introduce BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format.
BUFFET is designed to establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer.
Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer.
arXiv Detail & Related papers (2023-05-24T08:06:33Z) - Revisiting Machine Translation for Cross-lingual Classification [91.43729067874503]
Most research in the area focuses on multilingual models rather than the machine translation (MT) component.
We show that, by using a stronger MT system and mitigating the mismatch between training on original text and running inference on machine translated text, translate-test can do substantially better than previously assumed.
arXiv Detail & Related papers (2023-05-23T16:56:10Z) - Human Judgement as a Compass to Navigate Automatic Metrics for Formality
Transfer [13.886432536330807]
We focus on the task of formality transfer, and on the three aspects that are usually evaluated: style strength, content preservation, and fluency (a minimal scoring sketch of these three aspects follows the list below).
We offer some recommendations on the use of such metrics in formality transfer, also with an eye to their generalisability (or not) to related tasks.
arXiv Detail & Related papers (2022-04-15T17:15:52Z) - An Overview on Machine Translation Evaluation [6.85316573653194]
Machine translation (MT) has become one of the important tasks of AI research and development.
The task of MT evaluation is not only to assess the quality of machine translation, but also to give timely feedback to machine translation researchers.
This report mainly covers a brief history of machine translation evaluation (MTE), the classification of research methods on MTE, and cutting-edge progress.
arXiv Detail & Related papers (2022-02-22T16:58:28Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and
Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation (IGLUE) benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z) - Methods for Detoxification of Texts for the Russian Language [55.337471467610094]
We present the first study of automatic detoxification of Russian texts to combat offensive language.
We test two types of models: an unsupervised approach that performs local corrections and a supervised approach based on the pretrained GPT-2 language model.
The results show that the tested approaches can be successfully used for detoxification, although there is room for improvement.
arXiv Detail & Related papers (2021-05-19T10:37:44Z) - On the interaction of automatic evaluation and task framing in headline
style transfer [6.27489964982972]
In this paper, we propose an evaluation method for a task involving subtle textual differences, such as style transfer.
We show that it better reflects system differences than traditional metrics such as BLEU and ROUGE.
arXiv Detail & Related papers (2021-01-05T16:36:26Z)
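As a rough companion to the formality-transfer entries above (in particular the three aspects named in "Human Judgement as a Compass to Navigate Automatic Metrics for Formality Transfer"), the sketch below scores a single output for style strength, content preservation, and fluency. The formality-classifier model id is hypothetical, and the specific choices (sentence BLEU for content, GPT-2 perplexity for fluency) are illustrative assumptions, not the metrics recommended by these papers.

```python
# Illustrative scoring of one formality-transfer output along the three aspects
# usually evaluated: style strength, content preservation, and fluency.
import torch
import sacrebleu
from transformers import GPT2LMHeadModel, GPT2TokenizerFast, pipeline

output = "Hello, could you please send me the file as soon as possible?"  # system output
reference = "Could you please send me the file as soon as possible?"      # human reference

# 1) Style strength: score from a formality classifier
#    ("my-org/formality-classifier" is a hypothetical model id).
formality_clf = pipeline("text-classification", model="my-org/formality-classifier")
style_strength = formality_clf(output)[0]  # dict with predicted label and score

# 2) Content preservation: surface overlap with a human reference (sentence BLEU).
content_bleu = sacrebleu.sentence_bleu(output, [reference]).score

# 3) Fluency: perplexity of the output under a general-purpose language model.
tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
ids = tok(output, return_tensors="pt").input_ids
with torch.no_grad():
    loss = lm(ids, labels=ids).loss
fluency_ppl = torch.exp(loss).item()

print(f"style: {style_strength}  content BLEU: {content_bleu:.1f}  fluency ppl: {fluency_ppl:.1f}")
```

Each of these automatic scores is exactly the kind of metric whose correlation with human judgments the main paper evaluates across languages.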
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences.