Don't Use English Dev: On the Zero-Shot Cross-Lingual Evaluation of
Contextual Embeddings
- URL: http://arxiv.org/abs/2004.15001v2
- Date: Tue, 6 Oct 2020 09:50:52 GMT
- Title: Don't Use English Dev: On the Zero-Shot Cross-Lingual Evaluation of
Contextual Embeddings
- Authors: Phillip Keung, Yichao Lu, Julian Salazar, Vikas Bhardwaj
- Abstract summary: We show that the standard practice of using English dev accuracy for model selection in the zero-shot setting makes it difficult to obtain reproducible results.
We recommend providing oracle scores alongside zero-shot results: still fine-tune using English data, but choose a checkpoint with the target dev set.
- Score: 11.042674237070012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilingual contextual embeddings have demonstrated state-of-the-art
performance in zero-shot cross-lingual transfer learning, where multilingual
BERT is fine-tuned on one source language and evaluated on a different target
language. However, published results for mBERT zero-shot accuracy vary as much
as 17 points on the MLDoc classification task across four papers. We show that
the standard practice of using English dev accuracy for model selection in the
zero-shot setting makes it difficult to obtain reproducible results on the
MLDoc and XNLI tasks. English dev accuracy is often uncorrelated (or even
anti-correlated) with target language accuracy, and zero-shot performance
varies greatly at different points in the same fine-tuning run and between
different fine-tuning runs. These reproducibility issues are also present for
other tasks with different pre-trained embeddings (e.g., MLQA with XLM-R). We
recommend providing oracle scores alongside zero-shot results: still fine-tune
using English data, but choose a checkpoint with the target dev set. Reporting
this upper bound makes results more consistent by avoiding arbitrarily bad
checkpoints.
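The abstract's recommendation can be sketched as a minimal checkpoint-selection routine: fine-tune on English as usual, but report both the standard zero-shot score (checkpoint chosen by English dev accuracy) and the oracle score (checkpoint chosen by target-language dev accuracy). The per-checkpoint accuracy values below are illustrative, not taken from the paper.

```python
def select_checkpoint(dev_accuracies):
    """Return the index of the checkpoint with the highest dev accuracy."""
    return max(range(len(dev_accuracies)), key=lambda i: dev_accuracies[i])

# Hypothetical per-checkpoint dev accuracies from one fine-tuning run.
english_dev = [0.90, 0.93, 0.94, 0.92]   # source-language (English) dev set
target_dev  = [0.78, 0.70, 0.65, 0.81]   # target-language dev set

# Standard practice: pick the checkpoint by English dev accuracy,
# then evaluate it on the target language (the zero-shot score).
zero_shot_ckpt  = select_checkpoint(english_dev)
zero_shot_score = target_dev[zero_shot_ckpt]

# Recommended addition: also report the oracle score, i.e. the best
# target-language accuracy achievable from any checkpoint of this run.
oracle_ckpt  = select_checkpoint(target_dev)
oracle_score = target_dev[oracle_ckpt]

print(f"zero-shot: {zero_shot_score:.2f}, oracle: {oracle_score:.2f}")
# → zero-shot: 0.65, oracle: 0.81
```

In this illustrative run, the checkpoint with the best English dev accuracy is one of the worst on the target language, which is exactly the anti-correlation the paper warns about; the oracle score bounds how bad that arbitrary choice can get.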
Related papers
- English Please: Evaluating Machine Translation with Large Language Models for Multilingual Bug Reports [0.0]
This study is the first comprehensive evaluation of machine translation (MT) performance on bug reports. We analyze the capabilities of DeepL, AWS Translate, and large language models such as ChatGPT, Claude, Gemini, LLaMA, and Mistral. We employ a range of MT evaluation metrics, including BLEU, BERTScore, COMET, METEOR, and ROUGE, alongside classification metrics such as accuracy, precision, recall, and F1-score.
arXiv Detail & Related papers (2025-02-20T07:47:03Z) - Question Translation Training for Better Multilingual Reasoning [108.10066378240879]
Large language models show compelling performance on reasoning tasks but they tend to perform much worse in languages other than English.
A typical solution is to translate instruction data into all languages of interest, and then train on the resulting multilingual data, which is called translate-training.
In this paper we explore the benefits of question alignment, where we train the model to translate reasoning questions into English by finetuning on X-English parallel question data.
arXiv Detail & Related papers (2024-01-15T16:39:10Z) - Unlikelihood Tuning on Negative Samples Amazingly Improves Zero-Shot
Translation [79.96416609433724]
Zero-shot translation (ZST) aims to translate between unseen language pairs in training data.
The common practice to guide the zero-shot language mapping during inference is to deliberately insert the source and target language IDs.
Recent studies have shown that language IDs sometimes fail to steer the model toward the intended target language, causing ZST to suffer from the off-target problem.
arXiv Detail & Related papers (2023-09-28T17:02:36Z) - On the Off-Target Problem of Zero-Shot Multilingual Neural Machine
Translation [104.85258654917297]
We find that failing to encode a discriminative target-language signal leads to off-target translation and a smaller lexical distance between languages.
We propose Language Aware Vocabulary Sharing (LAVS) to construct the multilingual vocabulary.
We conduct experiments on a multilingual machine translation benchmark in 11 languages.
arXiv Detail & Related papers (2023-05-18T12:43:31Z) - Prompt-Tuning Can Be Much Better Than Fine-Tuning on Cross-lingual
Understanding With Multilingual Language Models [95.32691891392903]
In this paper, we do cross-lingual evaluation on various NLU tasks using prompt-tuning and compare it with fine-tuning.
The results show that prompt tuning achieves much better cross-lingual transfer than fine-tuning across datasets.
arXiv Detail & Related papers (2022-10-22T05:48:02Z) - Aligned Weight Regularizers for Pruning Pretrained Neural Networks [6.000551438232907]
We show that there is a clear performance discrepancy in magnitude-based pruning when comparing standard supervised learning to the zero-shot setting.
We propose two weight regularizers that aim to maximize the alignment between units of pruned and unpruned networks.
arXiv Detail & Related papers (2022-04-04T11:06:42Z) - On the Relation between Syntactic Divergence and Zero-Shot Performance [22.195133438732633]
We take the transfer of Universal Dependencies (UD) parsing from English to a diverse set of languages and conduct two sets of experiments.
We analyze zero-shot performance based on the extent to which English source edges are preserved in translation.
In both sets of experiments, our results suggest a strong relation between cross-lingual stability and zero-shot parsing performance.
arXiv Detail & Related papers (2021-10-09T21:09:21Z) - Zero-Shot Cross-lingual Semantic Parsing [56.95036511882921]
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-Logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z) - Subword Segmentation and a Single Bridge Language Affect Zero-Shot
Neural Machine Translation [36.4055239280145]
We investigate zero-shot performance of a multilingual EN↔{FR,CS,DE,FI} system trained on WMT data.
We observe a bias towards copying the source in zero-shot translation, and investigate how the choice of subword segmentation affects this bias.
arXiv Detail & Related papers (2020-11-03T13:45:54Z) - Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.