Is Translation Helpful? An Empirical Analysis of Cross-Lingual Transfer in Low-Resource Dialog Generation
- URL: http://arxiv.org/abs/2305.12480v1
- Date: Sun, 21 May 2023 15:07:04 GMT
- Title: Is Translation Helpful? An Empirical Analysis of Cross-Lingual Transfer in Low-Resource Dialog Generation
- Authors: Lei Shen, Shuai Yu and Xiaoyu Shen
- Abstract summary: Cross-lingual transfer is important for developing high-quality chatbots in multiple languages.
In this work, we investigate whether it is helpful to utilize machine translation (MT) at all in this task.
Experiments show that leveraging English dialog corpora can indeed improve the naturalness, relevance and cross-domain transferability in Chinese.
- Score: 21.973937517854935
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-lingual transfer is important for developing high-quality chatbots in
multiple languages due to the strongly imbalanced distribution of language
resources. A typical approach is to leverage off-the-shelf machine translation
(MT) systems to utilize either the training corpus or developed models from
high-resource languages. In this work, we investigate whether it is helpful to
utilize MT at all in this task. To do so, we simulate a low-resource scenario
assuming access to limited Chinese dialog data in the movie domain and large
amounts of English dialog data from multiple domains. Experiments show that
leveraging English dialog corpora can indeed improve the naturalness, relevance
and cross-domain transferability in Chinese. Surprisingly, however, directly using
the English dialog corpora in their original form works better than using their
translated versions. As the topics and wording habits in daily conversations are
strongly culture-dependent, MT can reinforce the bias from high-resource
languages, yielding unnatural generations in the target language. Considering
the cost of translating large amounts of text and the strong effect of
translation quality, we suggest that future research focus instead on
utilizing the original English data for cross-lingual transfer in dialog
generation. We perform extensive human evaluations and ablation studies. The
analysis results, together with the collected dataset, are presented to draw
attention towards this area and benefit future research.
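As a rough illustration of the transfer setup the paper argues for, the sketch below fine-tunes one multilingual seq2seq model on limited Chinese dialog pairs plus English dialog pairs kept in their original form, with no machine-translation step in between. The model choice, data, and hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch (not the authors' exact setup): mix untranslated English
# dialog data with the small Chinese corpus and fine-tune directly.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL = "google/mt5-small"  # assumption: any multilingual seq2seq would do
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Hypothetical (context, response) pairs: a large English corpus would sit
# alongside the small Chinese one; both are used untranslated.
pairs = [
    ("What did you think of the movie?", "I loved the ending."),
    ("这部电影怎么样？", "结局特别打动我。"),
]

model.train()
for context, response in pairs:
    inputs = tokenizer(context, return_tensors="pt", truncation=True)
    labels = tokenizer(text_target=response, return_tensors="pt",
                       truncation=True).input_ids
    loss = model(**inputs, labels=labels).loss  # standard cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The translate-train alternative would pass the English pairs through an MT system first; the abstract's finding is that this extra step can hurt naturalness.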
Related papers
- Investigating Neural Machine Translation for Low-Resource Languages: Using Bavarian as a Case Study [1.6819960041696331]
In this paper, we revisit state-of-the-art Neural Machine Translation techniques to develop automatic translation systems between German and Bavarian.
Our experiment entails applying back-translation and transfer learning to automatically generate more training data and achieve higher translation performance.
Statistical significance results with Bonferroni correction show surprisingly strong baseline systems, and that back-translation leads to significant improvements.
arXiv Detail & Related papers (2024-04-12T06:16:26Z) - Cross-Lingual Transfer Learning for Phrase Break Prediction with
- Cross-Lingual Transfer Learning for Phrase Break Prediction with Multilingual Language Model [13.730152819942445]
Cross-lingual transfer learning can be particularly effective for improving performance in low-resource languages.
This suggests that cross-lingual transfer can be an inexpensive and effective way to develop TTS front-ends in resource-poor languages.
arXiv Detail & Related papers (2023-06-05T04:10:04Z) - Revisiting Machine Translation for Cross-lingual Classification [91.43729067874503]
- Revisiting Machine Translation for Cross-lingual Classification [91.43729067874503]
Most research in the area focuses on the multilingual models rather than the machine translation component.
We show that, by using a stronger MT system and mitigating the mismatch between training on original text and running inference on machine-translated text, translate-test can do substantially better than previously assumed.
arXiv Detail & Related papers (2023-05-23T16:56:10Z) - Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
- Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
The Cross-lingual Outline-based Dialogue dataset (COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z) - Modeling Bilingual Conversational Characteristics for Neural Chat
Translation [24.94474722693084]
We aim to improve the translation quality of conversational text by modeling bilingual conversational characteristics.
We evaluate our approach on the benchmark dataset BConTrasT (English-German) and a self-collected bilingual dialogue corpus named BMELD (English-Chinese).
Our approach boosts performance over strong baselines by a large margin and significantly surpasses some state-of-the-art context-aware NMT models in terms of BLEU and TER.
arXiv Detail & Related papers (2021-07-23T12:23:34Z) - Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance over the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z) - An Empirical Study of Cross-Lingual Transferability in Generative
- An Empirical Study of Cross-Lingual Transferability in Generative Dialogue State Tracker [33.2309643963072]
We study the transferability of a cross-lingual generative dialogue state tracking system using a multilingual pre-trained seq2seq model.
We also find that our approaches have low cross-lingual transferability, and we provide further investigation and discussion.
arXiv Detail & Related papers (2021-01-27T12:45:55Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language
Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z) - Multilingual Argument Mining: Datasets and Analysis [9.117984896907782]
We explore the potential of transfer learning using the multilingual BERT model to address argument mining tasks in non-English languages.
We show that such methods are well suited for classifying the stance of arguments and detecting evidence, but less so for assessing the quality of arguments.
We provide a human-generated dataset with more than 10k arguments in multiple languages, as well as machine translations of the English datasets.
arXiv Detail & Related papers (2020-10-13T14:49:10Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
The Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z) - A Study of Cross-Lingual Ability and Language-specific Information in
Multilingual BERT [60.9051207862378]
Multilingual BERT works remarkably well on cross-lingual transfer tasks.
Data size and context window size are crucial factors for transferability.
There is a computationally cheap but effective approach to improve the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z)