UXLA: A Robust Unsupervised Data Augmentation Framework for
Zero-Resource Cross-Lingual NLP
- URL: http://arxiv.org/abs/2004.13240v4
- Date: Sat, 26 Jun 2021 04:16:43 GMT
- Title: UXLA: A Robust Unsupervised Data Augmentation Framework for
Zero-Resource Cross-Lingual NLP
- Authors: M Saiful Bari, Tasnim Mohiuddin, Shafiq Joty
- Abstract summary: We propose UXLA, a novel unsupervised data augmentation framework for zero-resource transfer learning scenarios.
In particular, UXLA aims to solve cross-lingual adaptation problems from a source language task distribution to an unknown target language task distribution.
At its core, UXLA performs simultaneous self-training with data augmentation and unsupervised sample selection.
- Score: 19.65783178853385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transfer learning has yielded state-of-the-art (SoTA) results in many
supervised NLP tasks. However, annotated data for every target task in every
target language is rare, especially for low-resource languages. We propose
UXLA, a novel unsupervised data augmentation framework for zero-resource
transfer learning scenarios. In particular, UXLA aims to solve cross-lingual
adaptation problems from a source language task distribution to an unknown
target language task distribution, assuming no training label in the target
language. At its core, UXLA performs simultaneous self-training with data
augmentation and unsupervised sample selection. To show its effectiveness, we
conduct extensive experiments on three diverse zero-resource cross-lingual
transfer tasks. UXLA achieves SoTA results in all the tasks, outperforming the
baselines by a good margin. With an in-depth framework dissection, we
demonstrate the cumulative contributions of different components to its
success.
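A minimal sketch of the core loop described above, i.e. self-training with data augmentation and unsupervised (confidence-based) sample selection, is given below. This is not the authors' implementation: the scikit-learn classifier, the augment callback, and the fixed confidence threshold are illustrative placeholders for UXLA's multilingual model, augmentation strategies, and selection criterion.

# Illustrative self-training loop with augmentation and sample selection
# (a sketch under simplifying assumptions, not the UXLA code).
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(x_src, y_src, x_tgt, augment, rounds=3, threshold=0.9):
    # Start from a model trained on the labeled source-language data only.
    model = LogisticRegression(max_iter=1000).fit(x_src, y_src)
    x_train, y_train = x_src, y_src
    for _ in range(rounds):
        # Data augmentation: add perturbed copies of the unlabeled target data.
        x_pool = np.vstack([x_tgt, augment(x_tgt)])
        probs = model.predict_proba(x_pool)
        # Unsupervised sample selection: keep only confident pseudo-labels.
        keep = probs.max(axis=1) >= threshold
        x_train = np.vstack([x_train, x_pool[keep]])
        y_train = np.concatenate([y_train, probs[keep].argmax(axis=1)])
        # Re-train on source labels plus the selected pseudo-labeled samples.
        model = LogisticRegression(max_iter=1000).fit(x_train, y_train)
    return model

# Toy usage: random vectors stand in for multilingual sentence encodings.
rng = np.random.default_rng(0)
x_src = rng.normal(size=(200, 16)); y_src = (x_src[:, 0] > 0).astype(int)
x_tgt = rng.normal(loc=0.2, size=(300, 16))
noise = lambda x: x + rng.normal(scale=0.1, size=x.shape)
adapted = self_train(x_src, y_src, x_tgt, augment=noise)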
Related papers
- Boosting Zero-Shot Crosslingual Performance using LLM-Based Augmentations with Effective Data Selection [23.575482348558904]
Large language models (LLMs) are very proficient text generators.
We leverage this capability to generate task-specific data via zero-shot prompting.
We observe significant performance gains across sentiment analysis and natural language inference tasks.
arXiv: 2024-07-15
- ZGUL: Zero-shot Generalization to Unseen Languages using Multi-source Ensembling of Language Adapters [29.211715245603234]
We tackle the problem of zero-shot cross-lingual transfer in NLP tasks via the use of language adapters (LAs).
Training a target LA requires unlabeled data, which may not be readily available for low-resource unseen languages.
We extend ZGUL to settings where either (1) some unlabeled data or (2) few-shot training examples are available for the target language.
arXiv: 2023-10-25
- Self-Augmentation Improves Zero-Shot Cross-Lingual Transfer [92.80671770992572]
Cross-lingual transfer is a central task in multilingual NLP.
Earlier efforts on this task use parallel corpora, bilingual dictionaries, or other annotated alignment data.
We propose a simple yet effective method, SALT, to improve zero-shot cross-lingual transfer.
arXiv: 2023-09-19
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost transfer learning (TL) method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv: 2023-01-22
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
arXiv: 2022-10-13
- ZmBART: An Unsupervised Cross-lingual Transfer Framework for Language Generation [4.874780144224057]
Cross-lingual transfer for natural language generation is relatively understudied.
We consider four NLG tasks (text summarization, question generation, news headline generation, and distractor generation) and three syntactically diverse languages.
We propose an unsupervised cross-lingual language generation framework (called ZmBART) that does not use any parallel or pseudo-parallel/back-translated data.
arXiv: 2021-06-03
- XeroAlign: Zero-Shot Cross-lingual Transformer Alignment [9.340611077939828]
We introduce a method for task-specific alignment of cross-lingual pretrained transformers such as XLM-R.
XeroAlign uses translated task data to encourage the model to generate similar sentence embeddings for different languages.
XLM-RA's text classification accuracy exceeds that of XLM-R trained with labelled data, and XLM-RA performs on par with state-of-the-art models on a cross-lingual adversarial paraphrasing task.
arXiv: 2021-05-06
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves improvements of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 points over state-of-the-art results.
arXiv: 2020-10-18
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language (a minimal sketch of such a loss follows this list).
arXiv: 2020-09-10
- XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation [100.09099800591822]
XGLUE is a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models.
XGLUE provides 11 diversified tasks that cover both natural language understanding and generation scenarios.
arXiv: 2020-04-03
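As noted in the FILTER entry above, one recurring ingredient in these zero-shot transfer methods is a KL-divergence self-teaching loss on soft pseudo-labels for translated target-language text. A minimal PyTorch sketch of such a loss is given below; the function name, temperature, and tensor shapes are assumptions for illustration, not details taken from any of the papers.

# Sketch of a KL-divergence self-teaching loss on soft pseudo-labels
# (illustrative; shapes, names, and temperature are assumed).
import torch
import torch.nn.functional as F

def self_teaching_loss(student_logits, teacher_logits, temperature=1.0):
    # Soft pseudo-labels from the (detached) teacher predictions on translated text.
    teacher_probs = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    # Student log-probabilities for the same examples.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), averaged over the batch.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Toy usage: a batch of 4 examples with 3 classes.
student_logits = torch.randn(4, 3, requires_grad=True)
teacher_logits = torch.randn(4, 3)
loss = self_teaching_loss(student_logits, teacher_logits)
loss.backward()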
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.