Improved Multilingual Language Model Pretraining for Social Media Text
via Translation Pair Prediction
- URL: http://arxiv.org/abs/2110.10318v1
- Date: Wed, 20 Oct 2021 00:06:26 GMT
- Title: Improved Multilingual Language Model Pretraining for Social Media Text
via Translation Pair Prediction
- Authors: Shubhanshu Mishra, Aria Haghighi
- Abstract summary: We evaluate a simple approach to improving zero-shot multilingual transfer of mBERT on a social media corpus.
Our approach assumes access to translations between source-target language pairs.
We show improvements from TPP pretraining over mBERT alone in zero-shot transfer from English to Hindi, Arabic, and Japanese.
- Score: 1.14219428942199
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We evaluate a simple approach to improving zero-shot multilingual transfer of
mBERT on a social media corpus by adding a pretraining task called translation
pair prediction (TPP), which predicts whether a pair of cross-lingual texts is
a valid translation. Our approach assumes access to translations (exact or
approximate) between source-target language pairs, where we fine-tune a model
on source language task data and evaluate the model in the target language. In
particular, we focus on language pairs where transfer learning is difficult for
mBERT: those where source and target languages are different in script,
vocabulary, and linguistic typology. We show improvements from TPP pretraining
over mBERT alone in zero-shot transfer from English to Hindi, Arabic, and
Japanese on two social media tasks: NER (a 37% average relative improvement in
F1 across target languages) and sentiment classification (12% relative
improvement in F1), while also benchmarking on the non-social-media task of
Universal Dependencies POS tagging (6.7% relative
improvement in accuracy). Our results are promising given the lack of social
media bitext corpora. Our code can be found at:
https://github.com/twitter-research/multilingual-alignment-tpp.
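As a concrete illustration of the TPP task, here is a minimal sketch of translation pair prediction as binary classification over cross-lingual sentence pairs with mBERT. The Hugging Face classes, the random negative-sampling scheme, and the toy bitext are illustrative assumptions rather than the paper's exact implementation; the linked repository contains the authors' code.

```python
# Minimal TPP sketch: classify whether (source, target) is a valid translation.
# Hyperparameters and negative sampling are illustrative, not the authors' setup.
import random
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2
)

def tpp_batch(bitext, k_negatives=1):
    """Build (source, target, label) examples: label 1 for aligned pairs,
    label 0 for a source text paired with a randomly sampled target text."""
    sources, targets, labels = [], [], []
    for src, tgt in bitext:
        sources.append(src); targets.append(tgt); labels.append(1)
        for _ in range(k_negatives):
            # May occasionally resample the true target; a real implementation
            # would exclude it when drawing negatives.
            neg = random.choice(bitext)[1]
            sources.append(src); targets.append(neg); labels.append(0)
    return sources, targets, torch.tensor(labels)

bitext = [("good morning", "शुभ प्रभात"), ("thank you", "धन्यवाद")]  # toy example
src, tgt, labels = tpp_batch(bitext)
enc = tokenizer(src, tgt, padding=True, truncation=True, return_tensors="pt")
loss = model(**enc, labels=labels).loss  # cross-entropy: translation vs. not
loss.backward()
```

In the paper's setup, a model pretrained with an objective like this would then be fine-tuned on source-language (English) task data and evaluated zero-shot in the target language.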
Related papers
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing [68.47787275021567]
Cross-lingual semantic parsing transfers parsing capability from a high-resource language (e.g., English) to low-resource languages with scarce training data.
We propose a new approach to cross-lingual semantic parsing by explicitly minimizing cross-lingual divergence between latent variables using Optimal Transport.
arXiv Detail & Related papers (2023-07-09T04:52:31Z)
- A Commonsense-Infused Language-Agnostic Learning Framework for Enhancing Prediction of Political Polarity in Multilingual News Headlines [0.0]
We use translation and retrieval to acquire inferential knowledge in the target language.
We then employ an attention mechanism to emphasise important inferences.
We present a dataset of over 62.6K multilingual news headlines in five European languages annotated with their respective political polarities.
arXiv Detail & Related papers (2022-12-01T06:07:01Z)
- Meta-Learning a Cross-lingual Manifold for Semantic Parsing [75.26271012018861]
Localizing a semantic parser to support new languages requires effective cross-lingual generalization.
We introduce a first-order meta-learning algorithm to train a semantic parser with maximal sample efficiency during cross-lingual transfer.
Results across six languages on ATIS demonstrate that our combination of steps yields accurate semantic parsers sampling ≤10% of source training data in each new language.
arXiv Detail & Related papers (2022-09-26T10:42:17Z)
- Syntax-augmented Multilingual BERT for Cross-lingual Transfer [37.99210035238424]
This work shows that explicitly providing language syntax when training mBERT helps cross-lingual transfer.
Experiment results show that syntax-augmented mBERT improves cross-lingual transfer on popular benchmarks.
arXiv Detail & Related papers (2021-06-03T21:12:50Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- Semi-automatic Generation of Multilingual Datasets for Stance Detection in Twitter [9.359018642178917]
This paper presents a method to obtain multilingual datasets for stance detection in Twitter.
We leverage user-based information to semi-automatically label large amounts of tweets.
arXiv Detail & Related papers (2021-01-28T13:05:09Z)
- Facebook AI's WMT20 News Translation Task Submission [69.92594751788403]
This paper describes Facebook AI's submission to WMT20 shared news translation task.
We focus on the low resource setting and participate in two language pairs, Tamil -> English and Inuktitut -> English.
We approach the low resource problem using two main strategies, leveraging all available data and adapting the system to the target news domain.
arXiv Detail & Related papers (2020-11-16T21:49:00Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
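To make the self-teaching idea above concrete, below is a minimal, hedged sketch of a KL-divergence self-teaching loss in the spirit of FILTER: the model's soft predictions on labeled source-language text act as pseudo-labels for its translation in the target language. The function name, tensor shapes, and temperature are illustrative assumptions, not FILTER's exact formulation.

```python
# Sketch (not FILTER's exact loss): KL between a pseudo-label distribution
# and the student's predicted distribution on translated text.
import torch
import torch.nn.functional as F

def self_teaching_kl(student_logits, pseudo_label_logits, temperature=1.0):
    """KL divergence from the pseudo-label distribution to the student's
    predictions, averaged over the batch."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_pseudo = F.softmax(pseudo_label_logits / temperature, dim=-1)
    # F.kl_div(log_q, p) computes KL(p || q) given log-probs and probs.
    return F.kl_div(log_p_student, p_pseudo, reduction="batchmean")

# Hypothetical batch of 8 examples over 3 classes:
pseudo_label_logits = torch.randn(8, 3).detach()        # soft pseudo-labels, no gradient
student_logits = torch.randn(8, 3, requires_grad=True)  # predictions on target-language text
self_teaching_kl(student_logits, pseudo_label_logits).backward()
```

In practice the pseudo-label logits would come from the same (or a frozen) model applied to the source-language side of the pair, so the loss nudges the target-language predictions toward the source-language ones.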