Out of Thin Air: Is Zero-Shot Cross-Lingual Keyword Detection Better Than Unsupervised?
- URL: http://arxiv.org/abs/2202.06650v1
- Date: Mon, 14 Feb 2022 12:06:45 GMT
- Title: Out of Thin Air: Is Zero-Shot Cross-Lingual Keyword Detection Better Than Unsupervised?
- Authors: Boshko Koloski, Senja Pollak, Blaž Škrlj, and Matej Martinc
- Abstract summary: We study whether pretrained multilingual language models can be employed for zero-shot cross-lingual keyword extraction on low-resource languages.
The comparison is conducted on six news article datasets covering two high-resource languages, English and Russian, and four low-resource languages.
We find that the pretrained models fine-tuned on a multilingual corpus covering languages that do not appear in the test set consistently outscore unsupervised models in all six languages.
- Score: 8.594972401685649
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Keyword extraction is the task of retrieving words that are essential to the
content of a given document. Researchers have proposed various approaches to tackle this problem. At the topmost level, approaches are divided into those that require training (supervised) and those that do not (unsupervised). In this study, we are interested in settings where no training data is available for the language under investigation. More specifically, we explore whether pretrained
no training data is available. More specifically, we explore whether pretrained
multilingual language models can be employed for zero-shot cross-lingual
keyword extraction on low-resource languages with limited or no available
labeled training data and whether they outperform state-of-the-art unsupervised
keyword extractors. The comparison is conducted on six news article datasets
covering two high-resource languages, English and Russian, and four
low-resource languages, Croatian, Estonian, Latvian, and Slovenian. We find
that the pretrained models fine-tuned on a multilingual corpus covering
languages that do not appear in the test set (i.e., in a zero-shot setting)
consistently outscore unsupervised models in all six languages.
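To make the comparison concrete, below is a minimal sketch of how zero-shot cross-lingual keyword extraction can be framed as BIO token classification over a multilingual encoder, using the Hugging Face transformers API. The checkpoint name, tag set, and span-merging logic are illustrative assumptions rather than the authors' exact pipeline; in the zero-shot setting described above, the classification head would be fine-tuned only on languages that never appear in the test set.

```python
# Minimal sketch: zero-shot cross-lingual keyword extraction as BIO token
# classification. Assumptions (not the paper's exact setup): "xlm-roberta-base"
# stands in for a multilingual model already fine-tuned for keyword tagging on
# other languages; with the raw base checkpoint the classification head is
# randomly initialized, so the printed output is structural only.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "xlm-roberta-base"            # placeholder for a fine-tuned checkpoint
ID2TAG = {0: "O", 1: "B-KEY", 2: "I-KEY"}  # assumed tag scheme

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=len(ID2TAG))
model.eval()


def extract_keywords(text: str) -> list[str]:
    """Tag subword tokens and merge B-KEY/I-KEY runs back into surface strings."""
    enc = tokenizer(text, return_tensors="pt", truncation=True,
                    return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()   # character span per subword
    with torch.no_grad():
        logits = model(**enc).logits[0]               # (seq_len, num_labels)
    tags = [ID2TAG[i] for i in logits.argmax(dim=-1).tolist()]

    spans, current = [], None                         # current = [start, end] in characters
    for (start, end), tag in zip(offsets, tags):
        if start == end:                              # special tokens have empty spans
            continue
        if tag == "B-KEY":
            if current:
                spans.append(current)
            current = [start, end]
        elif tag == "I-KEY" and current is not None:
            current[1] = end                          # extend the open keyword span
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [text[s:e] for s, e in spans]


# Applied zero-shot: the document language (here Slovenian) is assumed never to
# have been seen during keyword-tagging fine-tuning.
print(extract_keywords("Vlada je sprejela nov zakon o financiranju medijev."))
```

An unsupervised competitor (for example, a YAKE- or TF-IDF-style ranker) would be run on the same documents with no labeled data at all; that is the kind of baseline the zero-shot models are reported to outperform.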
Related papers
- Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages [55.157295899188476]
Neural machine translation systems learn to map sentences of different languages into a common representation space.
In this work, we test this hypothesis by zero-shot translating from unseen languages.
We demonstrate that this setup enables zero-shot translation from entirely unseen languages.
arXiv Detail & Related papers (2024-08-05T07:58:58Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- Improving Cross-lingual Information Retrieval on Low-Resource Languages via Optimal Transport Distillation [21.057178077747754]
In this work, we propose OPTICAL: Optimal Transport distillation for low-resource Cross-lingual information retrieval.
By separating cross-lingual knowledge from query-document matching knowledge, OPTICAL needs only bitext data for distillation training.
Experimental results show that, with minimal training data, OPTICAL significantly outperforms strong baselines on low-resource languages.
arXiv Detail & Related papers (2023-01-29T22:30:36Z)
- Cross-lingual Transfer Learning for Check-worthy Claim Identification over Twitter [7.601937548486356]
The spread of misinformation over social media has become an undeniable infodemic.
We present a systematic study of six approaches for cross-lingual check-worthiness estimation across pairs of five diverse languages with the help of the multilingual BERT (mBERT) model.
Our results show that for some language pairs, zero-shot cross-lingual transfer is possible and can perform as well as monolingual models trained on the target language.
arXiv Detail & Related papers (2022-11-09T18:18:53Z)
- Detecting Languages Unintelligible to Multilingual Models through Local Structure Probes [15.870989191524094]
We develop a general approach that requires only unlabelled text to detect which languages are not well understood by a cross-lingual model.
Our approach is derived from the hypothesis that if a model's understanding is insensitive to perturbations to text in a language, it is likely to have a limited understanding of that language.
arXiv Detail & Related papers (2022-11-09T16:45:16Z)
- From Masked Language Modeling to Translation: Non-English Auxiliary Tasks Improve Zero-shot Spoken Language Understanding [24.149299722716155]
We introduce xSID, a new benchmark for cross-lingual Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect.
We propose a joint learning approach, with English SLU training data and non-English auxiliary tasks from raw text, syntax and translation for transfer.
Our results show that jointly learning the main tasks with masked language modeling is effective for slots, while machine translation transfer works best for intent classification.
arXiv Detail & Related papers (2021-05-15T23:51:11Z)
- AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages [75.08199398141744]
We present AmericasNLI, an extension of XNLI (Conneau et al.) to 10 indigenous languages of the Americas.
We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches.
We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
arXiv Detail & Related papers (2021-04-18T05:32:28Z)
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical and high-level representations of the two languages.
Previous research has shown that unsupervised translation quality suffers because these representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
- Harnessing Multilinguality in Unsupervised Machine Translation for Rare Languages [48.28540903568198]
We show that multilinguality is critical to making unsupervised systems practical for low-resource settings.
We present a single model for five low-resource languages (Gujarati, Kazakh, Nepali, Sinhala, and Turkish), translating to and from English.
We outperform all current state-of-the-art unsupervised baselines for these languages, achieving gains of up to 14.4 BLEU.
arXiv Detail & Related papers (2020-09-23T15:07:33Z)
- A Call for More Rigor in Unsupervised Cross-lingual Learning [76.6545568416577]
An existing rationale for such research is based on the lack of parallel data for many of the world's languages.
We argue that a scenario with no parallel data but abundant monolingual data is unrealistic in practice.
arXiv Detail & Related papers (2020-04-30T17:06:23Z)
- Multilingual acoustic word embedding models for processing zero-resource languages [37.78342106714364]
We train a single supervised embedding model on labelled data from multiple well-resourced languages.
We then apply it to unseen zero-resource languages.
arXiv Detail & Related papers (2020-02-06T05:53:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.