Out of Thin Air: Is Zero-Shot Cross-Lingual Keyword Detection Better Than Unsupervised?
 - URL: http://arxiv.org/abs/2202.06650v1
 - Date: Mon, 14 Feb 2022 12:06:45 GMT
 - Title: Out of Thin Air: Is Zero-Shot Cross-Lingual Keyword Detection Better Than Unsupervised?
 - Authors: Boshko Koloski, Senja Pollak, Blaž Škrlj, and Matej Martinc
 - Abstract summary: We study whether pretrained multilingual language models can be employed for zero-shot cross-lingual keyword extraction on low-resource languages.
The comparison is conducted on six news article datasets covering two high-resource languages, English and Russian, and four low-resource languages.
We find that the pretrained models fine-tuned on a multilingual corpus covering languages that do not appear in the test set consistently outscore unsupervised models in all six languages.
 - Score: 8.594972401685649
 - License: http://creativecommons.org/licenses/by/4.0/
 - Abstract: Keyword extraction is the task of retrieving words that are essential to the content of a given document. Researchers have proposed various approaches to tackle this problem. At the top-most level, approaches are divided into those that require training (supervised) and those that do not (unsupervised). In this study, we are interested in settings where no training data is available for the language under investigation. More specifically, we explore whether pretrained multilingual language models can be employed for zero-shot cross-lingual keyword extraction on low-resource languages with limited or no available labeled training data, and whether they outperform state-of-the-art unsupervised keyword extractors. The comparison is conducted on six news article datasets covering two high-resource languages, English and Russian, and four low-resource languages, Croatian, Estonian, Latvian, and Slovenian. We find that the pretrained models fine-tuned on a multilingual corpus covering languages that do not appear in the test set (i.e., in a zero-shot setting) consistently outscore unsupervised models in all six languages.
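
As a rough illustration of the setup described in the abstract, the sketch below frames keyword extraction as binary token classification on top of a multilingual encoder. This is a minimal, hypothetical sketch: the choice of encoder (bert-base-multilingual-cased), the keyword/non-keyword labelling scheme, the extract_keywords helper, and the probability threshold are illustrative assumptions, not the authors' published implementation.

    # Hypothetical sketch (not the paper's released code): keyword extraction
    # framed as binary token classification with a multilingual encoder.
    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    MODEL_NAME = "bert-base-multilingual-cased"  # illustrative multilingual encoder

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForTokenClassification.from_pretrained(
        MODEL_NAME, num_labels=2  # label 0 = not a keyword token, label 1 = keyword token
    )

    def extract_keywords(text: str, threshold: float = 0.5) -> list[str]:
        """Return subword tokens whose predicted keyword probability exceeds the threshold."""
        enc = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**enc).logits                 # shape: (1, seq_len, 2)
        keyword_probs = logits.softmax(dim=-1)[0, :, 1]  # per-token probability of the keyword class
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
        return [
            tok
            for tok, prob in zip(tokens, keyword_probs)
            if prob > threshold and tok not in tokenizer.all_special_tokens
        ]

    # In the zero-shot cross-lingual setting, the classification head would first be
    # fine-tuned on keyword-annotated corpora in languages other than the target
    # language, then applied unchanged to target-language documents.
    print(extract_keywords("Vlada je najavila nove mjere za gospodarstvo."))

With a randomly initialised head the scores are arbitrary; the sketch only conveys the structure of the pipeline, in which no target-language keyword annotations are ever used, matching the zero-shot setting that the paper compares against unsupervised extractors.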
 
       
      
        Related papers
        - Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages [55.157295899188476] (arXiv, 2024-08-05)
          Neural machine translation systems learn to map sentences of different languages into a common representation space. In this work, we test this hypothesis by translating from languages unseen during training and demonstrate that this setup enables zero-shot translation from entirely unseen languages.
        - Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214] (arXiv, 2024-02-03)
          We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets. We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
        - Improving Cross-lingual Information Retrieval on Low-Resource Languages via Optimal Transport Distillation [21.057178077747754] (arXiv, 2023-01-29)
          In this work, we propose OPTICAL: Optimal Transport distillation for low-resource Cross-lingual information retrieval. By separating the cross-lingual knowledge from the knowledge of query-document matching, OPTICAL only needs bitext data for distillation training. Experimental results show that, with minimal training data, OPTICAL significantly outperforms strong baselines on low-resource languages.
        - Cross-lingual Transfer Learning for Check-worthy Claim Identification over Twitter [7.601937548486356] (arXiv, 2022-11-09)
          Misinformation spread over social media has become an undeniable infodemic. We present a systematic study of six approaches for cross-lingual check-worthiness estimation across pairs of five diverse languages with the help of the multilingual BERT (mBERT) model. Our results show that for some language pairs, zero-shot cross-lingual transfer is possible and can perform as well as monolingual models trained on the target language.
        - Detecting Languages Unintelligible to Multilingual Models through Local Structure Probes [15.870989191524094] (arXiv, 2022-11-09)
          We develop a general approach that requires only unlabelled text to detect which languages are not well understood by a cross-lingual model. Our approach is derived from the hypothesis that if a model's understanding is insensitive to perturbations to text in a language, it is likely to have a limited understanding of that language.
        - From Masked Language Modeling to Translation: Non-English Auxiliary Tasks Improve Zero-shot Spoken Language Understanding [24.149299722716155] (arXiv, 2021-05-15)
          We introduce xSID, a new benchmark for cross-lingual Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect. We propose a joint learning approach, with English SLU training data and non-English auxiliary tasks from raw text, syntax, and translation for transfer. Our results show that jointly learning the main tasks with masked language modeling is effective for slots, while machine translation transfer works best for intent classification.
        - AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages [75.08199398141744] (arXiv, 2021-04-18)
          We present AmericasNLI, an extension of XNLI (Conneau et al.) to 10 indigenous languages of the Americas. We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches. We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
        - Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143] (arXiv, 2021-03-18)
          Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages. Previous research has attributed weak performance to these representations not being sufficiently aligned. In this paper, we enhance bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
        - Harnessing Multilinguality in Unsupervised Machine Translation for Rare Languages [48.28540903568198] (arXiv, 2020-09-23)
          We show that multilinguality is critical to making unsupervised systems practical for low-resource settings. We present a single model for 5 low-resource languages (Gujarati, Kazakh, Nepali, Sinhala, and Turkish), translating to and from English. We outperform all current state-of-the-art unsupervised baselines for these languages, achieving gains of up to 14.4 BLEU.
        - A Call for More Rigor in Unsupervised Cross-lingual Learning [76.6545568416577] (arXiv, 2020-04-30)
          An existing rationale for such research is based on the lack of parallel data for many of the world's languages. We argue that a scenario with no parallel data whatsoever yet abundant monolingual data is unrealistic in practice.
        - Multilingual acoustic word embedding models for processing zero-resource languages [37.78342106714364] (arXiv, 2020-02-06)
          We train a single supervised embedding model on labelled data from multiple well-resourced languages. We then apply it to unseen zero-resource languages.
        This list is automatically generated from the titles and abstracts of the papers on this site.
       
     