Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight
Monolingual Classification of Registers
- URL: http://arxiv.org/abs/2102.07396v1
- Date: Mon, 15 Feb 2021 08:40:08 GMT
- Title: Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight
Monolingual Classification of Registers
- Authors: Liina Repo, Valtteri Skantsi, Samuel Rönnqvist, Saara Hellström,
Miika Oinonen, Anna Salmela, Douglas Biber, Jesse Egbert, Sampo Pyysalo and
Veronika Laippala
- Abstract summary: We explore cross-lingual transfer of register classification for web documents.
We introduce two new register-annotated corpora, FreCORE and SweCORE, for French and Swedish.
Deep pre-trained language models perform strongly in these languages and outperform the previous state of the art in English and Finnish.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore cross-lingual transfer of register classification for web
documents. Registers, that is, text varieties such as blogs or news, are one of
the primary predictors of linguistic variation and thus affect the automatic
processing of language. We introduce two new register-annotated corpora,
FreCORE and SweCORE, for French and Swedish. We demonstrate that deep
pre-trained language models perform strongly in these languages and outperform
the previous state of the art in English and Finnish. Specifically, we show 1)
that zero-shot cross-lingual transfer from the large English CORE corpus can
match or surpass previously published monolingual models, and 2) that
lightweight monolingual classification requiring very little training data can
reach or surpass our zero-shot performance. We further analyse the
classification results, finding that certain registers continue to pose
challenges, in particular for cross-lingual transfer.
Related papers
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer [81.5984433881309]
We introduce BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format.
BUFFET is designed to establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer.
Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer.
arXiv Detail & Related papers (2023-05-24T08:06:33Z)
- Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z)
- Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models [79.38278330678965]
We find that common English pretraining corpora contain significant amounts of non-English text.
This leads to hundreds of millions of foreign language tokens in large-scale datasets.
We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
arXiv Detail & Related papers (2022-04-17T23:56:54Z)
- MultiEURLEX -- A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer [13.24356999779404]
We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal documents.
The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy.
We use the dataset as a testbed for zero-shot cross-lingual transfer, exploiting annotated training documents in one language (source) to classify documents in another language (target).
arXiv Detail & Related papers (2021-09-02T12:52:55Z)
- Revisiting the Primacy of English in Zero-shot Cross-lingual Transfer [39.360667403003745]
Zero-shot cross-lingual transfer is emerging as a practical solution.
English is the dominant source language for transfer, as reinforced by popular zero-shot benchmarks.
We find that other high-resource languages such as German and Russian often transfer more effectively.
arXiv Detail & Related papers (2021-06-30T16:05:57Z)
- Bilingual Alignment Pre-training for Zero-shot Cross-lingual Transfer [33.680292990007366]
In this paper, we aim to improve the zero-shot cross-lingual transfer performance by aligning the embeddings better.
We propose a pre-training task named Alignment Language Model (AlignLM) which uses the statistical alignment information as the prior knowledge to guide bilingual word prediction.
The results show AlignLM can improve the zero-shot performance significantly on MLQA and XNLI datasets.
arXiv Detail & Related papers (2021-06-03T10:18:43Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.