Untangling the Unrestricted Web: Automatic Identification of Multilingual Registers
- URL: http://arxiv.org/abs/2406.19892v2
- Date: Fri, 19 Jul 2024 20:40:53 GMT
- Title: Untangling the Unrestricted Web: Automatic Identification of Multilingual Registers
- Authors: Erik Henriksson, Amanda Myntti, Anni Eskelinen, Selcen Erten-Johansson, Saara Hellström, Veronika Laippala
- Abstract summary: This article explores deep learning models for the automatic identification of registers in web-based datasets across 16 languages.
We show that multilingual models consistently outperform monolingual ones, especially benefiting languages with fewer training examples and smaller registers.
- Score: 1.1456104143595247
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This article explores deep learning models for the automatic identification of registers - text varieties such as news reports and discussion forums - in web-based datasets across 16 languages. Identifying web registers, or genres, is crucial for understanding the content of web-scale datasets, which have become essential in corpus and computational linguistics. Despite recent advances, the full potential of register classifiers in the noisy, unrestricted web remains largely unexplored, particularly in multilingual settings. We experiment with various deep learning models using the Multilingual CORE corpora, newly introduced in this article, which include 16 languages annotated with a detailed, hierarchical taxonomy of 25 registers designed to cover the entire web. Our classifiers achieve state-of-the-art results using a multi-label approach, demonstrating that competitive performance is possible using a relatively complex register taxonomy. However, all models hit a performance ceiling at approximately 80% F1 score, which we attribute to the non-discrete nature of web registers and the inherent uncertainty in labeling some documents. By pruning ambiguous examples, we enhance model performance to over 90%. Additionally, multilingual models consistently outperform monolingual ones, especially benefiting languages with fewer training examples and smaller registers. Although a zero-shot setting reduces performance by an average of 7%, these drops are not correlated with specific registers or languages. Instead, we find that registers are surprisingly similar across languages.
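The multi-label approach at the heart of the paper is easy to make concrete. Below is a minimal sketch of scoring one web document against a handful of CORE-style register labels with an XLM-R classifier; the model name, the eight-label subset of the 25-register taxonomy, and the decision threshold are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of multi-label register classification with XLM-R.
# Model, label subset, and threshold are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder subset of CORE main registers (e.g., NA = Narrative,
# IN = Informational description, OP = Opinion, ID = Interactive discussion).
REGISTERS = ["NA", "IN", "OP", "ID", "HI", "LY", "SP", "IP"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(REGISTERS),
    # Uses a sigmoid per label with BCE loss instead of a softmax, so a
    # document can carry several register labels at once, or none.
    problem_type="multi_label_classification",
)

def predict_registers(text: str, threshold: float = 0.5) -> list[str]:
    """Return every register whose sigmoid probability clears the threshold.

    Note: the classification head here is randomly initialized; real use
    requires fine-tuning on register-annotated data such as Multilingual CORE.
    """
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.sigmoid(logits).squeeze(0)
    return [label for label, p in zip(REGISTERS, probs) if p >= threshold]

print(predict_registers("Breaking: markets rallied today after the announcement ..."))
```

The key design choice is the independent sigmoid per label: unlike a softmax classifier, it lets hybrid documents (say, an opinionated news report) receive multiple labels, which is what makes a relatively complex taxonomy workable on the unrestricted web.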
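The abstract attributes the roughly 80% F1 ceiling to the non-discrete nature of web registers and reports that pruning ambiguous examples lifts performance above 90%. One plausible way to operationalize "ambiguous" is to drop documents whose predicted label probabilities fall in a mid-range band; the band below is an assumption for illustration, since the abstract does not specify the pruning criterion.

```python
import numpy as np

def prune_ambiguous(probs: np.ndarray, low: float = 0.35, high: float = 0.65) -> np.ndarray:
    """Return indices of documents to keep for evaluation.

    A document is kept only if every register probability is a clear yes
    (>= high) or a clear no (<= low). `probs` has shape
    (n_documents, n_registers); the (low, high) band is an illustrative
    assumption, not the paper's setting.
    """
    ambiguous = (probs > low) & (probs < high)
    return np.flatnonzero(~ambiguous.any(axis=1))

# Example: document 1 has a borderline 0.55 score, so it is pruned.
probs = np.array([
    [0.95, 0.02, 0.10],  # confident -> kept
    [0.55, 0.01, 0.03],  # borderline -> pruned
    [0.05, 0.88, 0.91],  # confident -> kept
])
print(prune_ambiguous(probs))  # [0 2]
```

Such filtering trades coverage for reliability: the pruned documents remain in the corpus but are excluded from the evaluated subset, consistent with the abstract's point that some web documents are inherently uncertain to label.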
Related papers
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Efficient Spoken Language Recognition via Multilabel Classification [53.662747523872305]
We show that our models obtain competitive results while being orders of magnitude smaller and faster than current state-of-the-art methods.
Our multilabel strategy is more robust to unseen non-target languages compared to multiclass classification.
arXiv Detail & Related papers (2023-06-02T23:04:19Z)
- Using Language Models on Low-end Hardware [17.33390660481404]
This paper evaluates the viability of using fixed language models for training text classification networks on low-end hardware.
We combine language models with a CNN architecture and put together a comprehensive benchmark with 8 datasets covering single-label and multi-label classification of topic, sentiment, and genre.
arXiv Detail & Related papers (2023-05-03T18:00:03Z)
- Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities [35.15674061731237]
This paper explores large-scale multilingual ASR models on 70 languages.
We show that our multilingual ASR generalizes well on an unseen dataset and domain, achieving 9.5% and 7.5% WER on Multilingual Librispeech (MLS) with zero-shot and finetuning, respectively.
arXiv Detail & Related papers (2022-11-10T18:43:42Z)
- OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval [91.76575626229824]
We present OneAligner, an alignment model specially designed for sentence retrieval tasks.
When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result.
We conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size.
arXiv Detail & Related papers (2022-05-17T19:52:42Z)
- Language-Agnostic Website Embedding and Classification [12.86558129722198]
We release a dataset with more than 1M websites in 92 languages with relative labels collected from Curlie.
We introduce Homepage2Vec, a machine-learned model for classifying and embedding websites based on their homepage.
We show that Homepage2Vec correctly classifies websites with a macro-averaged F1-score of 0.90, with stable performance across low- as well as high-resource languages.
arXiv Detail & Related papers (2022-01-10T22:31:48Z)
- Are Multilingual Models Effective in Code-Switching? [57.78477547424949]
We study the effectiveness of multilingual language models to understand their capability and adaptability to the mixed-language setting.
Our findings suggest that pre-trained multilingual models do not necessarily guarantee high-quality representations on code-switching.
arXiv Detail & Related papers (2021-03-24T16:20:02Z)
- Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers [0.6526029433717663]
We explore cross-lingual transfer of register classification for web documents.
We introduce two new register annotated corpora, FreCORE and SweCORE, for French and Swedish.
Deep pre-trained language models perform strongly in these languages and outperform previous state-of-the-art in English and Finnish.
arXiv Detail & Related papers (2021-02-15T08:40:08Z)
- Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus [15.807197703827818]
We train LangID models on up to 1,629 languages with comparable quality on held-out test sets.
We find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages.
We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters and transformer-based semi-supervised LangID models.
arXiv Detail & Related papers (2020-10-27T19:29:17Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
- Learning to Scale Multilingual Representations for Vision-Language Tasks [51.27839182889422]
The effectiveness of SMALR (a scalable multilingual aligned language representation) is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date.
We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.
arXiv Detail & Related papers (2020-04-09T01:03:44Z)