Untangling the Unrestricted Web: Automatic Identification of Multilingual Registers
- URL: http://arxiv.org/abs/2406.19892v2
- Date: Fri, 19 Jul 2024 20:40:53 GMT
- Title: Untangling the Unrestricted Web: Automatic Identification of Multilingual Registers
- Authors: Erik Henriksson, Amanda Myntti, Anni Eskelinen, Selcen Erten-Johansson, Saara Hellström, Veronika Laippala
- Abstract summary: This article explores deep learning models for the automatic identification of registers in web-based datasets across 16 languages.
We show that multilingual models consistently outperform monolingual ones, especially benefiting languages with fewer training examples and smaller registers.
- Score: 1.1456104143595247
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This article explores deep learning models for the automatic identification of registers - text varieties such as news reports and discussion forums - in web-based datasets across 16 languages. Identifying web registers, or genres, is crucial for understanding the content of web-scale datasets, which have become essential in corpus and computational linguistics. Despite recent advances, the full potential of register classifiers in the noisy, unrestricted web remains largely unexplored, particularly in multilingual settings. We experiment with various deep learning models using the Multilingual CORE corpora, newly introduced in this article, which include 16 languages annotated with a detailed, hierarchical taxonomy of 25 registers designed to cover the entire web. Our classifiers achieve state-of-the-art results using a multi-label approach, demonstrating that competitive performance is possible using a relatively complex register taxonomy. However, all models hit a performance ceiling at approximately 80% F1 score, which we attribute to the non-discrete nature of web registers and the inherent uncertainty in labeling some documents. By pruning ambiguous examples, we enhance model performance to over 90%. Additionally, multilingual models consistently outperform monolingual ones, especially benefiting languages with fewer training examples and smaller registers. Although a zero-shot setting reduces performance by an average of 7%, these drops are not correlated with specific registers or languages. Instead, we find that registers are surprisingly similar across languages.
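The multi-label approach described in the abstract can be sketched in a few lines. The snippet below is a minimal, illustrative example only: it assumes an XLM-RoBERTa backbone, a flattened 25-label view of the register taxonomy, a 0.5 decision threshold, and hypothetical label indices; none of these details are confirmed as the authors' exact configuration.

```python
# Minimal multi-label register classification sketch (illustrative assumptions,
# not the paper's exact setup): "xlm-roberta-base" backbone, a flattened
# 25-label register taxonomy, and a 0.5 decision threshold.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_REGISTERS = 25               # size of the flattened register taxonomy (assumed)
MODEL_NAME = "xlm-roberta-base"  # assumed multilingual backbone

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_REGISTERS,
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)

def predict_registers(text: str, threshold: float = 0.5) -> list[int]:
    """Return indices of all registers whose sigmoid score exceeds the threshold."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # shape: (1, NUM_REGISTERS)
    probs = torch.sigmoid(logits).squeeze(0)
    return (probs > threshold).nonzero(as_tuple=True)[0].tolist()

# Training pairs each document with a 0/1 vector over all registers, so one
# document can carry several register labels at once (e.g. a hybrid of a news
# report and an opinion piece). The indices below are hypothetical.
labels = torch.zeros(1, NUM_REGISTERS)
labels[0, [3, 7]] = 1.0
batch = tokenizer("Example web document.", return_tensors="pt")
loss = model(**batch, labels=labels).loss        # BCEWithLogitsLoss over 25 labels
```

In this kind of setup, evaluation is typically a micro- or macro-averaged F1 over the per-register decisions; the abstract's jump from roughly 80% to over 90% F1 comes from pruning documents whose register labels are inherently ambiguous, not from a change in the model sketched above.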
Related papers
- Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts [50.40191599304911]
We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts.
We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both.
Our findings show that the effectiveness of transliteration varies by task type and model size.
arXiv Detail & Related papers (2024-07-02T14:51:20Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Efficient Spoken Language Recognition via Multilabel Classification [53.662747523872305]
We show that our models obtain competitive results while being orders of magnitude smaller and faster than current state-of-the-art methods.
Our multilabel strategy is more robust to unseen non-target languages compared to multiclass classification.
arXiv Detail & Related papers (2023-06-02T23:04:19Z)
- Using Language Models on Low-end Hardware [17.33390660481404]
This paper evaluates the viability of using fixed language models for training text classification networks on low-end hardware.
We combine language models with a CNN architecture and put together a comprehensive benchmark with 8 datasets covering single-label and multi-label classification of topic, sentiment, and genre.
arXiv Detail & Related papers (2023-05-03T18:00:03Z)
- Language-Agnostic Website Embedding and Classification [12.86558129722198]
We release a dataset with more than 1M websites in 92 languages with their respective labels collected from Curlie.
We introduce Homepage2Vec, a machine-learned model for classifying and embedding websites based on their homepage.
We show that Homepage2Vec correctly classifies websites with a macro-averaged F1-score of 0.90, with stable performance across low- as well as high-resource languages.
arXiv Detail & Related papers (2022-01-10T22:31:48Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as their informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study provides an assessment of existing language models for distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Are Multilingual Models Effective in Code-Switching? [57.78477547424949]
We study the effectiveness of multilingual language models to understand their capability and adaptability to the mixed-language setting.
Our findings suggest that pre-trained multilingual models do not necessarily guarantee high-quality representations on code-switching.
arXiv Detail & Related papers (2021-03-24T16:20:02Z)
- Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers [0.6526029433717663]
We explore cross-lingual transfer of register classification for web documents.
We introduce two new register annotated corpora, FreCORE and SweCORE, for French and Swedish.
Deep pre-trained language models perform strongly in these languages and outperform the previous state of the art in English and Finnish.
arXiv Detail & Related papers (2021-02-15T08:40:08Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)