Language-Agnostic Website Embedding and Classification
- URL: http://arxiv.org/abs/2201.03677v1
- Date: Mon, 10 Jan 2022 22:31:48 GMT
- Title: Language-Agnostic Website Embedding and Classification
- Authors: Sylvain Lugeon, Tiziano Piccardi, Robert West
- Abstract summary: We release a dataset of more than 1M websites in 92 languages, with their corresponding labels collected from Curlie.
We introduce Homepage2Vec, a machine-learned model for classifying and embedding websites based on their homepage.
We show that Homepage2Vec correctly classifies websites with a macro-averaged F1-score of 0.90, with stable performance across low- as well as high-resource languages.
- Score: 12.86558129722198
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Currently, publicly available models for website classification do not offer
an embedding method and have limited support for languages beyond English. We
release a dataset of more than 1M websites in 92 languages, with corresponding
labels collected from Curlie, the largest multilingual crowdsourced Web
directory. The dataset contains 14 website categories aligned across languages.
Alongside it, we introduce Homepage2Vec, a machine-learned pre-trained model
for classifying and embedding websites based on their homepage in a
language-agnostic way. Homepage2Vec, thanks to its feature set (textual
content, metadata tags, and visual attributes) and recent progress in natural
language representation, is language-independent by design and can generate
embedding representations of websites. We show that Homepage2Vec correctly classifies
websites with a macro-averaged F1-score of 0.90, with stable performance across
low- as well as high-resource languages. Feature analysis shows that a small
subset of efficiently computable features suffices to achieve high performance
even with limited computational resources. We make publicly available the
curated Curlie dataset aligned across languages, the pre-trained Homepage2Vec
model, and the accompanying libraries.
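As an illustration of the approach described above, the following Python sketch shows how a homepage might be turned into a language-agnostic feature vector and scored against the 14 Curlie categories: visible text and metadata tags are extracted from the HTML, embedded with a multilingual sentence encoder, and passed through a multi-label classification head. This is a minimal sketch under assumed choices (requests, BeautifulSoup, sentence-transformers, PyTorch, the encoder checkpoint, and the listed category names), not the authors' released Homepage2Vec implementation; in practice the released pre-trained model and libraries should be used.
```python
# Hypothetical sketch of a language-agnostic homepage classifier in the spirit of
# Homepage2Vec. Library choices and the category list are illustrative assumptions,
# not the authors' released implementation.
import requests
import torch
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer

# Assumed 14 top-level Curlie categories (the dataset defines the exact label set).
CATEGORIES = [
    "Arts", "Business", "Computers", "Games", "Health", "Home", "Kids_and_Teens",
    "News", "Recreation", "Reference", "Science", "Shopping", "Society", "Sports",
]

# Multilingual sentence encoder -> language-agnostic text embeddings (kept on CPU).
encoder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2", device="cpu")

def homepage_features(url: str) -> torch.Tensor:
    """Fetch a homepage and embed its URL, title, metadata tags, and visible text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(" ", strip=True) if soup.title else ""
    description = ""
    meta = soup.find("meta", attrs={"name": "description"})
    if meta and meta.get("content"):
        description = meta["content"]
    body_text = soup.get_text(" ", strip=True)[:2000]  # truncate very long pages
    # Embed each field separately and concatenate into one feature vector.
    fields = [url, title, description, body_text]
    embeddings = encoder.encode(fields, convert_to_tensor=True)
    return embeddings.flatten()

# Multi-label head: one sigmoid score per category (a website can belong to several).
# Randomly initialized here for illustration; Homepage2Vec ships a trained model.
classifier = torch.nn.Linear(4 * encoder.get_sentence_embedding_dimension(), len(CATEGORIES))

features = homepage_features("https://www.epfl.ch")
scores = torch.sigmoid(classifier(features))
for category, score in zip(CATEGORIES, scores.tolist()):
    print(f"{category}: {score:.2f}")
```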
Related papers
- Untangling the Unrestricted Web: Automatic Identification of Multilingual Registers [1.1456104143595247]
This article explores deep learning models for the automatic identification of registers in web-based datasets across 16 languages.
We show that multilingual models consistently outperform monolingual ones, especially benefiting languages with fewer training examples and smaller registers.
arXiv Detail & Related papers (2024-06-28T13:00:30Z)
- A New Massive Multilingual Dataset for High-Performance Language Technologies [14.375854322321997]
The HPLT language resources are a new massive multilingual dataset including both monolingual and bilingual corpora.
Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of 5.6 trillion word tokens de-duplicated on the document level.
Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens.
arXiv Detail & Related papers (2024-03-20T22:14:39Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan [0.05277024349608833]
This work focuses on Catalan with the aim of exploring to what extent a medium-sized monolingual language model is competitive with state-of-the-art large multilingual models.
We build a clean, high-quality textual Catalan corpus (CaText), train a Transformer-based language model for Catalan (BERTa), and devise a thorough evaluation in a diversity of settings.
The result is a new benchmark, the Catalan Language Understanding Benchmark (CLUB), which we publish as an open resource.
arXiv Detail & Related papers (2021-07-16T13:52:01Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- The Tatoeba Translation Challenge -- Realistic Data Sets for Low Resource and Multilingual MT [0.0]
This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs.
The main goal is to trigger the development of open translation tools and models with much broader coverage of the world's languages.
arXiv Detail & Related papers (2020-10-13T13:12:21Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences of its use.