Related papers: The Growing Gains and Pains of Iterative Web Corpora Crawling: Insights from South Slavic CLASSLA-web 2.0 Corpora

URL: http://arxiv.org/abs/2601.11170v1
Date: Fri, 16 Jan 2026 10:38:19 GMT
Title: The Growing Gains and Pains of Iterative Web Corpora Crawling: Insights from South Slavic CLASSLA-web 2.0 Corpora
Authors: Taja Kuzman Pungeršek, Peter Rupnik, Vít Suchomel, Nikola Ljubešić
Abstract summary: The CLASSLA-web 2.0 corpus collection contains 17.0 billion words in 38.1 million texts in seven languages. The new web corpora are automatically annotated with topic labels. Comparing CLASSLA-web 2.0 with its predecessor reveals that only one-fifth of the texts overlap.
Score: 0.5666456827479577
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Crawling national top-level domains has proven to be highly effective for collecting texts in less-resourced languages. This approach has been recently used for South Slavic languages and resulted in the largest general corpora for this language group: the CLASSLA-web 1.0 corpora. Building on this success, we established a continuous crawling infrastructure for iterative national top-level domain crawling across South Slavic and related webs. We present the first outcome of this crawling infrastructure - the CLASSLA-web 2.0 corpus collection, with substantially larger web corpora containing 17.0 billion words in 38.1 million texts in seven languages: Bosnian, Bulgarian, Croatian, Macedonian, Montenegrin, Serbian, and Slovenian. In addition to genre categories, the new version is also automatically annotated with topic labels. Comparing CLASSLA-web 2.0 with its predecessor reveals that only one-fifth of the texts overlap, showing that re-crawling after just two years yields largely new content. However, while the new web crawls bring growing gains, we also notice growing pains - a manual inspection of top domains reveals a visible degradation of web content, as machine-generated sites now contribute a significant portion of texts.

Related papers

Not All Visitors are Bilingual: A Measurement Study of the Multilingual Web from an Accessibility Perspective [11.062766066639398]
English is the predominant language on the web, powering nearly half of the world's top ten million websites. Support for multilingual content is growing, with many websites combining English with regional or native languages in both visible content and hidden metadata. This multilingualism introduces significant barriers for users with visual impairments. We introduce LangCrUX, the first large-scale dataset of 120,000 popular websites across 12 languages that primarily use non-Latin scripts.
arXiv Detail & Related papers (2025-08-25T02:29:57Z)
Multilingual Attribute Extraction from News Web Pages [44.99833362998488]
This paper addresses the challenge of automatically extracting attributes from news article web pages across multiple languages. We prepared a multilingual dataset comprising 3,172 marked-up news web pages across six languages (English, German, Russian, Chinese, Korean, and Arabic). We fine-tuned the pre-trained state-of-the-art model, MarkupLM, to extract news attributes from these pages and evaluated the impact of translating pages into English on extraction quality.
arXiv Detail & Related papers (2025-02-04T09:43:40Z)
CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation [4.450536872346658]
This paper presents a collection of highly comparable web corpora of Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian. The collection of these corpora comprises a total of 13 billion tokens of texts from 26 million documents. All the corpora were linguistically annotated with the state-of-the-art CLASSLA-Stanza linguistic processing pipeline.
arXiv Detail & Related papers (2024-03-19T13:30:47Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding [66.6468787004067]
We introduce the Wikipedia Webpage suite (WikiWeb2M) containing 2M pages with all of the associated image, text, and structure data. We design a novel attention mechanism Prefix Global, which selects the most relevant image and text content as global tokens to attend to the rest of the webpage for context.
arXiv Detail & Related papers (2023-05-05T16:38:05Z)
esCorpius: A Massive Spanish Crawling Corpus [2.262838186547612]
esCorpius is a Spanish crawling corpus obtained from nearly 1 Pb of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content.
arXiv Detail & Related papers (2022-06-30T09:29:18Z)
Language-Agnostic Website Embedding and Classification [12.86558129722198]
We release a dataset with more than 1M websites in 92 languages with relative labels collected from Curlie. We introduce Homepage2Vec, a machine-learned model for classifying and embedding websites based on their homepage. We show that Homepage2Vec correctly classifies websites with a macro-averaged F1-score of 0.90, with stable performance across low- as well as high-resource languages.
arXiv Detail & Related papers (2022-01-10T22:31:48Z)
What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus [77.34726150561087]
We analyze the Common Crawl, a colossal web corpus extensively used for training language models. We find that it contains a significant amount of undesirable content, including hate speech and sexually explicit content, even after filtering procedures.
arXiv Detail & Related papers (2021-05-06T14:49:43Z)
StateCensusLaws.org: A Web Application for Consuming and Annotating Legal Discourse Learning [89.77347919191774]
We create a web application to highlight the output of NLP models trained to parse and label discourse segments in law text. We focus on state-level law that uses U.S. Census population numbers to allocate resources and organize government.
arXiv Detail & Related papers (2021-04-20T22:00:54Z)
CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English. It is diversified, with over 11,000 speakers and over 60 accents. CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.