CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
- URL: http://arxiv.org/abs/2601.18026v1
- Date: Sun, 25 Jan 2026 22:49:30 GMT
- Title: CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
- Authors: Pedro Ortiz Suarez, Laurie Burchell, Catherine Arnett, Rafael Mosquera-Gómez, Sara Hincapie-Monsalve, Thom Vaughan, Damian Stewart, Malte Ostendorff, Idris Abdulmumin, Vukosi Marivate, Shamsuddeen Hassan Muhammad, Atnafu Lambebo Tonja, Hend Al-Khalifa, Nadia Ghezaiel Hammouda, Verrah Otiende, Tack Hwa Wong, Jakhongir Saydaliev, Melika Nobakhtian, Muhammad Ravi Shulthan Habibi, Chalamalasetti Kranti, Carol Muchemi, Khang Nguyen, Faisal Muhammad Adam, Luis Frentzen Salim, Reem Alqifari, Cynthia Amol, Joseph Marvin Imperial, Ilker Kesen, Ahmad Mustafid, Pavel Stepachev, Leshem Choshen, David Anugraha, Hamada Nayel, Seid Muhie Yimam, Vallerie Alexandra Putra, My Chiffon Nguyen, Azmine Toushik Wasi, Gouthami Vadithya, Rob van der Goot, Lanwenn ar C'horr, Karan Dua, Andrew Yates, Mithil Bangera, Yeshil Bangera, Hitesh Laxmichand Patel, Shu Okabe, Fenal Ashokbhai Ilasariya, Dmitry Gaynullin, Genta Indra Winata, Yiyuan Li, Juan Pablo Martínez, Amit Agarwal, Ikhlasul Akmal Hanif, Raia Abu Ahmad, Esther Adenuga, Filbert Aurelian Tjiaranata, Weerayut Buaphet, Michael Anugraha, Sowmya Vajjala, Benjamin Rice, Azril Hafizi Amirudin, Jesujoba O. Alabi, Srikant Panda, Yassine Toughrai, Bruhan Kyomuhendo, Daniel Ruffinelli, Akshata A, Manuel Goulão, Ej Zhou, Ingrid Gabriela Franco Ramirez, Cristina Aggazzotti, Konstantin Dobler, Jun Kevin, Quentin Pagès, Nicholas Andrews, Nuhu Ibrahim, Mattes Ruckdeschel, Amr Keleg, Mike Zhang, Casper Muziri, Saron Samuel, Sotaro Takeshita, Kun Kerdthaisong, Luca Foppiano, Rasul Dent, Tommaso Green, Ahmad Mustapha Wali, Kamohelo Makaaka, Vicky Feliren, Inshirah Idris, Hande Celikkanat, Abdulhamid Abubakar, Jean Maillard, Benoît Sagot, Thibault Clérice, Kenton Murray, Sarah Luger,
- Abstract summary: We introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain.<n>We show CommonLID's value by using it, alongside five other common evaluation sets, to test eight popular LID models.<n>We highlight that existing evaluations overestimate LID accuracy for many languages in the web domain.
- Score: 56.043078390377076
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID's value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.
Related papers
- What Language is This? Ask Your Tokenizer [32.28976119949841]
Language Identification (LID) is an important component of many multilingual natural language processing pipelines.<n>We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm.<n>Our formulation is data- and compute-efficient, supports incremental addition of new languages without retraining existing models.
arXiv Detail & Related papers (2026-02-19T18:58:39Z) - DIVERS-Bench: Evaluating Language Identification Across Domain Shifts and Code-Switching [8.14614722074297]
Language Identification (LID) is a core task in multilingual NLP, yet current systems often overfit to clean, monolingual data.<n>This work introduces DIVERS-BENCH, a comprehensive evaluation of state-of-the-art LID models across diverse domains.<n>Our findings reveal that while models achieve high accuracy on curated datasets, performance degrades sharply on noisy and informal inputs.
arXiv Detail & Related papers (2025-09-22T13:32:31Z) - ConLID: Supervised Contrastive Learning for Low-Resource Language Identification [14.504528263331075]
We propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages.<n>We show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2%.
arXiv Detail & Related papers (2025-06-18T09:35:33Z) - Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs)<n>Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability.<n>Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z) - Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages [15.203789021094982]
In large language models (LLMs), how are multiple languages learned and encoded?<n>We train sparse autoencoders on Llama-3-8B and Aya-23-8B, and demonstrate that abstract grammatical concepts are often encoded in feature directions shared across many languages.
arXiv Detail & Related papers (2025-01-10T21:18:21Z) - mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation [28.531581489405745]
mHumanEval is an extended benchmark supporting prompts in over 200 natural languages.<n>We provide expert human translations for 15 diverse natural languages (NLs)<n>We conclude by analyzing the multilingual code generation capabilities of state-of-the-art (SOTA) Code LLMs.
arXiv Detail & Related papers (2024-10-19T08:44:26Z) - Parrot: Multilingual Visual Instruction Tuning [66.65963606552839]
Existing methods typically align vision encoders with Multimodal Large Language Models (MLLMs) via supervised fine-tuning (SFT)<n>We propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level.<n>We introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions.
arXiv Detail & Related papers (2024-06-04T17:56:28Z) - One Law, Many Languages: Benchmarking Multilingual Legal Reasoning for Judicial Support [18.810320088441678]
This work introduces a novel NLP benchmark for the legal domain.
It challenges LLMs in five key dimensions: processing emphlong documents (up to 50K tokens), using emphdomain-specific knowledge (embodied in legal texts) and emphmultilingual understanding (covering five languages)
Our benchmark contains diverse datasets from the Swiss legal system, allowing for a comprehensive study of the underlying non-English, inherently multilingual legal system.
arXiv Detail & Related papers (2023-06-15T16:19:15Z) - Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks on evaluation code generation models: MBXP and Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z) - Learning Domain-Specialised Representations for Cross-Lingual Biomedical
Entity Linking [66.76141128555099]
We propose a novel cross-lingual biomedical entity linking task (XL-BEL)
We first investigate the ability of standard knowledge-agnostic as well as knowledge-enhanced monolingual and multilingual LMs beyond the standard monolingual English BEL task.
We then address the challenge of transferring domain-specific knowledge in resource-rich languages to resource-poor ones.
arXiv Detail & Related papers (2021-05-30T00:50:00Z) - X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained
Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages.
arXiv Detail & Related papers (2020-10-13T05:29:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.