FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition
- URL: http://arxiv.org/abs/2512.13884v1
- Date: Mon, 15 Dec 2025 20:36:39 GMT
- Title: FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition
- Authors: Jonas Golde, Patrick Haller, Alan Akbik
- Abstract summary: We introduce FiNERweb, a dataset-creation pipeline that scales the teacher-student paradigm to 91 languages and 25 scripts. Building on FineWeb-Edu, our approach trains regression models to identify NER-relevant passages and annotates them with multilingual LLMs. Our experiments show that the regression model achieves more than 84 F1, and that models trained on FiNERweb obtain comparable or improved performance in zero-shot transfer settings.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent multilingual named entity recognition (NER) work has shown that large language models (LLMs) can provide effective synthetic supervision, yet such datasets have mostly appeared as by-products of broader experiments rather than as systematic, reusable resources. We introduce FiNERweb, a dataset-creation pipeline that scales the teacher-student paradigm to 91 languages and 25 scripts. Building on FineWeb-Edu, our approach trains regression models to identify NER-relevant passages and annotates them with multilingual LLMs, resulting in about 225k passages with 235k distinct entity labels. Our experiments show that the regression model achieves more than 84 F1, and that models trained on FiNERweb obtain comparable or improved performance in zero-shot transfer settings on English, Thai, and Swahili, despite being trained on 19x less data than strong baselines. In addition, we assess annotation quality using LLM-as-a-judge and observe consistently high scores for both faithfulness (3.99 out of 5) and completeness (4.05 out of 5), indicating reliable and informative annotations. Further, we release the dataset with both English labels and translated label sets in the respective target languages because we observe that the performance of current state-of-the-art models drops by 0.02 to 0.09 F1 when evaluated using target-language labels instead of English ones. We release FiNERweb together with all accompanying artifacts to the research community in order to facilitate more effective student-teacher training for multilingual named entity recognition.
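To make the described pipeline concrete, the sketch below shows a minimal selection-and-annotation loop in the spirit of the abstract: score passages for NER relevance, keep those above a threshold, and collect teacher annotations. The scorer, threshold, and teacher interface are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a FiNERweb-style dataset-creation loop; the regression
# scorer, relevance threshold, and LLM teacher below are hypothetical
# placeholders, not the components released by the authors.
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    lang: str  # one of the supported languages/scripts


def relevance_score(passage: Passage) -> float:
    """Placeholder for the regression model that predicts how likely a
    FineWeb-Edu passage is to contain annotatable named entities."""
    raise NotImplementedError


def teacher_annotate(passage: Passage) -> list[dict]:
    """Placeholder for prompting a multilingual LLM teacher; expected to
    return spans such as [{"span": "Helsinki", "label": "location"}]."""
    raise NotImplementedError


def build_dataset(passages: list[Passage], threshold: float = 0.5) -> list[dict]:
    dataset = []
    for p in passages:
        if relevance_score(p) >= threshold:  # keep NER-relevant passages only
            dataset.append({
                "text": p.text,
                "lang": p.lang,
                "entities": teacher_annotate(p),  # synthetic supervision
            })
    return dataset
```

A student NER model would then be fine-tuned on the resulting records, optionally with the entity labels translated into the respective target language, as the released dataset provides both label sets.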
Related papers
- Apertus: Democratizing Open and Compliant LLMs for Global Language Environments [163.70368742538187]
Apertus is a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem. Apertus models are pretrained exclusively on openly available data, respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with 40% of pretraining data allocated to non-English content.
arXiv Detail & Related papers (2025-09-17T17:59:21Z)
- LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models Using in-the-wild Data [5.021795689551854]
LESS (Large Language Model Enhanced Semi-supervised Learning) is a versatile framework that uses Large Language Models (LLMs) to correct pseudo-labels generated on in-the-wild data. Across Mandarin ASR and Spanish-to-English AST evaluations, LESS delivers consistent gains. We have released the recipe as open source to facilitate further research in this area.
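As a rough illustration of this idea, the following sketch shows an LLM-based pseudo-label correction loop; the ASR front end, correction step, and confidence filter are hypothetical placeholders rather than the exact LESS recipe.

```python
# Illustrative pseudo-label correction loop in the spirit of LESS; all three
# helper functions are assumptions for the sketch, not the released recipe.
def asr_transcribe(audio) -> str:
    """Placeholder: first-pass transcription of unlabeled in-the-wild audio."""
    raise NotImplementedError


def llm_correct(hypothesis: str) -> str:
    """Placeholder: an LLM repairs likely recognition errors in the hypothesis."""
    raise NotImplementedError


def confidence(audio, transcript: str) -> float:
    """Placeholder: a data-filtering score for the corrected pair."""
    raise NotImplementedError


def build_pseudo_labels(unlabeled_audio, keep_threshold: float = 0.9):
    pairs = []
    for audio in unlabeled_audio:
        corrected = llm_correct(asr_transcribe(audio))  # LLM-corrected pseudo-label
        if confidence(audio, corrected) >= keep_threshold:
            pairs.append((audio, corrected))  # kept for semi-supervised training
    return pairs
```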
arXiv Detail & Related papers (2025-06-05T03:00:04Z)
- Enhancing Multilingual LLM Pretraining with Model-Based Data Selection [33.68104398807581]
We propose a model-based filtering framework for multilingual datasets. Our approach emphasizes transparency, simplicity, and efficiency. We extend our framework to 20 languages for which we release the refined pretraining datasets.
arXiv Detail & Related papers (2025-02-14T18:42:07Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- Translation and Fusion Improves Zero-shot Cross-lingual Information Extraction [18.926993352330797]
We propose TransFusion, a framework in which models are fine-tuned to use English translations of low-resource language data.
GoLLIE-TF, a cross-lingual instruction-tuned LLM for IE tasks, is designed to close the performance gap between high- and low-resource languages.
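A minimal sketch of how such translate-and-fuse input construction might look is given below; the translator, English tagger, and fused prompt format are assumptions for illustration, not the paper's exact design.

```python
# Hypothetical translate-and-fuse input construction in the spirit of
# TransFusion; translate_to_english and tag_english are placeholders.
def translate_to_english(text: str) -> str:
    raise NotImplementedError


def tag_english(text: str) -> list[tuple[str, str]]:
    """Placeholder English NER; returns (span, label) pairs."""
    raise NotImplementedError


def fused_input(target_sentence: str) -> str:
    english = translate_to_english(target_sentence)
    hints = "; ".join(f"{span} = {label}" for span, label in tag_english(english))
    # The fine-tuned model sees the original sentence plus the English
    # annotation as extra context when labeling the low-resource text.
    return (f"Sentence: {target_sentence}\n"
            f"English translation: {english}\n"
            f"English entities: {hints}")
```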
arXiv Detail & Related papers (2023-05-23T01:23:22Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
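The snippet below sketches one common way to implement such labeled sequence translation with inline markers: wrap tagged spans in the source sentence, translate the marked sequence, and read the markers back off the translation. The marker format and regular expression are illustrative assumptions, not necessarily CROP's exact scheme.

```python
# Marker-based label projection sketch; the bracket markers and regex are
# illustrative assumptions, and `translate` stands in for any MT system.
import re


def mark_entities(tokens: list[str], tags: list[str]) -> str:
    """Wrap BIO-tagged spans with bracket markers, e.g. '[PER Barack Obama ]'."""
    out, i = [], 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            label = tags[i][2:]
            j = i + 1
            while j < len(tokens) and tags[j] == f"I-{label}":
                j += 1
            out.append(f"[{label} " + " ".join(tokens[i:j]) + " ]")
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


def project(tokens: list[str], tags: list[str], translate) -> list[tuple[str, str]]:
    """Translate the marked sentence and recover (label, target span) pairs."""
    translated = translate(mark_entities(tokens, tags))  # labeled sequence translation
    return re.findall(r"\[(\w+) (.+?) \]", translated)
```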
arXiv Detail & Related papers (2022-10-13T13:32:36Z)
- A Dual-Contrastive Framework for Low-Resource Cross-Lingual Named Entity Recognition [5.030581940990434]
Cross-lingual Named Entity Recognition (NER) has recently become a research hotspot because it can alleviate data scarcity for low-resource languages.
In this paper, we describe our novel dual-contrastive framework, ConCNER, for cross-lingual NER in the scenario of limited source-language labeled data.
arXiv Detail & Related papers (2022-04-02T07:59:13Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks, such as translation, and monolingual tasks, such as masked language modeling.
Our model achieves improvements of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 points over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
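For intuition, a generic sentence-level alignment objective over parallel data might look like the sketch below; this is a standard in-batch contrastive loss, not necessarily AMBER's exact formulation of its two objectives.

```python
# Generic sentence-level alignment loss over parallel sentence pairs (an
# illustration only, not AMBER's exact objective); uses in-batch negatives.
import torch
import torch.nn.functional as F


def sentence_alignment_loss(src: torch.Tensor, tgt: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """src, tgt: (batch, dim) pooled encoder outputs of parallel sentences;
    the i-th source sentence should score highest against the i-th target."""
    logits = F.normalize(src, dim=-1) @ F.normalize(tgt, dim=-1).T
    targets = torch.arange(src.size(0), device=src.device)
    return F.cross_entropy(logits / temperature, targets)
```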
arXiv Detail & Related papers (2020-10-15T18:34:13Z)