Related papers: The taggedPBC: Annotating a massive parallel corpus for crosslinguistic investigations

URL: http://arxiv.org/abs/2505.12560v2
Date: Sat, 09 Aug 2025 04:32:28 GMT
Title: The taggedPBC: Annotating a massive parallel corpus for crosslinguistic investigations
Authors: Hiram Ring
Abstract summary: This paper reports on a large tagged parallel dataset developed to partially address the limited language coverage of existing crosslinguistic datasets. The taggedPBC contains POS-tagged parallel text data from more than 1,940 languages, representing 155 language families and 78 isolates. The accuracy of particular tags in this dataset is shown to correlate well with both existing SOTA taggers for high-resource languages (SpaCy, Trankit) and hand-tagged corpora (Universal Dependencies Treebanks).
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Existing datasets available for crosslinguistic investigations have tended to focus on large amounts of data for a small group of languages or a small amount of data for a large number of languages. This means that claims based on these datasets are limited in what they reveal about universal properties of the human language faculty. While this has begun to change through the efforts of projects seeking to develop tagged corpora for a large number of languages, such efforts are still constrained by limits on resources. The current paper reports on a large tagged parallel dataset which has been developed to partially address this issue. The taggedPBC contains POS-tagged parallel text data from more than 1,940 languages, representing 155 language families and 78 isolates, dwarfing previously available resources. The accuracy of particular tags in this dataset is shown to correlate well with both existing SOTA taggers for high-resource languages (SpaCy, Trankit) as well as hand-tagged corpora (Universal Dependencies Treebanks). Additionally, a novel measure derived from this dataset, the N1 ratio, correlates with expert determinations of intransitive word order in three typological databases (WALS, Grambank, Autotyp) such that a Gaussian Naive Bayes classifier trained on this feature can accurately identify basic intransitive word order for languages not in those databases. While much work is still needed to expand and develop this dataset, the taggedPBC is an important step to enable corpus-based crosslinguistic investigations, and is made available for research and collaboration via GitHub.
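
The classification step mentioned in the abstract can be sketched as follows. This is a minimal, from-scratch Gaussian Naive Bayes over a single numeric feature, not the author's implementation. The function names, the "SV"/"VS" labels, and the toy feature values are all hypothetical placeholders; the abstract does not say how the N1 ratio is computed or which values correspond to which word order.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(values, labels):
    """Estimate a prior, mean, and variance per class for one feature."""
    grouped = defaultdict(list)
    for x, y in zip(values, labels):
        grouped[y].append(x)
    total = len(values)
    model = {}
    for label, xs in grouped.items():
        mean = sum(xs) / len(xs)
        # Small constant keeps the variance strictly positive.
        var = sum((x - mean) ** 2 for x in xs) / len(xs) + 1e-9
        model[label] = (len(xs) / total, mean, var)
    return model


def predict(model, x):
    """Return the class with the highest log posterior for value x."""
    def log_posterior(params):
        prior, mean, var = params
        log_likelihood = (
            -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        )
        return math.log(prior) + log_likelihood

    return max(model, key=lambda label: log_posterior(model[label]))


# Hypothetical toy data: these numbers are invented for illustration
# and are NOT taken from the taggedPBC or any typological database.
train_ratio = [0.2, 0.3, 0.25, 2.5, 3.0, 2.8]
train_order = ["VS", "VS", "VS", "SV", "SV", "SV"]

model = fit_gaussian_nb(train_ratio, train_order)
print(predict(model, 0.28))
print(predict(model, 2.7))
```

In practice a library implementation such as scikit-learn's GaussianNB would likely be used instead. The hand-written version only shows what the classifier does with one feature: it fits one normal distribution per word-order class and picks the class under which a new language's value is most probable.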

Related papers

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language [48.79534869177174]
We introduce a new pre-training dataset curation pipeline based on FineWeb. We show that our pipeline can be used to create non-English corpora that produce more performant models than prior datasets. We scale our pipeline to over 1000 languages using almost 100 Common Crawl snapshots to produce FineWeb2, a new 20 terabyte (5 billion document) multilingual dataset.
arXiv Detail & Related papers (2025-06-26T01:01:47Z)
Extending dependencies to the taggedPBC: Word order in transitive clauses [0.0]
This paper reports on a CoNLLU-formatted version of the dataset which transfers dependency information along with POS tags to all languages in the taggedPBC. The dependency-annotated corpora are also made available for research and collaboration via GitHub.
arXiv Detail & Related papers (2025-06-07T12:52:45Z)
From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora [85.44082712798553]
We introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. This dataset spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Experiments show that models trained on multiway parallel data consistently outperform those trained on unaligned multilingual data.
arXiv Detail & Related papers (2025-05-20T07:43:45Z)
ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships [0.0]
Natural Language Inference (NLI) serves as a crucial area within the domain of Natural Language Processing (NLP). This paper focuses on generating a multi-genre Spanish dataset for NLI, ESNLIR, particularly accounting for causal relationships. The findings indicate that greater genre diversity improves the model's ability to generalize.
arXiv Detail & Related papers (2025-03-11T18:32:16Z)
BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation [28.456351723077088]
This dataset is handcrafted in non-English languages first. Each of these source languages is represented among the 23 languages commonly used by half of the world's population.
arXiv Detail & Related papers (2025-02-06T18:56:37Z)
The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants. This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
Sinhala-English Parallel Word Dictionary Dataset [0.554780083433538]
We introduce three parallel English-Sinhala word dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) which help in multilingual Natural Language Processing (NLP) tasks related to English and Sinhala languages.
arXiv Detail & Related papers (2023-08-04T10:21:35Z)
XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems. We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages [40.01333053375582]
We aim to create a text classification dataset encompassing a large number of languages. We leverage parallel translations of the Bible to construct such a dataset. By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages.
arXiv Detail & Related papers (2023-05-15T09:43:32Z)
Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks. We introduce XSGD for cross-lingual alignment pretraining, a parallel and large-scale multilingual conversation dataset. To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia [0.0]
We present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named-entities. We describe in detail the developed procedure necessary to create this type of dataset in any language available on Wikipedia with DBpedia information.
arXiv Detail & Related papers (2022-12-14T11:38:48Z)
Multilingual Autoregressive Entity Linking [49.35994386221958]
mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking problem. For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token. We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
arXiv Detail & Related papers (2021-03-23T13:25:55Z)
XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages. We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT [60.9051207862378]
Multilingual BERT works remarkably well on cross-lingual transfer tasks. Data size and context window size are crucial factors for transferability. There is a computationally cheap but effective approach to improve the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z)
XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation [100.09099800591822]
XGLUE is a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models. XGLUE provides 11 diversified tasks that cover both natural language understanding and generation scenarios.
arXiv Detail & Related papers (2020-04-03T07:03:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.