Language Variety Identification with True Labels
- URL: http://arxiv.org/abs/2303.01490v1
- Date: Thu, 2 Mar 2023 18:51:58 GMT
- Title: Language Variety Identification with True Labels
- Authors: Marcos Zampieri, Kai North, Tommi Jauhiainen, Mariano Felice, Neha
Kumari, Nishant Nair, Yash Bangera
- Abstract summary: This paper presents DSL True Labels (DSL-TL), the first human-annotated multilingual dataset for language variety identification.
DSL-TL contains a total of 12,900 instances in Portuguese, split between European Portuguese and Brazilian Portuguese; Spanish, split between Argentine Spanish and Castilian Spanish; and English, split between American English and British English.
We trained multiple models to discriminate between these language varieties, and we present the results in detail.
- Score: 7.9815074811220175
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Language identification is an important first step in many IR and NLP
applications. Most publicly available language identification datasets,
however, are compiled under the assumption that the gold label of each instance
is determined by where texts are retrieved from. Research has shown that this
is a problematic assumption, particularly in the case of very similar languages
(e.g., Croatian and Serbian) and national language varieties (e.g., Brazilian
and European Portuguese), where texts may contain no distinctive marker of the
particular language or variety. To overcome this important limitation, this
paper presents DSL True Labels (DSL-TL), the first human-annotated multilingual
dataset for language variety identification. DSL-TL contains a total of 12,900
instances in Portuguese, split between European Portuguese and Brazilian
Portuguese; Spanish, split between Argentine Spanish and Castilian Spanish; and
English, split between American English and British English. We trained
multiple models to discriminate between these language varieties, and we
present the results in detail. The data and models presented in this paper
provide a reliable benchmark toward the development of robust and fairer
language variety identification systems. We make DSL-TL freely available to the
research community.
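The abstract says models were trained to discriminate between varieties but does not describe them here. As a rough illustration of what such a discriminator can look like, the following is a minimal sketch of a character n-gram naive Bayes classifier. It is not the authors' method: the class name, the `en-GB`/`en-US` label strings, and the four toy training sentences are all hypothetical and are not taken from DSL-TL.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max from the
    lowercased text, padded with one space on each side."""
    padded = f" {text.lower()} "
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


class NaiveBayesVarietyClassifier:
    """Multinomial naive Bayes over character n-grams, add-one smoothing.
    Hypothetical illustration; not a model from the paper."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.ngram_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.ngram_counts[label].update(char_ngrams(text))
        # Vocabulary is the union of n-grams seen in any class.
        self.vocab = set()
        for counts in self.ngram_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.label_counts):
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each known n-gram.
            score = math.log(self.label_counts[label] / total_docs)
            for gram in char_ngrams(text):
                if gram in self.vocab:  # n-grams never seen in training are skipped
                    score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented for this sketch, not from DSL-TL).
train_texts = [
    "the colour of the harbour",
    "my favourite neighbour",
    "the color of the harbor",
    "my favorite neighbor",
]
train_labels = ["en-GB", "en-GB", "en-US", "en-US"]

clf = NaiveBayesVarietyClassifier().fit(train_texts, train_labels)
print(clf.predict("what colour is it"))
print(clf.predict("what color is it"))
```

A two-way classifier like this always commits to one variety, even for a sentence with no distinctive marker. That is the situation the abstract identifies as problematic for labels assigned by text source.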
Related papers
- Enhancing Portuguese Variety Identification with Cross-Domain Approaches [2.31011809034817]
We develop a cross-domain language variety identifier (LVI) to discriminate between European and Brazilian Portuguese.
Although this research focuses on two Portuguese varieties, our contribution can be extended to other varieties and languages.
arXiv Detail & Related papers (2025-02-20T09:31:48Z)
- Multi-label Scandinavian Language Identification (SLIDE) [5.708847945003293]
We focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish.
We present the Scandinavian Language Identification and Evaluation, SLIDE, a manually curated multi-label evaluation dataset and a suite of LID models with varying speed-accuracy tradeoffs.
arXiv Detail & Related papers (2025-02-10T17:16:55Z)
- Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model [14.39119862985503]
We aim to create a multilingual ALT system with available datasets.
Inspired by architectures that have been proven effective for English ALT, we adapt these techniques to the multilingual scenario.
We evaluate the performance of the multilingual model in comparison to its monolingual counterparts.
arXiv Detail & Related papers (2024-06-25T15:02:32Z)
- MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
arXiv Detail & Related papers (2024-03-15T21:21:11Z)
- A Measure for Transparent Comparison of Linguistic Diversity in Multilingual NLP Data Sets [1.1647644386277962]
Typologically diverse benchmarks are increasingly created to track the progress achieved in multilingual NLP.
We propose assessing linguistic diversity of a data set against a reference language sample.
arXiv Detail & Related papers (2024-03-06T18:14:22Z)
- Lost in Translation, Found in Spans: Identifying Claims in Multilingual Social Media [40.26888469822391]
Claim span identification (CSI) is an important step in fact-checking pipelines.
Despite its importance to journalists and human fact-checkers, it remains a severely understudied problem.
We create a novel dataset, X-CLAIM, consisting of 7K real-world claims collected from numerous social media platforms in five Indian languages and English.
arXiv Detail & Related papers (2023-10-27T15:28:12Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS) and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)