Multi-label Scandinavian Language Identification (SLIDE)
- URL: http://arxiv.org/abs/2502.06692v1
- Date: Mon, 10 Feb 2025 17:16:55 GMT
- Title: Multi-label Scandinavian Language Identification (SLIDE)
- Authors: Mariia Fedorova, Jonas Sebulon Frydenberg, Victoria Handford, Victoria Ovedie Chruickshank Langø, Solveig Helene Willoch, Marthe Løken Midtgaard, Yves Scherrer, Petter Mæhlum, David Samuel
- Abstract summary: We focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish.
We present the Scandinavian Language Identification and Evaluation, SLIDE, a manually curated multi-label evaluation dataset and a suite of LID models with varying speed-accuracy tradeoffs.
- Score: 5.708847945003293
- License:
- Abstract: Identifying closely related languages at sentence level is difficult, in particular because it is often impossible to assign a sentence to a single language. In this paper, we focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish. We present the Scandinavian Language Identification and Evaluation, SLIDE, a manually curated multi-label evaluation dataset and a suite of LID models with varying speed-accuracy tradeoffs. We demonstrate that the ability to identify multiple languages simultaneously is necessary for any accurate LID method, and present a novel approach to training such multi-label LID models.
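The abstract's core point, that a single-label classifier cannot handle sentences valid in several Scandinavian languages at once, can be illustrated with a minimal sketch. The language codes, logits, and threshold below are illustrative inventions, not the SLIDE models:

```python
import math

# Hypothetical language inventory matching the paper's four targets.
LANGS = ["dan", "nob", "nno", "swe"]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def multilabel_lid(logits, threshold=0.5):
    """Return every language whose independent (sigmoid) probability
    clears the threshold. Unlike an argmax over softmax scores, this
    can return several labels for a sentence that is valid in more
    than one Scandinavian language, or none at all."""
    probs = {lang: sigmoid(z) for lang, z in zip(LANGS, logits)}
    return sorted(lang for lang, p in probs.items() if p >= threshold)

# A short sentence such as "Han har en hund." is valid Danish, Bokmål
# and Swedish simultaneously; a model might score several languages high.
print(multilabel_lid([2.1, 1.7, -1.3, 0.9]))
```

The key design choice is independent per-language probabilities rather than a single softmax distribution, so the labels are not forced to compete.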
Related papers
- Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
- Robust Open-Set Spoken Language Identification and the CU MultiLang Dataset [2.048226951354646]
Open-set spoken language identification systems can detect when an input exhibits none of the original languages.
We implement a novel approach to open-set spoken language identification that uses MFCC and pitch features.
We present a spoken language identification system that achieves 91.76% accuracy on trained languages and can adapt to unknown languages on the fly.
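The open-set idea described above, flagging an input when none of the trained languages fits, can be sketched as a simple confidence rejection rule. The languages, scores, and threshold here are hypothetical, not the CU MultiLang system:

```python
import math

# Hypothetical set of trained languages.
KNOWN_LANGS = ["eng", "spa", "cmn"]

def softmax(scores):
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def open_set_lid(scores, reject_threshold=0.7):
    """Return the best-scoring known language, or 'unknown' when even
    the top softmax probability falls below the rejection threshold,
    i.e. the input exhibits none of the trained languages."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < reject_threshold:
        return "unknown"
    return KNOWN_LANGS[best]

print(open_set_lid([4.0, 0.5, 0.2]))  # one clear winner
print(open_set_lid([1.0, 0.9, 0.8]))  # near-uniform scores
```

A real system would derive the scores from acoustic features (the paper uses MFCC and pitch features); the rejection rule itself is independent of the feature front-end.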
arXiv Detail & Related papers (2023-08-29T00:44:27Z)
- Transfer-Free Data-Efficient Multilingual Slot Labeling [82.02076369811402]
Slot labeling is a core component of task-oriented dialogue (ToD) systems.
To mitigate the inherent data scarcity issue, current research on multilingual ToD assumes that sufficient English-language annotated data are always available.
We propose a two-stage slot labeling approach (termed TWOSL) which transforms standard multilingual sentence encoders into effective slot labelers.
arXiv Detail & Related papers (2023-05-22T22:47:32Z)
- Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS) and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z)
- ScandEval: A Benchmark for Scandinavian Natural Language Processing [0.0]
This paper introduces a Scandinavian benchmarking platform, ScandEval, which can benchmark any pretrained model on four different tasks in the Scandinavian languages.
The datasets used in two of the tasks, linguistic acceptability and question answering, are new.
We develop and release a Python package and command-line interface, scandeval, which can benchmark any model that has been uploaded to the Hugging Face Hub, with reproducible results.
arXiv Detail & Related papers (2023-04-03T11:51:46Z)
- Language Variety Identification with True Labels [7.9815074811220175]
This paper presents DSL True Labels (DSL-TL), the first human-annotated multilingual dataset for language variety identification.
DSL-TL contains a total of 12,900 instances in Portuguese, split between European Portuguese and Brazilian Portuguese; Spanish, split between Argentine Spanish and Castilian Spanish; and English, split between American English and British English.
We trained multiple models to discriminate between these language varieties, and we present the results in detail.
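Discriminating between close varieties such as these typically leans on surface cues like character n-grams. Here is a toy sketch of that idea; the data and the overlap similarity are illustrative inventions, not the models trained on DSL-TL:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character trigram counts, with padding so word edges count too."""
    text = f"  {text.lower()}  "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def train_profiles(labelled):
    """Sum n-gram counts per variety label into a crude profile."""
    profiles = {}
    for label, text in labelled:
        profiles.setdefault(label, Counter()).update(char_ngrams(text))
    return profiles

def classify(text, profiles):
    """Pick the variety whose profile shares the most n-gram mass with
    the input (min of counts on shared n-grams, a crude similarity)."""
    grams = char_ngrams(text)
    def overlap(profile):
        return sum(min(c, profile[g]) for g, c in grams.items())
    return max(profiles, key=lambda label: overlap(profiles[label]))

# Toy training data built around spelling differences between varieties.
data = [
    ("en-GB", "the colour of the harbour centre"),
    ("en-US", "the color of the harbor center"),
]
profiles = train_profiles(data)
print(classify("a colour chart", profiles))
print(classify("a color chart", profiles))
```

With more data one would use a proper model (e.g. a linear classifier over n-gram features), but the n-gram signal itself is what makes varieties like European vs. Brazilian Portuguese separable from text alone.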
arXiv Detail & Related papers (2023-03-02T18:51:58Z)
- Multi-level Contrastive Learning for Cross-lingual Spoken Language Understanding [90.87454350016121]
We develop novel code-switching schemes to generate hard negative examples for contrastive learning at all levels.
We develop a label-aware joint model to leverage label semantics for cross-lingual knowledge transfer.
arXiv Detail & Related papers (2022-05-07T13:44:28Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- Call Larisa Ivanovna: Code-Switching Fools Multilingual NLU Models [1.827510863075184]
Novel benchmarks for multilingual natural language understanding (NLU) include monolingual sentences in several languages, annotated with intents and slots.
Existing benchmarks lack code-switched utterances, which are difficult to gather and label due to the complexity of their grammatical structure.
Our work adopts recognized methods to generate plausible and naturally-sounding code-switched utterances and uses them to create a synthetic code-switched test set.
arXiv Detail & Related papers (2021-09-29T11:15:00Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.