ScandEval: A Benchmark for Scandinavian Natural Language Processing
- URL: http://arxiv.org/abs/2304.00906v1
- Date: Mon, 3 Apr 2023 11:51:46 GMT
- Title: ScandEval: A Benchmark for Scandinavian Natural Language Processing
- Authors: Dan Saattrup Nielsen
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a Scandinavian benchmarking platform, ScandEval, which
can benchmark any pretrained model on four different tasks in the Scandinavian
languages. The datasets used in two of the tasks, linguistic acceptability and
question answering, are new. We develop and release a Python package and
command-line interface, scandeval, which can benchmark any model that has been
uploaded to the Hugging Face Hub, with reproducible results. Using this
package, we benchmark more than 100 Scandinavian or multilingual models and
present the results of these in an interactive online leaderboard, as well as
provide an analysis of the results. The analysis shows that there is
substantial cross-lingual transfer among the Mainland Scandinavian languages
(Danish, Swedish and Norwegian), with limited cross-lingual transfer between
the group of Mainland Scandinavian languages and the group of Insular
Scandinavian languages (Icelandic and Faroese). The benchmarking results also
show that the investment in language technology in Norway, Sweden and Denmark
has led to language models that outperform massively multilingual models such
as XLM-RoBERTa and mDeBERTaV3. We release the source code for both the package
and leaderboard.
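To make the abstract's workflow concrete, the following is a minimal sketch of benchmarking a Hub-hosted model with the scandeval Python package. The Benchmarker class follows the package's documented usage, but the exact call signature, default settings, and the example model ID should all be treated as assumptions to check against the current documentation.

    # Minimal sketch: benchmarking a Hugging Face Hub model with scandeval.
    # Install first with `pip install scandeval`. The Benchmarker interface
    # shown here is an assumption based on the package's documented usage.
    from scandeval import Benchmarker

    # By default the benchmarker runs the full Scandinavian task suite
    # (e.g. named entity recognition, linguistic acceptability, question
    # answering) and is designed to produce reproducible scores.
    benchmarker = Benchmarker()

    # Any model ID on the Hugging Face Hub can be passed; the model below
    # is purely illustrative.
    results = benchmarker("vesteinn/ScandiBERT")
    print(results)

The package also installs a command-line entry point, scandeval, that wraps the same functionality; its flags have varied across versions, so consult scandeval --help for the current interface.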
Related papers
- SWEb: A Large Web Dataset for the Scandinavian Languages [11.41086713693524]
This paper presents the largest pretraining dataset for the Scandinavian languages: the Scandinavian WEb (SWEb).
We introduce a novel model-based text extractor that significantly reduces complexity in comparison with rule-based approaches.
We also introduce a new cloze-style benchmark for evaluating language models in Swedish, and use this test to compare models trained on the SWEb data to models trained on FineWeb, with competitive results.
arXiv Detail & Related papers (2024-10-06T11:55:15Z)
- Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages [55.157295899188476]
Neural machine translation systems learn to map sentences of different languages into a common representation space.
In this work, we test this hypothesis by translating from languages that were never seen during training.
We demonstrate that this setup enables zero-shot translation from entirely unseen languages.
arXiv Detail & Related papers (2024-08-05T07:58:58Z)
- Evaluating Large Language Models with Human Feedback: Establishing a Swedish Benchmark [0.0]
Large language models (LLMs) have demonstrated significant capabilities across numerous applications.
This study introduces a comprehensive human benchmark to assess the efficacy of prominent LLMs in understanding and generating Swedish language texts.
arXiv Detail & Related papers (2024-05-22T21:22:51Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Cross-Lingual Knowledge Distillation for Answer Sentence Selection in Low-Resource Languages [90.41827664700847]
We propose Cross-Lingual Knowledge Distillation (CLKD) from a strong English answer sentence selection (AS2) teacher as a method to train AS2 models for low-resource languages (see the generic distillation sketch after this list).
To evaluate our method, we introduce 1) Xtr-WikiQA, a translation-based WikiQA dataset for 9 additional languages, and 2) TyDi-AS2, a multilingual AS2 dataset with over 70K questions spanning 8 typologically diverse languages.
arXiv Detail & Related papers (2023-05-25T17:56:04Z)
- Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese, Faroese datasets for named entity recognition (NER) and semantic text similarity (STS), and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
This allows us to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- A Warm Start and a Clean Crawled Corpus -- A Recipe for Good Language Models [0.0]
We train several language models for Icelandic, including IceBERT, which achieve state-of-the-art performance in a variety of downstream tasks.
We introduce a new corpus of Icelandic text, the Icelandic Common Crawl Corpus (IC3), a collection of high-quality texts found online by targeting the Icelandic top-level domain (TLD).
We show that a properly cleaned crawled corpus is sufficient to achieve state-of-the-art results in NLP applications for low to medium resource languages.
arXiv Detail & Related papers (2022-01-14T18:45:31Z)
- Operationalizing a National Digital Library: The Case for a Norwegian Transformer Model [0.0]
We show the process of building a large-scale training set from digital and digitized collections at a national library.
The resulting Bidirectional Encoder Representations from Transformers (BERT)-based language model for Norwegian outperforms multilingual BERT (mBERT) models in several token and sequence classification tasks.
arXiv Detail & Related papers (2021-04-19T20:36:24Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation [14.116412358534442]
Methods for improving neural machine translation for low-resource languages are reviewed.
Tests are carried out on three artificially restricted translation tasks and one real-world task.
Experiments show positive effects, especially for scheduled multi-task learning, denoising autoencoders, and subword sampling.
arXiv Detail & Related papers (2020-04-08T14:19:05Z)
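As referenced in the Cross-Lingual Knowledge Distillation entry above, the sketch below shows the generic soft-label distillation loss that underlies teacher-student training of this kind. It is not the paper's exact CLKD recipe; the function name, tensor shapes, and temperature value are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          temperature: float = 2.0) -> torch.Tensor:
        # Soften both distributions so the student also learns the teacher's
        # relative preferences among candidates, not just its top choice.
        student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
        teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
        # KL(teacher || student); the T^2 factor keeps gradient magnitudes
        # comparable across temperature settings.
        return F.kl_div(student_log_probs, teacher_probs,
                        reduction="batchmean") * temperature ** 2

    # Illustrative usage: in cross-lingual answer sentence selection, the
    # English teacher scores (question, candidate) pairs and the student is
    # trained to match those scores on the corresponding multilingual inputs.
    student_logits = torch.randn(8, 2)  # hypothetical student scores
    teacher_logits = torch.randn(8, 2)  # hypothetical teacher scores
    loss = distillation_loss(student_logits, teacher_logits)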