A Collection of Question Answering Datasets for Norwegian
- URL: http://arxiv.org/abs/2501.11128v1
- Date: Sun, 19 Jan 2025 17:42:48 GMT
- Title: A Collection of Question Answering Datasets for Norwegian
- Authors: Vladislav Mikhailov, Petter Mæhlum, Victoria Ovedie Chruickshank Langø, Erik Velldal, Lilja Øvrelid
- Abstract summary: The data covers a wide range of skills and knowledge domains, including world knowledge, commonsense reasoning, truthfulness, and knowledge about Norway. Our datasets comprise over 10k question-answer pairs, created by native speakers. Most LMs perform better in Bokmål than Nynorsk, struggle most with commonsense reasoning, and are often untruthful in generating answers to questions.
- Score: 6.149436325733799
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a new suite of question answering datasets for Norwegian: NorOpenBookQA, NorCommonSenseQA, NorTruthfulQA, and NRK-Quiz-QA. The data covers a wide range of skills and knowledge domains, including world knowledge, commonsense reasoning, truthfulness, and knowledge about Norway. Covering both written standards of Norwegian, Bokmål and Nynorsk, our datasets comprise over 10k question-answer pairs, created by native speakers. We detail our dataset creation approach and present the results of evaluating 11 language models (LMs) in zero- and few-shot regimes. Most LMs perform better in Bokmål than Nynorsk, struggle most with commonsense reasoning, and are often untruthful in generating answers to questions. All our datasets and annotation materials are publicly available.
Related papers
- NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark [10.018089141563104]
NorEval consists of 24 high-quality human-created datasets.
It covers a broad spectrum of task categories targeting Norwegian language understanding and generation.
It focuses on both of the official written standards of the Norwegian language: Bokmål and Nynorsk.
arXiv Detail & Related papers (2025-04-10T13:44:55Z) - Benchmarking Abstractive Summarisation: A Dataset of Human-authored Summaries of Norwegian News Articles [8.083472758337559]
We introduce a dataset of high-quality human-authored summaries of news articles in Norwegian. The dataset is intended for benchmarking the abstractive summarisation capabilities of generative language models.
arXiv Detail & Related papers (2025-01-13T22:08:29Z) - Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA), for extractive question answering (EQA).
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z) - NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian [4.062031248854444]
Norwegian, spoken by only 5 million people, is under-represented in the most impressive breakthroughs in NLP.
To fill this gap, we compiled existing Norwegian datasets and pre-trained 4 Norwegian Open Language Models.
We find that the mainstream, English-dominated LM GPT-3.5 has limited capability in understanding the Norwegian context.
arXiv Detail & Related papers (2023-12-03T08:09:45Z) - Boosting Norwegian Automatic Speech Recognition [0.0]
We present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokmål and Nynorsk.
We compare the performance of models of varying sizes and pre-training approaches on multiple Norwegian speech datasets.
We improve the state of the art on the Norwegian Parliamentary Speech Corpus (NPSC) from a word error rate (WER) of 17.10% to 7.60%, with models achieving 5.81% for Bokmål and 11.54% for Nynorsk.
arXiv Detail & Related papers (2023-07-04T12:05:15Z) - NorQuAD: Norwegian Question Answering Dataset [0.03281128493853064]
The dataset consists of 4,752 manually created question-answer pairs.
We benchmark several multilingual and Norwegian monolingual language models on the dataset and compare them against human performance.
The dataset will be made freely available.
arXiv Detail & Related papers (2023-05-03T08:17:07Z) - The Nordic Pile: A 1.2TB Nordic Dataset for Language Modeling [5.687459576800633]
We curate a high-quality dataset consisting of 1.2TB of text in all of the major North Germanic languages.
This paper details our considerations and processes for collecting, cleaning, and filtering the dataset.
arXiv Detail & Related papers (2023-03-30T06:42:22Z) - Cross-Lingual Question Answering over Knowledge Base as Reading Comprehension [61.079852289005025]
Cross-lingual question answering over knowledge base (xKBQA) aims to answer questions in languages different from that of the provided knowledge base.
One of the major challenges facing xKBQA is the high cost of data annotation.
We propose a novel approach for xKBQA in a reading comprehension paradigm.
arXiv Detail & Related papers (2023-02-26T05:52:52Z) - Python Code Generation by Asking Clarification Questions [57.63906360576212]
In this work, we introduce a novel and more realistic setup for this task.
We hypothesize that the under-specification of a natural language description can be resolved by asking clarification questions.
We collect and introduce a new dataset named CodeClarQA containing pairs of natural language descriptions and code with created synthetic clarification questions and answers.
arXiv Detail & Related papers (2022-12-19T22:08:36Z) - NusaCrowd: Open Source Initiative for Indonesian NLP Resources [104.5381571820792]
NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages.
Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
arXiv Detail & Related papers (2022-12-19T17:28:22Z) - Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z) - NorDiaChange: Diachronic Semantic Change Dataset for Norwegian [63.65426535861836]
NorDiaChange is the first diachronic semantic change dataset for Norwegian.
It covers about 80 Norwegian nouns manually annotated with graded semantic change over time.
arXiv Detail & Related papers (2022-01-13T18:27:33Z) - IIRC: A Dataset of Incomplete Information Reading Comprehension Questions [53.3193258414806]
We present a dataset, IIRC, with more than 13K questions over paragraphs from English Wikipedia.
The questions were written by crowd workers who did not have access to any of the linked documents.
We follow recent modeling work on various reading comprehension datasets to construct a baseline model for this dataset.
arXiv Detail & Related papers (2020-11-13T20:59:21Z)