A Collection of Question Answering Datasets for Norwegian
- URL: http://arxiv.org/abs/2501.11128v1
- Date: Sun, 19 Jan 2025 17:42:48 GMT
- Title: A Collection of Question Answering Datasets for Norwegian
- Authors: Vladislav Mikhailov, Petter Mæhlum, Victoria Ovedie Chruickshank Langø, Erik Velldal, Lilja Øvrelid,
- Abstract summary: The data covers a wide range of skills and knowledge domains, including world knowledge, commonsense reasoning, truthfulness, and knowledge about Norway.
Our datasets comprise over 10k question-answer pairs, created by native speakers.
Most LMs perform better in Bokmål than Nynorsk, struggle most with commonsense reasoning, and are often untruthful in generating answers to questions.
- Score: 6.149436325733799
- License:
- Abstract: This paper introduces a new suite of question answering datasets for Norwegian: NorOpenBookQA, NorCommonSenseQA, NorTruthfulQA, and NRK-Quiz-QA. The data covers a wide range of skills and knowledge domains, including world knowledge, commonsense reasoning, truthfulness, and knowledge about Norway. Covering both of the written standards of Norwegian - Bokmål and Nynorsk - our datasets comprise over 10k question-answer pairs, created by native speakers. We detail our dataset creation approach and present the results of evaluating 11 language models (LMs) in zero- and few-shot regimes. Most LMs perform better in Bokmål than Nynorsk, struggle most with commonsense reasoning, and are often untruthful in generating answers to questions. All our datasets and annotation materials are publicly available.
Related papers
- Benchmarking Abstractive Summarisation: A Dataset of Human-authored Summaries of Norwegian News Articles [8.083472758337559]
We introduce a dataset of high-quality human-authored summaries of news articles in Norwegian.
The dataset is intended for benchmarking the abstractive summarisation capabilities of generative language models.
arXiv Detail & Related papers (2025-01-13T22:08:29Z)
- Small Languages, Big Models: A Study of Continual Training on Languages of Norway [11.548845014405984]
Training large language models requires vast amounts of data.
We present a novel three-stage continual training approach that substantially improves the downstream performance.
We release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.
arXiv Detail & Related papers (2024-12-09T13:34:23Z)
- From Multiple-Choice to Extractive QA: A Case Study for English and Arabic [51.13706104333848]
We explore the feasibility of repurposing an existing multilingual dataset for a new NLP task.
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic.
We aim to help others adapt our approach for the remaining 120 BELEBELE language variants, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
- NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian [4.062031248854444]
Norwegian, spoken by only about 5 million people, is under-represented in the most impressive breakthroughs in NLP tasks.
To fill this gap, we compiled the existing Norwegian dataset and pre-trained 4 Norwegian Open Language Models.
We find that the mainstream, English-dominated LM GPT-3.5 has limited capability in understanding the Norwegian context.
arXiv Detail & Related papers (2023-12-03T08:09:45Z)
- Boosting Norwegian Automatic Speech Recognition [0.0]
We present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokmål and Nynorsk.
We compare the performance of models of varying sizes and pre-training approaches on multiple Norwegian speech datasets.
We improve the state of the art on the Norwegian Parliamentary Speech Corpus (NPSC) from a word error rate (WER) of 17.10% to 7.60%, with models achieving 5.81% for Bokmål and 11.54% for Nynorsk.
arXiv Detail & Related papers (2023-07-04T12:05:15Z)
- NorQuAD: Norwegian Question Answering Dataset [0.03281128493853064]
The dataset consists of 4,752 manually created question-answer pairs.
We benchmark several multilingual and Norwegian monolingual language models on the dataset and compare them against human performance.
The dataset will be made freely available.
arXiv Detail & Related papers (2023-05-03T08:17:07Z)
- Cross-Lingual Question Answering over Knowledge Base as Reading Comprehension [61.079852289005025]
Cross-lingual question answering over knowledge base (xKBQA) aims to answer questions in languages different from that of the provided knowledge base.
One of the major challenges facing xKBQA is the high cost of data annotation.
We propose a novel approach for xKBQA in a reading comprehension paradigm.
arXiv Detail & Related papers (2023-02-26T05:52:52Z)
- NusaCrowd: Open Source Initiative for Indonesian NLP Resources [104.5381571820792]
NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages.
Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
arXiv Detail & Related papers (2022-12-19T17:28:22Z)
- Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
- NorDiaChange: Diachronic Semantic Change Dataset for Norwegian [63.65426535861836]
NorDiaChange is the first diachronic semantic change dataset for Norwegian.
It covers about 80 Norwegian nouns manually annotated with graded semantic change over time.
arXiv Detail & Related papers (2022-01-13T18:27:33Z)
- IIRC: A Dataset of Incomplete Information Reading Comprehension Questions [53.3193258414806]
We present a dataset, IIRC, with more than 13K questions over paragraphs from English Wikipedia.
The questions were written by crowd workers who did not have access to any of the linked documents.
We follow recent modeling work on various reading comprehension datasets to construct a baseline model for this dataset.
arXiv Detail & Related papers (2020-11-13T20:59:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.