Whispering in Norwegian: Navigating Orthographic and Dialectic
Challenges
- URL: http://arxiv.org/abs/2402.01917v1
- Date: Fri, 2 Feb 2024 21:38:12 GMT
- Title: Whispering in Norwegian: Navigating Orthographic and Dialectic
Challenges
- Authors: Per E Kummervold, Javier de la Rosa, Freddy Wetjen, Rolv-Arild
Braaten, Per Erik Solberg
- Abstract summary: This article introduces NB-Whisper, an adaptation of OpenAI's Whisper, specifically fine-tuned for Norwegian language Automatic Speech Recognition (ASR).
We highlight its key contributions and summarise the results achieved in converting spoken Norwegian into written forms and translating other languages into Norwegian.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This article introduces NB-Whisper, an adaptation of OpenAI's Whisper,
specifically fine-tuned for Norwegian language Automatic Speech Recognition
(ASR). We highlight its key contributions and summarise the results achieved in
converting spoken Norwegian into written forms and translating other languages
into Norwegian. We show that we are able to improve the Norwegian Bokmål
transcription by OpenAI Whisper Large-v3 from a WER of 10.4 to 6.6 on the
Fleurs Dataset and from 6.8 to 2.2 on the NST dataset.
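The word error rate (WER) figures above are the standard ASR metric: the word-level edit distance between the system's transcript and a reference transcript, divided by the number of reference words. A minimal sketch of the computation (an illustration, not the paper's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, an insertion of one extra word against a three-word reference gives a WER of 1/3; the reported drop from 10.4 to 6.6 corresponds to this ratio expressed as a percentage over a whole test set.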
Related papers
- A Collection of Question Answering Datasets for Norwegian [6.149436325733799]
The data covers a wide range of skills and knowledge domains, including world knowledge, commonsense reasoning, truthfulness, and knowledge about Norway.
Our datasets comprise over 10k question-answer pairs, created by native speakers.
Most LMs perform better in Bokmål than Nynorsk, struggle most with commonsense reasoning, and are often untruthful in generating answers to questions.
arXiv Detail & Related papers (2025-01-19T17:42:48Z) - Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling [50.62091603179394]
Whisper, one of the most advanced ASR models, handles 99 languages effectively.
However, Whisper struggles with unseen languages, those not included in its pre-training.
We propose methods that exploit these relationships to enhance ASR performance on unseen languages.
arXiv Detail & Related papers (2024-12-21T04:05:43Z) - Small Languages, Big Models: A Study of Continual Training on Languages of Norway [11.548845014405984]
Training large language models requires vast amounts of data.
We present a novel three-stage continual training approach that substantially improves the downstream performance.
We release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.
arXiv Detail & Related papers (2024-12-09T13:34:23Z) - A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z) - NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian [4.062031248854444]
Norwegian, spoken by only about 5 million people, is under-represented in the most impressive breakthroughs in NLP.
To fill this gap, we compiled existing Norwegian datasets and pre-trained four Norwegian open language models.
We find that the mainstream, English-dominated LM GPT-3.5 has limited capability in understanding the Norwegian context.
arXiv Detail & Related papers (2023-12-03T08:09:45Z) - Boosting Norwegian Automatic Speech Recognition [0.0]
We present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokmål and Nynorsk.
We compare the performance of models of varying sizes and pre-training approaches on multiple Norwegian speech datasets.
We improve the state of the art on the Norwegian Parliamentary Speech Corpus (NPSC) from a word error rate (WER) of 17.10% to 7.60%, with models achieving 5.81% for Bokmål and 11.54% for Nynorsk.
arXiv Detail & Related papers (2023-07-04T12:05:15Z) - NorDiaChange: Diachronic Semantic Change Dataset for Norwegian [63.65426535861836]
NorDiaChange is the first diachronic semantic change dataset for Norwegian.
It covers about 80 Norwegian nouns manually annotated with graded semantic change over time.
arXiv Detail & Related papers (2022-01-13T18:27:33Z) - BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine
Translation [53.55009917938002]
We propose to refine the mined bitexts via automatic editing.
Experiments demonstrate that our approach successfully improves the quality of CCMatrix mined bitext for 5 low-resource language-pairs and 10 translation directions by up to 8 BLEU points.
arXiv Detail & Related papers (2021-11-12T16:00:39Z) - Multilingual Unsupervised Neural Machine Translation with Denoising
Adapters [77.80790405710819]
We consider the problem of multilingual unsupervised machine translation, translating to and from languages that only have monolingual data.
For this problem the standard procedure so far to leverage the monolingual data is back-translation, which is computationally costly and hard to tune.
In this paper we propose instead to use denoising adapters, adapter layers with a denoising objective, on top of pre-trained mBART-50.
arXiv Detail & Related papers (2021-10-20T10:18:29Z) - Large-Scale Contextualised Language Modelling for Norwegian [7.5722195869569]
This paper introduces the first large-scale monolingual language models for Norwegian, based on both the ELMo and BERT frameworks.
In addition to detailing the training process, we present contrastive benchmark results on a suite of NLP tasks for Norwegian.
arXiv Detail & Related papers (2021-04-13T23:18:04Z) - Unsupervised Transfer Learning in Multilingual Neural Machine
Translation with Cross-Lingual Word Embeddings [72.69253034282035]
We exploit a language independent multilingual sentence representation to easily generalize to a new language.
Blindly decoding from Portuguese using a base system containing several Romance languages, we achieve scores of 36.4 BLEU for Portuguese-English and 12.8 BLEU for Russian-English.
We explore a more practical adaptation approach through non-iterative backtranslation, exploiting our model's ability to produce high quality translations.
arXiv Detail & Related papers (2021-03-11T14:22:08Z)
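Several of the entries above report BLEU scores for machine translation. As a rough illustration (not the evaluation code used in these papers), sentence-level BLEU combines clipped n-gram precisions with a brevity penalty:

```python
from collections import Counter
from math import exp, log

def bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n), scaled by a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        # Clip each hypothesis n-gram count by its count in the reference
        overlap = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
        precisions.append(overlap / max(sum(hyp_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: punish hypotheses shorter than the reference
    bp = 1.0 if len(hyp) > len(ref) else exp(1 - len(ref) / max(len(hyp), 1))
    return bp * exp(sum(log(p) for p in precisions) / max_n)
```

Corpus-level BLEU, as reported in the papers above, aggregates n-gram counts over all sentences before taking the geometric mean; the "+8 BLEU" style improvements refer to that corpus-level score scaled to 0-100.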
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.