Whispering in Norwegian: Navigating Orthographic and Dialectic Challenges
- URL: http://arxiv.org/abs/2402.01917v1
- Date: Fri, 2 Feb 2024 21:38:12 GMT
- Title: Whispering in Norwegian: Navigating Orthographic and Dialectic Challenges
- Authors: Per E Kummervold, Javier de la Rosa, Freddy Wetjen, Rolv-Arild Braaten, Per Erik Solberg
- Abstract summary: This article introduces NB-Whisper, an adaptation of OpenAI's Whisper, specifically fine-tuned for Norwegian language Automatic Speech Recognition (ASR). We highlight its key contributions and summarise the results achieved in converting spoken Norwegian into written forms and translating other languages into Norwegian.
- Score: 0.2984347156162651
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This article introduces NB-Whisper, an adaptation of OpenAI's Whisper,
specifically fine-tuned for Norwegian language Automatic Speech Recognition
(ASR). We highlight its key contributions and summarise the results achieved in
converting spoken Norwegian into written forms and translating other languages
into Norwegian. We show that we are able to improve the Norwegian Bokmål
transcription by OpenAI Whisper Large-v3 from a WER of 10.4 to 6.6 on the
Fleurs Dataset and from 6.8 to 2.2 on the NST dataset.
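The WER figures above can be reproduced in spirit with a minimal sketch like the one below, which transcribes a Norwegian audio clip with a fine-tuned Whisper checkpoint and scores it against a reference transcript. The model id "NbAiLab/nb-whisper-large", the audio path, and the reference string are assumptions for illustration, not details taken from the abstract; substitute the released checkpoint and your own evaluation data.

```python
# Minimal sketch: transcribe Norwegian audio with a fine-tuned Whisper model
# and compute word error rate (WER). Model id, audio path, and reference text
# below are placeholders/assumptions, not values from the paper.
from transformers import pipeline


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein edit distance over whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance between the two word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


# Assumed Hugging Face model id; replace with the actual NB-Whisper release.
asr = pipeline("automatic-speech-recognition", model="NbAiLab/nb-whisper-large")
result = asr("sample_norwegian.wav")          # placeholder audio file
reference = "forventet transkripsjon her"     # placeholder reference transcript
print(result["text"])
print("WER:", wer(reference, result["text"]))
```

In a full evaluation one would average the edit-distance counts over an entire test set (e.g. Fleurs or NST) rather than per utterance, but the metric itself is the same as the WER reported above.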
Related papers
- Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages [55.157295899188476]
Neural machine translation systems learn to map sentences of different languages into a common representation space.
In this work, we test whether this shared representation transfers to new languages by zero-shot translating from unseen languages.
We demonstrate that this setup enables zero-shot translation from entirely unseen languages.
arXiv Detail & Related papers (2024-08-05T07:58:58Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian [4.062031248854444]
Norwegian, spoken by only about 5 million people, is under-represented in the most impressive breakthroughs in NLP tasks.
To fill this gap, we compiled existing Norwegian datasets and pre-trained 4 Norwegian Open Language Models.
We find that the mainstream, English-dominated LM GPT-3.5 has limited capability in understanding the Norwegian context.
arXiv Detail & Related papers (2023-12-03T08:09:45Z)
- Boosting Norwegian Automatic Speech Recognition [0.0]
We present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokmål and Nynorsk.
We compare the performance of models of varying sizes and pre-training approaches on multiple Norwegian speech datasets.
We improve the state of the art on the Norwegian Parliamentary Speech Corpus (NPSC) from a word error rate (WER) of 17.10% to 7.60%, with models achieving 5.81% for Bokmål and 11.54% for Nynorsk.
arXiv Detail & Related papers (2023-07-04T12:05:15Z)
- NorQuAD: Norwegian Question Answering Dataset [0.03281128493853064]
The dataset consists of 4,752 manually created question-answer pairs.
We benchmark several multilingual and Norwegian monolingual language models on the dataset and compare them against human performance.
The dataset will be made freely available.
arXiv Detail & Related papers (2023-05-03T08:17:07Z)
- NorDiaChange: Diachronic Semantic Change Dataset for Norwegian [63.65426535861836]
NorDiaChange is the first diachronic semantic change dataset for Norwegian.
It covers about 80 Norwegian nouns manually annotated with graded semantic change over time.
arXiv Detail & Related papers (2022-01-13T18:27:33Z)
- BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation [53.55009917938002]
We propose to refine the mined bitexts via automatic editing.
Experiments demonstrate that our approach successfully improves the quality of CCMatrix mined bitext for 5 low-resource language-pairs and 10 translation directions by up to 8 BLEU points.
arXiv Detail & Related papers (2021-11-12T16:00:39Z)
- Multilingual Unsupervised Neural Machine Translation with Denoising Adapters [77.80790405710819]
We consider the problem of multilingual unsupervised machine translation, translating to and from languages that only have monolingual data.
For this problem, the standard procedure so far for leveraging the monolingual data is back-translation, which is computationally costly and hard to tune.
In this paper we propose instead to use denoising adapters, adapter layers with a denoising objective, on top of pre-trained mBART-50.
arXiv Detail & Related papers (2021-10-20T10:18:29Z)
- Large-Scale Contextualised Language Modelling for Norwegian [7.5722195869569]
This paper introduces the first large-scale monolingual language models for Norwegian, based on both the ELMo and BERT frameworks.
In addition to detailing the training process, we present contrastive benchmark results on a suite of NLP tasks for Norwegian.
arXiv Detail & Related papers (2021-04-13T23:18:04Z)
- Unsupervised Transfer Learning in Multilingual Neural Machine Translation with Cross-Lingual Word Embeddings [72.69253034282035]
We exploit a language independent multilingual sentence representation to easily generalize to a new language.
Blindly decoding from Portuguese using a base system containing several Romance languages, we achieve scores of 36.4 BLEU for Portuguese-English and 12.8 BLEU for Russian-English.
We explore a more practical adaptation approach through non-iterative backtranslation, exploiting our model's ability to produce high quality translations.
arXiv Detail & Related papers (2021-03-11T14:22:08Z)