Finnish Parliament ASR corpus - Analysis, benchmarks and statistics
- URL: http://arxiv.org/abs/2203.14876v1
- Date: Mon, 28 Mar 2022 16:29:49 GMT
- Title: Finnish Parliament ASR corpus - Analysis, benchmarks and statistics
- Authors: Anja Virkkunen and Aku Rouhe and Nhan Phan and Mikko Kurimo
- Abstract summary: The Finnish parliament ASR corpus is the largest publicly available collection of manually transcribed speech data for Finnish, with over 3000 hours of speech and 449 speakers.
This corpus builds on earlier initial work, and as a result the corpus has a natural split into two training subsets from two periods of time.
We develop a complete Kaldi-based data preparation pipeline, and hidden Markov model (HMM), hybrid deep neural network (HMM-DNN) and attention-based encoder-decoder (AED) ASR recipes.
- Score: 11.94655679070282
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Public sources like parliament meeting recordings and transcripts provide
ever-growing material for the training and evaluation of automatic speech
recognition (ASR) systems. In this paper, we publish and analyse the Finnish
parliament ASR corpus, the largest publicly available collection of manually
transcribed speech data for Finnish with over 3000 hours of speech and 449
speakers for which it provides rich demographic metadata. This corpus builds on
earlier initial work, and as a result the corpus has a natural split into two
training subsets from two periods of time. Similarly, there are two official,
corrected test sets covering different times, setting an ASR task with
longitudinal distribution-shift characteristics. An official development set is
also provided. We develop a complete Kaldi-based data preparation pipeline, and
hidden Markov model (HMM), hybrid deep neural network (HMM-DNN) and
attention-based encoder-decoder (AED) ASR recipes. We set benchmarks on the
official test sets, as well as multiple other recently used test sets. Both
temporal corpus subsets are already large, and we observe that beyond their
scale, ASR performance on the official test sets plateaus, whereas other
domains benefit from added data. The HMM-DNN and AED approaches are compared in
a carefully matched equal data setting, with the HMM-DNN system consistently
performing better. Finally, the variation of the ASR accuracy is compared
between the speaker categories available in the parliament metadata to detect
potential biases based on factors such as gender, age, and education.
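The per-category comparison described at the end of the abstract can be illustrated with a minimal sketch. It assumes word error rate (WER) as the accuracy measure and pools errors over all utterances in a category; the abstract does not state which metric or tooling the authors used. The record fields (`ref`, `hyp`, `gender`), the function names and the three example utterances are hypothetical and are not taken from the corpus.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_group(utterances, key):
    """Pooled WER per category: total word errors / total reference words."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for utt in utterances:
        ref = utt["ref"].split()
        hyp = utt["hyp"].split()
        errors[utt[key]] += edit_distance(ref, hyp)
        words[utt[key]] += len(ref)
    return {g: errors[g] / words[g] for g in errors if words[g] > 0}


# Made-up records for illustration only (not corpus data).
utterances = [
    {"gender": "F", "ref": "arvoisa puhemies kiitos", "hyp": "arvoisa puhemies kiitos"},
    {"gender": "F", "ref": "hyvät edustajat", "hyp": "hyvä edustajat"},
    {"gender": "M", "ref": "arvoisa puhemies", "hyp": "arvoisa puhe mies"},
]
result = wer_by_group(utterances, "gender")
print(result)  # {'F': 0.2, 'M': 1.0}
```

Pooling errors before dividing weights long utterances more heavily than averaging per-utterance WER would. A gap between categories in such a table is only a starting point for bias analysis, since group sizes and speaking conditions can differ.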
Related papers
- ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition [100.30565531246165]
Speech recognition systems require dataset-specific tuning.
This tuning requirement can lead to systems failing to generalise to other datasets and domains.
We introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition system.
(2022-10-24T15:58:48Z)
- ASR in German: A Detailed Error Analysis [0.0]
This work presents a selection of ASR model architectures that are pretrained on the German language and evaluates them on a benchmark of diverse test datasets.
It identifies cross-architectural prediction errors, classifies those into categories and traces the sources of errors per category back into training data.
(2022-04-12T08:25:01Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
(2022-03-18T06:40:39Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
(2022-03-10T08:58:18Z)
- The Norwegian Parliamentary Speech Corpus [0.5874142059884521]
The Norwegian Parliamentary Speech Corpus (NPSC) is a speech dataset with recordings of meetings from Stortinget, the Norwegian parliament.
It is the first publicly available dataset containing unscripted Norwegian speech designed for training of automatic speech recognition (ASR) systems.
Training on the NPSC is shown to have a "democratizing" effect in terms of dialects, as improvements are generally larger for dialects with higher WER from the baseline system.
(2022-01-26T11:41:55Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general applications of pretrained speech representations on advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
(2021-10-09T15:06:09Z)
- On-the-Fly Aligned Data Augmentation for Sequence-to-Sequence ASR [10.261890123213622]
We propose an on-the-fly data augmentation method for automatic speech recognition (ASR).
Our method, called Aligned Data Augmentation (ADA) for ASR, replaces transcribed tokens and the speech representations in an aligned manner to generate training pairs.
(2021-04-03T13:00:00Z)
- Arabic Speech Recognition by End-to-End, Modular Systems and Human [56.96327247226586]
We perform a comprehensive benchmarking for end-to-end transformer ASR, modular HMM-DNN ASR, and human speech recognition.
For ASR the end-to-end work led to 12.5%, 27.5%, 23.8% WER; a new performance milestone for the MGB2, MGB3, and MGB5 challenges respectively.
Our results suggest that human performance in the Arabic language is still considerably better than the machine with an absolute WER gap of 3.6% on average.
(2021-01-21T05:55:29Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
(2020-01-30T18:01:31Z)
- Binary and Multitask Classification Model for Dutch Anaphora Resolution: Die/Dat Prediction [18.309099448064273]
Correct use of the Dutch pronouns 'die' and 'dat' is a stumbling block for both native and non-native speakers of Dutch.
This study constructs the first neural network model for Dutch demonstrative and relative pronoun resolution.
(2020-01-09T12:34:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.