Lahjoita puhetta -- a large-scale corpus of spoken Finnish with some
benchmarks
- URL: http://arxiv.org/abs/2203.12906v1
- Date: Thu, 24 Mar 2022 07:50:25 GMT
- Title: Lahjoita puhetta -- a large-scale corpus of spoken Finnish with some
benchmarks
- Authors: Anssi Moisio, Dejan Porjazovski, Aku Rouhe, Yaroslav Getman, Anja
Virkkunen, Tam\'as Gr\'osz, Krister Lind\'en and Mikko Kurimo
- Abstract summary: The Donate Speech campaign has so far succeeded in gathering approximately 3600 hours of ordinary, colloquial Finnish speech.
The primary goals of the collection were to create a representative, large-scale resource to study spontaneous spoken Finnish and to accelerate the development of language technology and speech-based services.
We present the collection process and the collected corpus, and showcase its versatility through multiple use cases.
- Score: 9.160401226886947
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Donate Speech campaign has so far succeeded in gathering approximately
3600 hours of ordinary, colloquial Finnish speech into the Lahjoita puhetta
(Donate Speech) corpus. The corpus includes over twenty thousand speakers from
all the regions of Finland and from all age brackets. The primary goals of the
collection were to create a representative, large-scale resource to study
spontaneous spoken Finnish and to accelerate the development of language
technology and speech-based services. In this paper, we present the collection
process and the collected corpus, and showcase its versatility through multiple
use cases. The evaluated use cases include: automatic speech recognition of
spontaneous speech, detection of age, gender, dialect and topic and metadata
analysis. We provide benchmarks for the use cases, as well down loadable,
trained baseline systems with open-source code for reproducibility. One further
use case is to verify the metadata and transcripts given in this corpus itself,
and to suggest artificial metadata and transcripts for the part of the corpus
where it is missing.
Related papers
- The Faetar Benchmark: Speech Recognition in a Very Under-Resourced Language [4.077418516695122]
Faetar has no standard orthography, has virtually no existing textual or speech resources other than what is included in the benchmark.
The corpus comes from field recordings, most of which are noisy, for which only 5 hrs have matching transcriptions.
We report baseline results from state-of-the-art multilingual speech foundation models with a best phone error rate of 30.4%.
arXiv Detail & Related papers (2024-09-12T14:55:33Z) - ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text
Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z) - Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z) - Annotated Speech Corpus for Low Resource Indian Languages: Awadhi,
Bhojpuri, Braj and Magahi [2.84214511742034]
We develop a speech corpus for four low-resource Indo-Aryan languages -- Awadhi, Bhojpuri, Braj and Magahi.
The total size of the corpus currently stands at approximately 18 hours.
We discuss our methodology for data collection in these languages, most of which was done in the middle of the COVID-19 pandemic.
arXiv Detail & Related papers (2022-06-26T17:28:38Z) - The Open corpus of the Veps and Karelian languages: overview and
applications [52.77024349608834]
The Open Corpus of the Veps and Karelian Languages (VepKar) is an extension of the Veps created in 2009.
The VepKar corpus comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced system of search.
Future plans include developing a speech module for working with audio recordings and a syntactic tagging module using morphological analysis outputs.
arXiv Detail & Related papers (2022-06-08T13:05:50Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - Creating Speech-to-Speech Corpus from Dubbed Series [8.21384946488751]
We propose an unsupervised approach to construct speech-to-speech corpus, aligned on short segment levels.
Our methodology exploits video frames, speech recognition, machine translation, and noisy frames removal algorithms to match segments in both languages.
Our pipeline was able to generate 17 hours of paired segments, which is about 47% of the corpus.
arXiv Detail & Related papers (2022-03-07T18:52:48Z) - Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z) - English Accent Accuracy Analysis in a State-of-the-Art Automatic Speech
Recognition System [3.4888132404740797]
We evaluate a state-of-the-art automatic speech recognition model, using unseen data from a corpus with a wide variety of labeled English accents.
We show that there is indeed an accuracy bias in terms of accentual variety, favoring the accents most prevalent in the training corpus.
arXiv Detail & Related papers (2021-05-09T08:24:33Z) - A Corpus for Large-Scale Phonetic Typology [112.19288631037055]
We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology.
aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants.
arXiv Detail & Related papers (2020-05-28T13:03:51Z) - Mapping Languages: The Corpus of Global Language Use [0.0]
This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping.
In total, the corpus contains 423 billion words representing 148 languages and 158 countries.
arXiv Detail & Related papers (2020-04-02T03:42:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.