Swiss Parliaments Corpus, an Automatically Aligned Swiss German Speech
to Standard German Text Corpus
- URL: http://arxiv.org/abs/2010.02810v2
- Date: Wed, 9 Jun 2021 11:47:40 GMT
- Title: Swiss Parliaments Corpus, an Automatically Aligned Swiss German Speech
to Standard German Text Corpus
- Authors: Michel Plüss and Lukas Neukom and Christian Scheller and Manfred Vogel
- Abstract summary: This first version of the corpus is based on publicly available data of the Bernese cantonal parliament and consists of 293 hours of data.
It was created using a novel forced sentence alignment procedure and an alignment quality estimator.
We trained Automatic Speech Recognition (ASR) models as baselines on different subsets of the data and achieved a Word Error Rate (WER) of 0.278 and a BLEU score of 0.586 on the SPC test set.
- Score: 2.610806620660055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present the Swiss Parliaments Corpus (SPC), an automatically aligned Swiss
German speech to Standard German text corpus. This first version of the corpus
is based on publicly available data of the Bernese cantonal parliament and
consists of 293 hours of data. It was created using a novel forced sentence
alignment procedure and an alignment quality estimator, which can be used to
trade off corpus size and quality. We trained Automatic Speech Recognition
(ASR) models as baselines on different subsets of the data and achieved a Word
Error Rate (WER) of 0.278 and a BLEU score of 0.586 on the SPC test set. The
corpus is freely available for download.
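As a point of reference for the WER figure quoted above, the following is a minimal sketch of how Word Error Rate is commonly computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not state which text normalization or scoring tool the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is plain whitespace splitting, which is
    an assumption and not necessarily what the SPC baselines used.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to reach an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "das ist test" against the reference "das ist ein test" counts one deleted word out of four reference words. Under this definition a WER of 0.278 corresponds to roughly 28 word errors per 100 reference words.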
Related papers
- Towards Robust Speech Representation Learning for Thousands of Languages [77.2890285555615]
Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data.
We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages.
arXiv Detail & Related papers (2024-06-30T21:40:26Z)
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z)
- STT4SG-350: A Speech Corpus for All Swiss German Dialect Regions [5.6787416472329495]
We present STT4SG-350 (Speech-to-Text for Swiss German), a corpus of Swiss German speech annotated with Standard German text at the sentence level.
The data is collected using a web app in which the speakers are shown Standard German sentences, which they translate to Swiss German and record.
It contains 343 hours of speech from all dialect regions and is the largest public speech corpus for Swiss German to date.
arXiv Detail & Related papers (2023-05-30T08:49:38Z)
- A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
- A New Aligned Simple German Corpus [2.7981463795578927]
We present a new sentence-aligned monolingual corpus for Simple German -- German.
It contains multiple document-aligned sources which we have aligned using automatic sentence-alignment methods.
The quality of our sentence alignments, as measured by F1-score, surpasses previous work.
arXiv Detail & Related papers (2022-09-02T15:14:04Z)
- SDS-200: A Swiss German Speech to Standard German Text Corpus [5.370317759946287]
We present SDS-200, a corpus of Swiss German dialectal speech with Standard German text translations.
The data was collected using a web recording tool that is open to the public.
The data consists of 200 hours of speech by around 4000 different speakers and covers a large part of the Swiss-German dialect landscape.
arXiv Detail & Related papers (2022-05-19T12:16:29Z)
- German Parliamentary Corpus (GerParCor) [63.17616047204443]
We introduce the German Parliament Corpus (GerParCor).
GerParCor is a genre-specific corpus of German-language parliamentary protocols from three centuries and four countries.
All protocols were preprocessed by means of the NLP pipeline of spaCy3 and automatically annotated with metadata regarding their session date.
arXiv Detail & Related papers (2022-04-21T22:06:55Z)
- Lahjoita puhetta -- a large-scale corpus of spoken Finnish with some benchmarks [9.160401226886947]
The Donate Speech campaign has so far succeeded in gathering approximately 3600 hours of ordinary, colloquial Finnish speech.
The primary goals of the collection were to create a representative, large-scale resource to study spontaneous spoken Finnish and to accelerate the development of language technology and speech-based services.
We present the collection process and the collected corpus, and showcase its versatility through multiple use cases.
arXiv Detail & Related papers (2022-03-24T07:50:25Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- UWSpeech: Speech to Speech Translation for Unwritten Languages [145.37116196042282]
We develop a translation system for unwritten languages, named UWSpeech, which converts target unwritten speech into discrete tokens with a converter.
We propose a method called XL-VAE, which enhances vector quantized variational autoencoder (VQ-VAE) with cross-lingual (XL) speech recognition.
Experiments on Fisher Spanish-English conversation translation dataset show that UWSpeech outperforms direct translation and VQ-VAE baseline by about 16 and 10 BLEU points respectively.
arXiv Detail & Related papers (2020-06-14T15:22:12Z)
- FT Speech: Danish Parliament Speech Corpus [21.190182627955817]
This paper introduces FT Speech, a new speech corpus created from the recorded meetings of the Danish Parliament.
The corpus contains over 1,800 hours of transcribed speech by a total of 434 speakers.
It is significantly larger in duration, vocabulary, and amount of spontaneous speech than the existing public speech corpora for Danish.
arXiv Detail & Related papers (2020-05-25T19:51:18Z)