Using Kaldi for Automatic Speech Recognition of Conversational Austrian
German
- URL: http://arxiv.org/abs/2301.06475v1
- Date: Mon, 16 Jan 2023 15:28:28 GMT
- Title: Using Kaldi for Automatic Speech Recognition of Conversational Austrian
German
- Authors: Julian Linke, Saskia Wepner, Gernot Kubin and Barbara Schuppler
- Abstract summary: This paper presents ASR experiments with read and conversational Austrian German as target.
We improve a Kaldi-based ASR system by incorporating a knowledge-based pronunciation lexicon.
We achieve a best WER of 0.4% on Austrian German read speech and a best average WER of 48.5% on conversational speech.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As dialogue systems become more interactional and social, accurate
automatic speech recognition (ASR) of conversational speech is also of
increasing importance. This shifts the focus from short, spontaneous,
task-oriented dialogues to the much higher complexity of casual face-to-face
conversations. However, the collection and annotation of such conversations is
a time-consuming process and data is sparse for this specific speaking style.
This paper presents ASR experiments with read and conversational Austrian
German as target. In order to deal with having only limited resources available
for conversational German and, at the same time, with a large variation among
speakers with respect to pronunciation characteristics, we improve a
Kaldi-based ASR system by incorporating a (large) knowledge-based pronunciation
lexicon, while exploring different data-based methods to restrict the number of
pronunciation variants for each lexical entry. We achieve a best WER of 0.4% on
Austrian German read speech and a best average WER of 48.5% on conversational
speech. We find that using our best pronunciation lexicon yields performance
similar to that obtained by increasing the size of the data used for the
language model by approx. 360% to 760%. Our findings indicate that for
low-resource scenarios -- despite the general trend in speech technology
towards using data-based methods only -- knowledge-based approaches are a
successful, efficient method.
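The WER figures quoted above are the standard metric: word-level edit distance (substitutions, insertions, deletions) normalized by reference length. A minimal sketch of how WER is computed (not the authors' scoring code, which in a Kaldi setup would typically use Kaldi's own scoring scripts):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over word tokens,
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all of ref[:i]
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all of hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why values like 48.5% on conversational speech still leave substantial headroom.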
Related papers
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into a native accent, overcoming these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
- CKERC: Joint Large Language Models with Commonsense Knowledge for Emotion Recognition in Conversation [0.0]
Emotion recognition in conversation (ERC) is a task which predicts the emotion of an utterance in the context of a conversation.
We propose a novel joint large language models with commonsense knowledge framework for emotion recognition in conversation, namely CKERC.
arXiv Detail & Related papers (2024-03-12T02:37:11Z)
- Evaluation of Automated Speech Recognition Systems for Conversational Speech: A Linguistic Perspective [0.0]
We take a linguistic perspective, using French as a case study on the disambiguation of French homophones.
Our contribution aims to provide more insight into human speech transcription accuracy under conditions that reproduce those of state-of-the-art ASR systems.
arXiv Detail & Related papers (2022-11-05T04:35:40Z)
- End-to-end Spoken Conversational Question Answering: Task, Dataset and Model [92.18621726802726]
In spoken question answering, systems are designed to answer questions from contiguous text spans within the related speech transcripts.
We propose a new Spoken Conversational Question Answering (SCQA) task, aiming to enable systems to model complex dialogue flows.
Our main objective is to build a system that handles conversational questions based on audio recordings, and to explore the plausibility of providing additional cues from other modalities during information gathering.
arXiv Detail & Related papers (2022-04-29T17:56:59Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- "How Robust r u?": Evaluating Task-Oriented Dialogue Systems on Spoken Conversations [87.95711406978157]
This work presents a new benchmark on spoken task-oriented conversations.
We study multi-domain dialogue state tracking and knowledge-grounded dialogue modeling.
Our data set enables speech-based benchmarking of task-oriented dialogue systems.
arXiv Detail & Related papers (2021-09-28T04:51:04Z)
- Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis [59.27994987902646]
Research on learning spoken styles from historical conversations is still in its infancy.
Existing approaches consider only the transcripts of the historical conversations, neglecting the spoken styles in the historical speech.
We propose a spoken style learning approach with multi-modal hierarchical context encoding.
arXiv Detail & Related papers (2021-06-11T08:33:52Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
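The BPE-dropout technique in the last entry can be sketched as follows: during subword segmentation, each applicable BPE merge is randomly skipped with probability p, so the same word yields varied segmentations across training epochs. This is a simplified illustration with a hypothetical merge table, not the cited paper's implementation (in particular, a skipped merge here may be retried on a later pass):

```python
import random


def bpe_dropout_segment(word, merges, p=0.1, rng=None):
    """Segment `word` into subwords using BPE merges, dropping each
    candidate merge with probability p (p=0 gives plain BPE)."""
    rng = rng or random.Random(0)
    # Lower index = higher merge priority, as in standard BPE.
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while True:
        # Adjacent pairs that have a merge rule, each kept with prob 1 - p.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in rank and rng.random() >= p
        ]
        if not candidates:
            break
        _, i = min(candidates)  # apply the highest-priority surviving merge
        symbols[i : i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols
```

With p=0 the word collapses to its usual BPE segmentation; with p=1 it falls back to characters, which is the source of the regularizing variation.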
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.