Prak: An automatic phonetic alignment tool for Czech
- URL: http://arxiv.org/abs/2304.08431v1
- Date: Mon, 17 Apr 2023 16:51:24 GMT
- Title: Prak: An automatic phonetic alignment tool for Czech
- Authors: Václav Hanžl, Adléta Hanžlová
- Abstract summary: Free open-source tool generates phone sequences from Czech text and time-aligns them with audio.
A Czech pronunciation generator is composed of simple rule-based blocks capturing the logic of the language.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Labeling speech down to the identity and time boundaries of phones is a
labor-intensive part of phonetic research. To simplify this work, we created a
free open-source tool that generates phone sequences from Czech text and
time-aligns them with audio.
Low architectural complexity makes the design approachable for students of
phonetics. The acoustic model, a ReLU neural network with 56k weights, was
trained with PyTorch on a small amount of CommonVoice data. The alignment and
variant-selection decoder is implemented in Python with a matrix library.
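A minimal sketch of what such a frame-level acoustic model could look like in PyTorch follows. The layer sizes, input features, and phone inventory are assumptions made for illustration; the abstract states only that the network uses ReLU units and has about 56k weights.

```python
import torch.nn as nn

# Illustrative sizes only: the paper reports a ReLU network with ~56k
# weights but does not give layer dimensions, so these are assumptions.
N_INPUT = 200    # e.g. a few stacked frames of spectral features (assumed)
N_HIDDEN = 128   # assumed hidden width
N_PHONES = 48    # rough size of a Czech phone inventory (assumed)

class AcousticModel(nn.Module):
    """Frame-level phone classifier: feature vector in, phone scores out."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_INPUT, N_HIDDEN), nn.ReLU(),
            nn.Linear(N_HIDDEN, N_HIDDEN), nn.ReLU(),
            nn.Linear(N_HIDDEN, N_PHONES),
        )

    def forward(self, frames):      # frames: (batch, N_INPUT)
        return self.net(frames)     # logits; log-softmax them for alignment

model = AcousticModel()
print(sum(p.numel() for p in model.parameters()))  # ~48k with these sizes
```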
The Czech pronunciation generator is composed of simple rule-based blocks
that capture the logic of the language where possible, allowing details of
the transcription approach to be modified.
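One way to picture such a block is a plain Python function over a list of phones, with blocks composed into a pipeline. The rule below (Czech word-final obstruent devoicing, as in "led" pronounced with a final [t]) is standard Czech phonology, but the symbols and structure are illustrative and not Prak's actual rule set.

```python
# Voiced obstruent -> voiceless counterpart (illustrative subset).
DEVOICE = {"b": "p", "d": "t", "g": "k", "v": "f", "z": "s", "ž": "š", "h": "ch"}

def final_devoicing(phones: list[str]) -> list[str]:
    """Czech devoices word-final voiced obstruents: 'led' -> [l e t]."""
    if phones and phones[-1] in DEVOICE:
        return phones[:-1] + [DEVOICE[phones[-1]]]
    return phones

def apply_blocks(phones: list[str], blocks) -> list[str]:
    """Compose rule blocks into a pipeline; each block is one small,
    transparent piece of transcription logic that a student can edit."""
    for block in blocks:
        phones = block(phones)
    return phones

print(apply_blocks(list("led"), [final_devoicing]))  # ['l', 'e', 't']
```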
Compared with previously used tools, data preparation efficiency is improved.
The tool runs on Mac, Linux and Windows, in the Praat GUI or from the command
line; it chooses the correct pronunciation variant in most cases, including
glottal stop detection; it algorithmically captures most of Czech assimilation
logic; and it is both didactic and practical.
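As a sketch of how alignment-based variant choice can work in a matrix library (a generic forced-alignment recipe, not Prak's actual decoder): score each candidate pronunciation, e.g. with and without an initial glottal stop, by the best monotonic alignment of its phones to the frame-level log-posteriors, then keep the winner.

```python
import numpy as np

def align_score(logprob: np.ndarray, phone_ids: list[int]) -> float:
    """Best log-score of a monotonic alignment of the phone sequence to
    logprob, a (T frames x P phones) matrix of log-posteriors. Each phone
    covers at least one frame, so T >= len(phone_ids) is required."""
    T, S = logprob.shape[0], len(phone_ids)
    D = np.full((T, S), -np.inf)               # D[t, s]: best score at (t, s)
    D[0, 0] = logprob[0, phone_ids[0]]
    for t in range(1, T):
        for s in range(S):
            prev = D[t - 1, s]                 # stay in the current phone
            if s > 0:
                prev = max(prev, D[t - 1, s - 1])  # or enter the next phone
            D[t, s] = prev + logprob[t, phone_ids[s]]
    return D[-1, -1]

def pick_variant(logprob: np.ndarray, variants: list[list[int]]) -> list[int]:
    """Return the pronunciation variant whose forced alignment scores best."""
    return max(variants, key=lambda v: align_score(logprob, v))

# Toy usage: 6 frames over 3 phone classes; compare two hypothetical variants.
rng = np.random.default_rng(0)
logprob = np.log(rng.dirichlet(np.ones(3), size=6))
print(pick_variant(logprob, [[0, 1], [2, 0, 1]]))
```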
Related papers
- Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
With limited training data, finetuning self-supervised representations is the better-performing and more viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Pronunciation Generation for Foreign Language Words in Intra-Sentential Code-Switching Speech Recognition [14.024346215923972]
Code-Switching refers to the phenomenon of switching languages within a sentence or discourse.
In this paper, we use limited code-switching data as driving material and explore a shortcut to quickly developing intra-sentential code-switching recognition capability.
arXiv Detail & Related papers (2022-10-26T13:19:35Z)
- Shennong: a Python toolbox for audio speech features extraction [15.816237141746562]
Shennong is a Python toolbox and command-line utility for speech features extraction.
It implements a wide range of well-established state-of-the-art algorithms, including spectro-temporal filters, pre-trained neural networks, pitch estimators and speaker normalization methods.
This paper illustrates its use on three applications: a comparison of speech feature performance on a phone discrimination task, an analysis of a Vocal Tract Length Normalization model as a function of the speech duration used for training, and a comparison of pitch estimation algorithms under various noise conditions.
arXiv Detail & Related papers (2021-12-10T14:08:52Z)
- "Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation [49.610188741500274]
An end-to-end speech-to-text translation (ST) takes audio in a source language and outputs the text in a target language.
Existing methods are limited by the amount of parallel corpora available.
We build a system to fully utilize signals in a parallel ST corpus.
arXiv Detail & Related papers (2020-09-21T09:19:07Z)
- RECOApy: Data recording, pre-processing and phonetic transcription for end-to-end speech-based applications [4.619541348328938]
RECOApy streamlines the steps of data recording and pre-processing required in end-to-end speech-based applications.
The tool implements an easy-to-use interface for prompted speech recording, spectrogram and waveform analysis, utterance-level normalisation and silence trimming.
The grapheme-to-phoneme (G2P) converters are deep neural network (DNN) based architectures trained on lexicons extracted from the Wiktionary online collaborative resource.
arXiv Detail & Related papers (2020-09-11T15:26:55Z)
- KoSpeech: Open-Source Toolkit for End-to-End Korean Speech Recognition [1.7955614278088239]
KoSpeech is an end-to-end Korean automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch.
We propose preprocessing methods for KsponSpeech corpus and a baseline model for benchmarks.
Our baseline model achieved a 10.31% character error rate (CER) on the KsponSpeech corpus with the acoustic model alone.
arXiv Detail & Related papers (2020-09-07T13:25:36Z)
- Phonological Features for 0-shot Multilingual Speech Synthesis [50.591267188664666]
We show that code-switching is possible for languages unseen during training, even within monolingual models.
We generate intelligible, code-switched speech in a new language at test time, including the approximation of sounds never seen in training.
arXiv Detail & Related papers (2020-08-06T18:25:18Z)
- Towards Zero-shot Learning for Automatic Phonemic Transcription [82.9910512414173]
A more challenging problem is to build phonemic transcribers for languages with zero training data.
Our model is able to recognize unseen phonemes in the target language without any training data.
It achieves 7.7% better phoneme error rate on average over a standard multilingual model.
arXiv Detail & Related papers (2020-02-26T20:38:42Z)