Algorithms For Automatic Accentuation And Transcription Of Russian Texts In Speech Recognition Systems
- URL: http://arxiv.org/abs/2410.02538v1
- Date: Thu, 3 Oct 2024 14:43:43 GMT
- Title: Algorithms For Automatic Accentuation And Transcription Of Russian Texts In Speech Recognition Systems
- Authors: Olga Iakovenko, Ivan Bondarenko, Mariya Borovikova, Daniil Vodolazsky
- Abstract summary: This paper presents an overview of a rule-based system for automatic accentuation and phonemic transcription of Russian texts.
Two parts of the developed system, accentuation and transcription, use different approaches to achieve correct phonemic representations of input phrases.
The developed toolkit is written in Python and is accessible on GitHub for any interested researcher.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents an overview of a rule-based system for automatic accentuation and phonemic transcription of Russian texts for speech-connected tasks such as Automatic Speech Recognition (ASR). Two parts of the developed system, accentuation and transcription, use different approaches to achieve correct phonemic representations of input phrases. Accentuation is based on the "Grammatical Dictionary of the Russian Language" by A.A. Zaliznyak and a Wiktionary corpus. To distinguish homographs, the accentuation system also utilises morphological information about the sentences obtained from Recurrent Neural Networks (RNNs). The transcription algorithms apply the rules presented in the monograph of B.M. Lobanov and L.I. Tsirulnik, "Computer Synthesis and Voice Cloning". The rules described in the present paper are implemented in an open-source module, which can be of use in any scientific study connected to ASR or Speech-To-Text (STT) tasks. Automatically marked-up text annotations of the Russian Voxforge database were used as training data for an acoustic model in CMU Sphinx. The resulting acoustic model was evaluated by cross-validation, with a mean Word Accuracy of 71.2%. The developed toolkit is written in Python and is accessible on GitHub for any interested researcher.
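As a rough illustration of this two-stage design (a sketch, not the authors' actual GitHub API), the Python fragment below pairs a dictionary lookup that resolves homographs by a part-of-speech tag with a single toy transcription rule; the mini-lexicon and tag names are invented for the example.

```python
# Hypothetical mini-lexicon: word -> {tag -> index of the stressed vowel}.
# "замок" is a classic homograph: за́мок (castle) vs. замо́к (lock).
ACCENT_DICT = {
    "замок": {"NOUN:castle": 1, "NOUN:lock": 3},
    "вода": {"NOUN": 3},
}

def accentuate(word: str, tag: str) -> str:
    """Insert a stress mark '+' after the stressed vowel, if the word is known."""
    senses = ACCENT_DICT.get(word.lower())
    if not senses:
        return word  # out-of-vocabulary: leave unstressed
    idx = senses.get(tag, next(iter(senses.values())))
    return word[:idx + 1] + "+" + word[idx + 1:]

def transcribe(accented: str) -> list[str]:
    """Toy rule set: unstressed 'о' reduces to 'а' (akanye); everything else
    maps one-to-one. Real rule sets such as Lobanov & Tsirulnik's are far richer."""
    phones, stressed = [], set()
    for ch in accented:
        if ch == "+":
            stressed.add(len(phones) - 1)  # mark the previous vowel as stressed
        else:
            phones.append(ch)
    return ["а" if p == "о" and i not in stressed else p
            for i, p in enumerate(phones)]

print(transcribe(accentuate("вода", "NOUN")))  # ['в', 'а', 'д', 'а']
```

In a full pipeline, a morphological tagger (RNN-based in the paper) would supply the tag argument that disambiguates homographs such as "замок".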
Related papers
- Automatic Speech Recognition for Hindi [0.6292138336765964]
The research involved developing a web application and designing a web interface for speech recognition.
The web application manages large volumes of audio files and their transcriptions, facilitating human correction of ASR transcripts.
The web interface for speech recognition records 16 kHz mono audio from any device running the web app, performs voice activity detection (VAD), and sends the audio to the recognition engine.
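As a hedged sketch of that flow, the fragment below runs the real webrtcvad package over 16 kHz mono PCM frames and forwards only the speech frames; send_to_asr is a hypothetical stand-in for the actual recognition-engine call.

```python
import wave

import webrtcvad  # pip install webrtcvad

def speech_frames(path: str, frame_ms: int = 30):
    """Yield (is_speech, frame_bytes) for 30 ms frames of a 16 kHz mono WAV."""
    vad = webrtcvad.Vad(2)  # aggressiveness 0 (lenient) to 3 (strict)
    with wave.open(path, "rb") as wav:
        assert wav.getframerate() == 16000 and wav.getnchannels() == 1
        samples = int(16000 * frame_ms / 1000)  # samples per frame
        while True:
            frame = wav.readframes(samples)
            if len(frame) < samples * 2:  # 2 bytes per 16-bit sample
                break
            yield vad.is_speech(frame, 16000), frame

def send_to_asr(chunk: bytes) -> None:
    ...  # hypothetical: forward the audio chunk to the recognition engine

for is_speech, frame in speech_frames("utterance.wav"):
    if is_speech:
        send_to_asr(frame)
```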
arXiv Detail & Related papers (2024-06-26T07:39:20Z)
- MUST&P-SRL: Multi-lingual and Unified Syllabification in Text and Phonetic Domains for Speech Representation Learning [0.76146285961466]
We present a methodology for linguistic feature extraction, focusing on automatically syllabifying words in multiple languages.
Across both the textual and phonetic domains, the method extracts phonetic transcriptions from text, marks stress, and performs unified automatic syllabification.
The system was built with open-source components and resources.
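For intuition only, here is a minimal onset-maximization syllabifier; it is not the MUST&P-SRL implementation, and both the vowel set and the legal-onset inventory are toy assumptions.

```python
VOWELS = set("aeiouy")
# Toy inventory of consonant clusters allowed to start a syllable.
LEGAL_ONSETS = {"", "b", "br", "l", "m", "n", "p", "pl", "r", "s", "st", "str", "t", "tr"}

def syllabify(word: str) -> list[str]:
    """Split a word so every syllable gets the longest legal onset."""
    nuclei = [i for i, ch in enumerate(word) if ch in VOWELS]
    if not nuclei:
        return [word]
    syllables, start = [], 0
    for prev, nxt in zip(nuclei, nuclei[1:]):
        cluster = word[prev + 1:nxt]  # consonants between two nuclei
        # Smallest k such that cluster[k:] is legal = longest legal onset.
        split = next(k for k in range(len(cluster) + 1)
                     if cluster[k:] in LEGAL_ONSETS)
        syllables.append(word[start:prev + 1 + split])
        start = prev + 1 + split
    syllables.append(word[start:])
    return syllables

print(syllabify("astronomy"))  # ['a', 'stro', 'no', 'my']
```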
arXiv Detail & Related papers (2023-10-17T19:27:23Z)
- Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution uses Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Network (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
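The general mechanics of second-pass rescoring can be sketched on an n-best list: interpolate the first-pass score with a second-pass language-model score and re-rank. This does not reproduce the paper's HMM-GMM/DNN pipeline, and lm_logprob is a hypothetical placeholder scorer.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    am_score: float  # first-pass (acoustic model) log-probability

def lm_logprob(text: str) -> float:
    """Hypothetical second-pass scorer (e.g. a neural language model)."""
    return -float(len(text.split()))  # placeholder: fewer words = more probable

def rescore(nbest: list[Hypothesis], lm_weight: float = 0.5) -> list[Hypothesis]:
    """Re-rank hypotheses by the interpolated score; higher is better."""
    return sorted(nbest,
                  key=lambda h: h.am_score + lm_weight * lm_logprob(h.text),
                  reverse=True)

nbest = [Hypothesis("recognize speech", -12.0),
         Hypothesis("wreck a nice beach", -11.5)]
print(rescore(nbest)[0].text)  # 'recognize speech'
```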
arXiv Detail & Related papers (2023-10-14T23:16:05Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech-phoneme pairs, achieving minimally supervised text-to-speech (TTS), voice conversion (VC), and ASR.
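A minimal sketch, assuming a CLIP-style symmetric InfoNCE objective over paired batches; the linear "encoders" and feature sizes are placeholders, not CTAP's actual architecture.

```python
import torch
import torch.nn.functional as F

phone_enc = torch.nn.Linear(64, 128)   # stands in for the phoneme encoder
speech_enc = torch.nn.Linear(80, 128)  # stands in for the speech encoder

def contrastive_loss(phones: torch.Tensor, speech: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (phoneme, speech) inputs."""
    p = F.normalize(phone_enc(phones), dim=-1)   # (B, 128)
    s = F.normalize(speech_enc(speech), dim=-1)  # (B, 128)
    logits = p @ s.T / 0.07                      # cosine similarity / temperature
    targets = torch.arange(len(p))               # i-th phoneme matches i-th speech
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = contrastive_loss(torch.randn(8, 64), torch.randn(8, 80))
loss.backward()
```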
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection to improve the performance of a widely used industrial streaming model, the Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data, then investigate injecting the generated text into the T-T model, either explicitly through Text-To-Speech (TTS) conversion or implicitly by tying the speech and text latent spaces.
Experimental results on a T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that injecting the generated code-switching text significantly boosts the performance of T-T models.
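One simple flavor of such generation, sketched with an invented toy lexicon below, is dictionary substitution of selected words; the paper's actual generation strategy and T-T injection mechanisms are more involved.

```python
import random

# Toy Mandarin-English lexicon, invented for the example.
LEXICON = {"meeting": "会议", "tomorrow": "明天", "report": "报告"}

def code_switch(sentence: str, rate: float = 0.5, seed: int = 0) -> str:
    """Randomly swap in-lexicon English words for their Mandarin translations."""
    rng = random.Random(seed)
    return " ".join(LEXICON[w] if w in LEXICON and rng.random() < rate else w
                    for w in sentence.split())

print(code_switch("please finish the report before the meeting tomorrow"))
```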
arXiv Detail & Related papers (2023-03-20T09:13:27Z) - Speech Aware Dialog System Technology Challenge (DSTC11) [12.841429336655736]
Most research on task-oriented dialog modeling is based on written text input.
We created three spoken versions of the popular written-domain MultiWOZ task: (a) TTS-Verbatim, where written user inputs were converted into speech waveforms using a TTS system; (b) Human-Verbatim, where humans spoke the user inputs verbatim; and (c) Human-Paraphrased, where humans paraphrased the user inputs.
arXiv Detail & Related papers (2022-12-16T20:30:33Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
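The notion of a discrete speech tokenizer can be sketched as nearest-centroid quantization of frame features with repeated units collapsed; the random codebook below stands in for one trained on real speech features, and none of this is SpeechLM's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(100, 39))  # 100 units over 39-dim MFCC-like frames

def tokenize_speech(frames: np.ndarray) -> list[int]:
    """Map each frame to its nearest codebook unit and collapse repeats."""
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    units = dists.argmin(axis=1)
    return [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

frames = rng.normal(size=(200, 39))  # placeholder "speech" features
print(tokenize_speech(frames)[:10])
```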
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-Speech (TTS) systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
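The core intuition of dictionary-based polyphone disambiguation can be caricatured as scoring each entry's gloss against the sentence context; Dict-TTS learns this matching with semantic attention, for which plain word overlap stands in here, and the English toy entry is illustrative only.

```python
# Hypothetical dictionary: word -> [(pronunciation, gloss), ...].
DICT = {
    "bass": [("b ae s", "a kind of fish caught in lakes"),
             ("b ey s", "low frequency sound in music")],
}

def choose_pronunciation(word: str, context: str) -> str:
    """Pick the pronunciation whose gloss shares the most words with the context."""
    ctx = set(context.lower().split())
    return max(DICT[word], key=lambda entry: len(ctx & set(entry[1].split())))[0]

print(choose_pronunciation("bass", "he turned up the bass in the music mix"))  # 'b ey s'
```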
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
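On top of clustered, deduplicated units, the pseudo language can be made more compact with BPE-style merges; the sketch below shows just that merge step over invented unit IDs and is not Wav2Seq's implementation.

```python
from collections import Counter

def bpe(units: list[int], num_merges: int) -> list[tuple]:
    """Greedily merge the most frequent adjacent token pair, num_merges times."""
    tokens = [(u,) for u in units]  # each cluster ID starts as its own token
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)  # tuple concatenation = merged token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(bpe([3, 7, 3, 7, 9, 3, 7, 9, 2], 2))  # [(3, 7), (3, 7, 9), (3, 7, 9), (2,)]
```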
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)