Romanian Speech Recognition Experiments from the ROBIN Project
- URL: http://arxiv.org/abs/2111.12028v1
- Date: Tue, 23 Nov 2021 17:35:00 GMT
- Title: Romanian Speech Recognition Experiments from the ROBIN Project
- Authors: Andrei-Marius Avram, Vasile Păiș, Dan Tufiș
- Abstract summary: This paper presents different speech recognition experiments with deep neural networks, focusing on producing fast (under 100 ms latency from the network itself) yet still reliable models.
Even though low latency is one of the key desired characteristics, the final deep neural network model achieves state-of-the-art results for recognizing the Romanian language.
- Score: 0.21485350418225244
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: One of the fundamental functionalities for accepting a socially assistive
robot is its communication capabilities with other agents in the environment.
In the context of the ROBIN project, situational dialogue through voice
interaction with a robot was investigated. This paper presents different speech
recognition experiments with deep neural networks, focusing on producing fast (under 100 ms latency from the network itself) yet still reliable models.
Even though one of the key desired characteristics is low latency, the final deep neural network model achieves state-of-the-art results for recognizing the Romanian language, obtaining a 9.91% word error rate (WER) when combined with a language model, thus improving over previous results while also offering better runtime performance. Additionally, we explore two
modules for correcting the ASR output (hyphen and capitalization restoration
and unknown words correction), targeting the ROBIN project's goals (dialogue in
closed micro-worlds). We design a modular architecture based on APIs allowing
an integration engine (either in the robot or external) to chain together the
available modules as needed. Finally, we test the proposed design by
integrating it in the RELATE platform and making the ASR service available to
web users by either uploading a file or recording new speech.
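The modular, API-based design described in the abstract can be sketched as follows. This is a minimal illustration of an integration engine chaining configurable modules (ASR, hyphen/capitalization restoration, unknown-word correction); all function names, module behaviors, and the sample vocabulary are hypothetical, not the ROBIN project's actual API.

```python
# Hypothetical sketch of the chained-module pipeline: an integration engine
# calls independent modules in sequence. All names and behaviors here are
# illustrative stand-ins, not the real ROBIN services.

def asr(audio: bytes) -> str:
    # Stand-in for the speech recognition service.
    return "merg la birou maine"

def restore_caps_and_hyphens(text: str) -> str:
    # Stand-in for the hyphen and capitalization restoration module.
    return text.capitalize()

def correct_unknown_words(text: str, vocab: set) -> str:
    # Stand-in for the unknown-words correction module: keep in-vocabulary
    # words, flag the rest (a real module would propose replacements).
    return " ".join(w if w.lower() in vocab else f"[{w}]" for w in text.split())

def run_pipeline(audio: bytes, modules) -> str:
    # The integration engine simply chains whichever modules are configured.
    result = audio
    for module in modules:
        result = module(result)
    return result

vocab = {"merg", "la", "birou"}
modules = [asr, restore_caps_and_hyphens,
           lambda t: correct_unknown_words(t, vocab)]
print(run_pipeline(b"...", modules))  # -> "Merg la birou [maine]"
```

Because the engine only sees a list of callables, modules can be reordered or swapped (e.g. in the robot or behind an external API) without changing the pipeline code, which is the point of the modular design described above.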
Related papers
- RoboNeuron: A Modular Framework Linking Foundation Models and ROS for Embodied AI [13.74517467087138]
RoboNeuron is a universal deployment framework for embodied intelligence. It is the first framework to deeply integrate the cognitive capabilities of Large Language Models (LLMs) and Vision-Language-Action (VLA) models with the real-time execution backbone of the Robot Operating System (ROS).
arXiv Detail & Related papers (2025-12-11T07:58:19Z)
- Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage [66.67531241554546]
End-to-end speech-in speech-out dialogue systems are emerging as a powerful alternative to traditional ASR-LLM-TTS pipelines. We introduce the first approach to extend tool use directly into speech-in speech-out systems. We propose Streaming Retrieval-Augmented Generation (Streaming RAG), a novel framework that reduces user-perceived latency by predicting tool queries in parallel with user speech.
arXiv Detail & Related papers (2025-10-02T14:18:20Z)
- Automatic Speech Recognition for Hindi [0.6292138336765964]
The research involved developing a web application and designing a web interface for speech recognition.
The web application manages large volumes of audio files and their transcriptions, facilitating human correction of ASR transcripts.
The web interface for speech recognition records 16 kHz mono audio from any device running the web app, performs voice activity detection (VAD), and sends the audio to the recognition engine.
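The voice activity detection (VAD) step mentioned above can be sketched with a simple energy gate over 16 kHz mono frames. The frame length and threshold below are illustrative assumptions; the paper does not specify its VAD algorithm.

```python
# Minimal sketch of energy-based VAD on 16 kHz mono audio: frame the signal
# and keep only frames whose mean amplitude crosses a threshold. Frame size
# and threshold are assumed values, not taken from the paper.

SAMPLE_RATE = 16_000          # 16 kHz mono, as in the web interface above
FRAME_MS = 20                 # assumed frame length
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000  # 320 samples per frame

def is_speech(frame, threshold=500.0):
    # Mean absolute amplitude as a crude energy measure (16-bit PCM scale).
    return sum(abs(s) for s in frame) / len(frame) > threshold

def active_frames(samples):
    # Yield only the frames the detector marks as speech; a client would
    # forward these to the recognition engine instead of the raw stream.
    for i in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN):
        frame = samples[i:i + FRAME_LEN]
        if is_speech(frame):
            yield frame
```

A production system would use a trained detector (e.g. the WebRTC VAD) rather than a fixed energy threshold; this only illustrates the gating step between capture and recognition.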
arXiv Detail & Related papers (2024-06-26T07:39:20Z)
- Non-autoregressive real-time Accent Conversion model with voice cloning [0.0]
We have developed a non-autoregressive model for real-time accent conversion with voice cloning.
The model generates native-sounding L1 speech with minimal latency based on input L2 speech.
The model has the ability to save, clone and change the timbre, gender and accent of the speaker's voice in real time.
arXiv Detail & Related papers (2024-05-21T19:07:26Z)
- Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution consists of using Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Networks (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z)
- Streaming Speech-to-Confusion Network Speech Recognition [19.720334657478475]
We present a novel streaming ASR architecture that outputs a confusion network while maintaining limited latency.
We show that 1-best results of our model are on par with a comparable RNN-T system.
We also show that our model outperforms a strong RNN-T baseline on a far-field voice assistant task.
arXiv Detail & Related papers (2023-06-02T20:28:14Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- SAN: a robust end-to-end ASR model architecture [0.0]
We propose the Siamese Adversarial Network (SAN) architecture for automatic speech recognition.
SAN constructs two sub-networks to differentiate the audio feature input and then introduces a loss to unify the output distribution of these sub-networks.
We conduct numerical experiments with the SAN model on several datasets for the automatic speech recognition task.
arXiv Detail & Related papers (2022-10-27T09:36:25Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Self-supervised Learning with Random-projection Quantizer for Speech Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition.
The approach learns a model to predict masked speech signals, in the form of discrete labels.
It achieves similar word-error-rates as previous work using self-supervised learning with non-streaming models.
arXiv Detail & Related papers (2022-02-03T21:29:04Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
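Several entries above quote word and character error rates (WER/CER). As a quick reference, here is a minimal self-contained sketch of how these metrics are commonly computed: Levenshtein edit distance between hypothesis and reference, normalized by reference length. The scoring conventions of any specific paper above may differ (e.g. in tokenization or text normalization).

```python
# WER/CER sketch: dynamic-programming Levenshtein distance over token
# sequences, divided by the reference length.

def edit_distance(ref, hyp):
    # Single-row Levenshtein DP; dp[j] holds the distance between the
    # first i reference tokens and the first j hypothesis tokens.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    # Word error rate: edit distance over words / number of reference words.
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    # Character error rate: same computation over characters.
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `wer("a b c", "a x c")` is 1/3: one substitution against a three-word reference.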
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all of the above) and is not responsible for any consequences of its use.