Open Source State-Of-the-Art Solution for Romanian Speech Recognition
- URL: http://arxiv.org/abs/2511.03361v1
- Date: Wed, 05 Nov 2025 11:02:16 GMT
- Title: Open Source State-Of-the-Art Solution for Romanian Speech Recognition
- Authors: Gabriel Pirlogeanu, Alexandru-Lucian Georgescu, Horia Cucu,
- Abstract summary: We present a new Romanian Automatic Speech Recognition (ASR) system based on NVIDIA's FastConformer architecture.<n>We train our model on a large corpus of weakly supervised transcriptions, totaling over 2,600 hours of speech.<n>Our system achieves state-of-the-art performance across all Romanian evaluation benchmarks.
- Score: 47.27624927463166
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we present a new state-of-the-art Romanian Automatic Speech Recognition (ASR) system based on NVIDIA's FastConformer architecture--explored here for the first time in the context of Romanian. We train our model on a large corpus of, mostly, weakly supervised transcriptions, totaling over 2,600 hours of speech. Leveraging a hybrid decoder with both Connectionist Temporal Classification (CTC) and Token-Duration Transducer (TDT) branches, we evaluate a range of decoding strategies including greedy, ALSD, and CTC beam search with a 6-gram token-level language model. Our system achieves state-of-the-art performance across all Romanian evaluation benchmarks, including read, spontaneous, and domain-specific speech, with up to 27% relative WER reduction compared to previous best-performing systems. In addition to improved transcription accuracy, our approach demonstrates practical decoding efficiency, making it suitable for both research and deployment in low-latency ASR applications.
Related papers
- How to Evaluate Speech Translation with Source-Aware Neural MT Metrics [32.41110835446445]
In machine translation, neural metrics incorporating the source text achieve stronger correlation with human judgments.<n>In this work, we conduct the first systematic study of source-aware metrics for speech-to-text translation.<n>We introduce a novel two-step cross-lingual re-segmentation algorithm to address the alignment mismatch between synthetic sources and reference translations.
arXiv Detail & Related papers (2025-11-05T08:49:22Z) - Tailored Design of Audio-Visual Speech Recognition Models using Branchformers [0.0]
We propose a novel framework for the design of parameter-efficient Audio-Visual Speech Recognition systems.<n>To be more precise, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio-visual unified encoder.<n>Our models achieve competitive word error rates (WER) of approximately 2.5% for English and surpass existing approaches for Spanish.
arXiv Detail & Related papers (2024-07-09T07:15:56Z) - Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Strategies for improving low resource speech to text translation relying
on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech to text translation (ST)
We conducted experiments on both simulated and real-low resource setups, on language pairs English - Portuguese, and Tamasheq - French respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z) - UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units.
We enhance the model performance by subword prediction in the first-pass decoder.
We show that the proposed methods boost the performance even when predicting spectrogram in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z) - Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z) - Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces
and Conformers [33.725831884078744]
The proposed CTC-CRF framework inherits the data-efficiency of the hybrid approach and the simplicity of the end-to-end approach.
We investigate techniques to enable the recently developed wordpiece modeling units and Conformer neural networks to be succesfully applied in CTC-CRFs.
arXiv Detail & Related papers (2021-07-07T04:12:06Z) - Large scale weakly and semi-supervised learning for low-resource video
ASR [32.33625853364696]
We compare self-labeling and weakly-supervised pretraining approaches for transcribing social media videos.
We find that sequence-level distillation for encoder-decoder models provides the largest relative WER reduction of 20% compared to the strongest data-augmented supervised baseline.
arXiv Detail & Related papers (2020-05-16T03:08:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.