Related papers: Open Source State-Of-the-Art Solution for Romanian Speech Recognition

URL: http://arxiv.org/abs/2511.03361v1
Date: Wed, 05 Nov 2025 11:02:16 GMT
Title: Open Source State-Of-the-Art Solution for Romanian Speech Recognition
Authors: Gabriel Pirlogeanu, Alexandru-Lucian Georgescu, Horia Cucu
Abstract summary: We present a new Romanian Automatic Speech Recognition (ASR) system based on NVIDIA's FastConformer architecture. We train our model on a large corpus of weakly supervised transcriptions, totaling over 2,600 hours of speech. Our system achieves state-of-the-art performance across all Romanian evaluation benchmarks.
Score: 47.27624927463166
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this work, we present a new state-of-the-art Romanian Automatic Speech Recognition (ASR) system based on NVIDIA's FastConformer architecture, explored here for the first time in the context of Romanian. We train our model on a large corpus of mostly weakly supervised transcriptions, totaling over 2,600 hours of speech. Leveraging a hybrid decoder with both Connectionist Temporal Classification (CTC) and Token-Duration Transducer (TDT) branches, we evaluate a range of decoding strategies, including greedy, ALSD, and CTC beam search with a 6-gram token-level language model. Our system achieves state-of-the-art performance across all Romanian evaluation benchmarks, including read, spontaneous, and domain-specific speech, with up to 27% relative WER reduction compared to previous best-performing systems. In addition to improved transcription accuracy, our approach demonstrates practical decoding efficiency, making it suitable for both research and deployment in low-latency ASR applications.
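As a rough illustration of the "relative WER reduction" figure quoted in the abstract, the sketch below computes word error rate as word-level edit distance divided by reference length, and then the relative reduction between two WER values. The function names and the example WER values (10.0 and 7.3) are hypothetical and chosen only to show the arithmetic; they are not results from the paper, and this is not the authors' evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
    # Hypothetical WERs (in percent), used only to illustrate the formula.
    print(round(relative_wer_reduction(10.0, 7.3), 2))
```

Under these assumed numbers, a drop from 10.0% to 7.3% WER is a 2.7-point absolute reduction but a 27% relative one, which is the kind of quantity the abstract reports.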

Related papers

How to Evaluate Speech Translation with Source-Aware Neural MT Metrics [32.41110835446445]
In machine translation, neural metrics incorporating the source text achieve stronger correlation with human judgments. In this work, we conduct the first systematic study of source-aware metrics for speech-to-text translation. We introduce a novel two-step cross-lingual re-segmentation algorithm to address the alignment mismatch between synthetic sources and reference translations.
arXiv Detail & Related papers (2025-11-05T08:49:22Z)
Tailored Design of Audio-Visual Speech Recognition Models using Branchformers [0.0]
We propose a novel framework for the design of parameter-efficient Audio-Visual Speech Recognition systems. To be more precise, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio-visual unified encoder. Our models achieve competitive word error rates (WER) of approximately 2.5% for English and surpass existing approaches for Spanish.
arXiv Detail & Related papers (2024-07-09T07:15:56Z)
Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness. We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets. Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework. First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes. Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST). We conducted experiments on both simulated and real low-resource setups, on the language pairs English-Portuguese and Tamasheq-French, respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units. We enhance the model performance by subword prediction in the first-pass decoder. We show that the proposed methods boost the performance even when predicting spectrogram in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task. This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers [33.725831884078744]
The proposed CTC-CRF framework inherits the data-efficiency of the hybrid approach and the simplicity of the end-to-end approach. We investigate techniques to enable the recently developed wordpiece modeling units and Conformer neural networks to be successfully applied in CTC-CRFs.
arXiv Detail & Related papers (2021-07-07T04:12:06Z)
Large scale weakly and semi-supervised learning for low-resource video ASR [32.33625853364696]
We compare self-labeling and weakly-supervised pretraining approaches for transcribing social media videos. We find that sequence-level distillation for encoder-decoder models provides the largest relative WER reduction of 20% compared to the strongest data-augmented supervised baseline.
arXiv Detail & Related papers (2020-05-16T03:08:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.