BEA-Base: A Benchmark for ASR of Spontaneous Hungarian
- URL: http://arxiv.org/abs/2202.00601v1
- Date: Tue, 1 Feb 2022 17:45:22 GMT
- Title: BEA-Base: A Benchmark for ASR of Spontaneous Hungarian
- Authors: P. Mihajlik, A. Balog, T. E. Gráczi, A. Kohári, B. Tarján and K. Mády
- Abstract summary: We introduce BEA-Base, a subset of the BEA spoken Hungarian database comprising mostly spontaneous speech of 140 speakers.
It is built specifically to assess Automatic Speech Recognition, primarily for conversational AI applications.
Results show the feasibility of using BEA-Base for training and evaluation of Hungarian speech recognition systems.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hungarian is spoken by 15 million people; still, easily accessible Automatic
Speech Recognition (ASR) benchmark datasets - especially for spontaneous speech
- have been practically unavailable. In this paper, we introduce BEA-Base, a
subset of the BEA spoken Hungarian database comprising mostly spontaneous
speech of 140 speakers. It is built specifically to assess ASR, primarily for
conversational AI applications. After defining the speech recognition subsets
and task, several baselines - including classic HMM-DNN hybrid and end-to-end
approaches augmented by cross-language transfer learning - are developed using
open-source toolkits. The best results obtained are based on multilingual
self-supervised pretraining, achieving a 45% recognition error rate reduction
as compared to the classical approach - without the application of an external
language model or additional supervised data. The results show the feasibility
of using BEA-Base for training and evaluation of Hungarian speech recognition
systems.
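For context, the reported 45% figure is a relative error rate reduction, i.e. the drop in error expressed as a fraction of the baseline's error rate. A minimal sketch of the computation, using hypothetical error rates chosen only for illustration (the paper's actual per-system rates are not quoted here):

```python
def relative_reduction(baseline_err: float, new_err: float) -> float:
    """Relative error rate reduction, as a percentage of the baseline error."""
    return 100.0 * (baseline_err - new_err) / baseline_err

# Hypothetical word error rates, for illustration only (not from the paper):
baseline_wer = 40.0  # e.g. a classic HMM-DNN hybrid baseline
e2e_wer = 22.0       # e.g. an end-to-end model with multilingual pretraining

print(f"{relative_reduction(baseline_wer, e2e_wer):.0f}% relative reduction")
```

Note that a relative reduction can be large even when the absolute error rates remain high, which is common for spontaneous-speech benchmarks.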
Related papers
- PRiSM: Benchmarking Phone Realization in Speech Models [70.82595415252682]
Phone recognition (PR) serves as the atomic interface for language-agnostic modeling for cross-lingual speech processing and phonetic analysis. We introduce PRiSM, the first open-source benchmark designed to expose blind spots in phonetic perception.
arXiv Detail & Related papers (2026-01-20T15:00:36Z) - Toward Conversational Hungarian Speech Recognition: Introducing the BEA-Large and BEA-Dialogue Datasets [1.3575563075756973]
We introduce two new datasets -- BEA-Large and BEA-Dialogue -- constructed from the previously unprocessed portions of the Hungarian speech corpus named BEA. BEA-Large extends BEA-Base with 255 hours of spontaneous speech from 433 speakers, enriched with detailed segment-level metadata. BEA-Dialogue, comprising 85 hours of spontaneous conversations, is a Hungarian speech corpus featuring natural dialogues partitioned into speaker-independent subsets.
arXiv Detail & Related papers (2025-11-17T16:02:08Z) - Efficient Multilingual ASR Finetuning via LoRA Language Experts [59.27778147311189]
This paper proposes an efficient finetuning framework for customized multilingual ASR via prepared LoRA language experts based on Whisper. Through LoRA expert fusion or knowledge distillation, our approach achieves better recognition performance on target languages than standard fine-tuning methods. Experimental results demonstrate that the proposed models yield approximately 10% and 15% relative performance gains in language-aware and language-agnostic scenarios.
arXiv Detail & Related papers (2025-06-11T07:06:27Z) - Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling [50.62091603179394]
Whisper, one of the most advanced ASR models, handles 99 languages effectively.
However, Whisper struggles with unseen languages, those not included in its pre-training.
We propose methods that exploit these relationships to enhance ASR performance on unseen languages.
arXiv Detail & Related papers (2024-12-21T04:05:43Z) - MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models [59.80042864360884]
Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately.
This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions.
arXiv Detail & Related papers (2024-11-27T09:01:08Z) - CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z) - A Novel Self-training Approach for Low-resource Speech Recognition [15.612232220719653]
We propose a self-training approach for automatic speech recognition (ASR) for low-resource settings.
Our approach significantly improves word error rate, achieving a relative improvement of 14.94%.
Our proposed approach reports the best results on the Common Voice Punjabi dataset.
arXiv Detail & Related papers (2023-08-10T01:02:45Z) - Towards hate speech detection in low-resource languages: Comparing ASR
to acoustic word embeddings on Wolof and Swahili [16.424308444697015]
We consider hate speech detection through keyword spotting on radio broadcasts.
One approach is to build an automatic speech recognition system for the target low-resource language.
We compare this to using acoustic word embedding models that map speech segments to a space where matching words have similar vectors.
arXiv Detail & Related papers (2023-06-01T07:25:10Z) - Adapting Multi-Lingual ASR Models for Handling Multiple Talkers [63.151811561972515]
State-of-the-art large-scale universal speech models (USMs) show a decent automatic speech recognition (ASR) performance across multiple domains and languages.
We propose an approach to adapt USMs for multi-talker ASR.
We first develop an enhanced version of serialized output training to jointly perform multi-talker ASR and utterance timestamp prediction.
arXiv Detail & Related papers (2023-05-30T05:05:52Z) - Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [76.95115818308918]
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages.
This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages.
We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
arXiv Detail & Related papers (2023-03-02T07:47:18Z) - From English to More Languages: Parameter-Efficient Model Reprogramming
for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z) - An Automatic Speech Recognition System for Bengali Language based on
Wav2Vec2 and Transfer Learning [0.0]
This paper aims to improve the speech recognition performance of the Bengali language by adopting speech recognition technology on the E2E structure based on the transfer learning framework.
The proposed method effectively models the Bengali language and achieves a Levenshtein Mean Distance score of 3.819 on the test dataset of 7747 samples, when only 1000 samples of the train dataset were used for training.
arXiv Detail & Related papers (2022-09-16T18:20:16Z) - Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource
End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z) - Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters [31.705705891482594]
We study training a single acoustic model for multiple languages with the aim of improving automatic speech recognition (ASR) performance on low-resource languages.
We compare three variants of multilingual training from a single joint model without knowing the input language, to using this information, to multiple heads.
We show that multilingual training of ASR models on several languages can improve recognition performance, in particular, on low resource languages.
arXiv Detail & Related papers (2020-07-06T18:43:38Z) - Unsupervised Cross-lingual Representation Learning for Speech
Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z) - Streaming Language Identification using Combination of Acoustic
Representations and ASR Hypotheses [13.976935216584298]
A common approach to solve multilingual speech recognition is to run multiple monolingual ASR systems in parallel.
We propose an approach that learns and combines acoustic level representations with embeddings estimated on ASR hypotheses.
To reduce the processing cost and latency, we exploit a streaming architecture to identify the spoken language early.
arXiv Detail & Related papers (2020-06-01T04:08:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.