Domain Specific Wav2vec 2.0 Fine-tuning For The SE&R 2022 Challenge
- URL: http://arxiv.org/abs/2207.14418v1
- Date: Fri, 29 Jul 2022 00:48:40 GMT
- Title: Domain Specific Wav2vec 2.0 Fine-tuning For The SE&R 2022 Challenge
- Authors: Alef Iury Siqueira Ferreira and Gustavo dos Reis Oliveira
- Abstract summary: This paper presents our efforts to build a robust ASR model for the shared task Automatic Speech Recognition for spontaneous and prepared speech & Speech Emotion Recognition in Portuguese (SE&R 2022).
The goal of the challenge is to advance the ASR research for the Portuguese language, considering prepared and spontaneous speech in different dialects.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents our efforts to build a robust ASR model for the shared
task Automatic Speech Recognition for spontaneous and prepared speech & Speech
Emotion Recognition in Portuguese (SE&R 2022). The goal of the challenge is to
advance the ASR research for the Portuguese language, considering prepared and
spontaneous speech in different dialects. Our method consists of fine-tuning an
ASR model in a domain-specific approach, applying gain normalization and
selective noise insertion. The proposed method improved over the strong
baseline provided on the test set in 3 of the 4 tracks available.
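
The abstract only names the preprocessing steps, so the snippet below is a minimal sketch of how gain normalization and selective noise insertion could be applied to waveforms before fine-tuning; the target level, insertion probability, and SNR values are illustrative assumptions, not numbers taken from the paper.

```python
import numpy as np

def gain_normalize(wave, target_dbfs=-27.0):
    """Scale a mono waveform so its RMS level sits at a target dBFS (assumed value)."""
    rms = np.sqrt(np.mean(wave ** 2)) + 1e-9
    gain = 10.0 ** ((target_dbfs - 20.0 * np.log10(rms)) / 20.0)
    return np.clip(wave * gain, -1.0, 1.0)

def selective_noise_insertion(wave, noise, insert_prob=0.5, snr_db=15.0, rng=None):
    """Mix background noise into only a fraction of utterances, at a chosen SNR (illustrative parameters)."""
    rng = rng or np.random.default_rng()
    if rng.random() > insert_prob:
        return wave  # leave this utterance clean
    # Tile or truncate the noise clip to match the utterance length.
    reps = int(np.ceil(len(wave) / len(noise)))
    noise = np.tile(noise, reps)[: len(wave)]
    speech_power = np.mean(wave ** 2) + 1e-9
    noise_power = np.mean(noise ** 2) + 1e-9
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return np.clip(wave + scale * noise, -1.0, 1.0)

# Example: preprocess one utterance before it enters wav2vec 2.0 fine-tuning.
utterance = np.random.uniform(-0.1, 0.1, 16000).astype(np.float32)  # stand-in audio
noise_clip = np.random.uniform(-0.05, 0.05, 8000).astype(np.float32)  # stand-in noise
processed = selective_noise_insertion(gain_normalize(utterance), noise_clip)
```

Under this reading, the domain-specific part of the method would amount to fine-tuning a separate wav2vec 2.0 checkpoint per speech style (prepared vs. spontaneous), but the exact training recipe is not detailed in the abstract.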
Related papers
- Language-Universal Speech Attributes Modeling for Zero-Shot Multilingual Spoken Keyword Recognition [26.693942793501204]
We propose a novel language-universal approach to end-to-end automatic spoken keyword recognition (SKR).
Wav2Vec2.0 is used to generate robust speech representations, followed by a linear output layer to produce attribute sequences.
A non-trainable pronunciation model then maps sequences of attributes into spoken keywords in a multilingual setting.
arXiv Detail & Related papers (2024-06-04T16:59:11Z) - ASR advancements for indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana [0.0]
We propose a reliable ASR model for each target language by crawling speech corpora spanning diverse sources.
We show that the number of freeze fine-tuning updates and the dropout rate are more vital hyperparameters than the total number of epochs or the learning rate.
We openly release our best models; for two of the languages, Wa'ikhana and Kotiria, no other ASR model had been reported until now.
arXiv Detail & Related papers (2024-04-12T10:12:38Z) - Convoifilter: A case study of doing cocktail party speech recognition [59.80042864360884]
The model can decrease ASR's word error rate (WER) from 80% to 26.4% through this approach.
We openly share our pre-trained model to foster further research: hf.co/nguyenvulebinh/voice-filter.
arXiv Detail & Related papers (2023-08-22T12:09:30Z) - Transsion TSUP's speech recognition system for ASRU 2023 MADASR
Challenge [11.263392524468625]
The system focuses on adapting ASR models for low-resource Indian languages.
The proposed method achieved word error rates (WER) of 24.17%, 24.43%, 15.97%, and 15.97% for Bengali language in the four tracks, and WER of 19.61%, 19.54%, 15.48%, and 15.48% for Bhojpuri language in the four tracks.
arXiv Detail & Related papers (2023-07-20T00:55:01Z) - VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge [95.6159736804855]
The VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22) was held in conjunction with INTERSPEECH 2022.
The goal of this challenge was to evaluate how well state-of-the-art speaker recognition systems can diarise and recognise speakers from speech obtained "in the wild".
arXiv Detail & Related papers (2023-02-20T19:27:14Z) - From English to More Languages: Parameter-Efficient Model Reprogramming
for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z) - DUAL: Textless Spoken Question Answering with Speech Discrete Unit
Adaptive Learning [66.71308154398176]
Spoken Question Answering (SQA) has gained research attention and made remarkable progress in recent years.
Existing SQA methods rely on Automatic Speech Recognition (ASR) transcripts, which are time- and cost-prohibitive to collect.
This work proposes an ASR transcript-free SQA framework named Discrete Unit Adaptive Learning (DUAL), which leverages unlabeled data for pre-training and is fine-tuned by the SQA downstream task.
arXiv Detail & Related papers (2022-03-09T17:46:22Z) - Sentiment-Aware Automatic Speech Recognition pre-training for enhanced
Speech Emotion Recognition [11.760166084942908]
We propose a novel multi-task pre-training method for Speech Emotion Recognition (SER).
We pre-train the SER model simultaneously on Automatic Speech Recognition (ASR) and sentiment classification tasks.
We generate targets for the sentiment classification using a text-to-sentiment model trained on publicly available data.
arXiv Detail & Related papers (2022-01-27T22:20:28Z) - On Prosody Modeling for ASR+TTS based Voice Conversion [82.65378387724641]
In voice conversion, an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents.
Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity.
We propose to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP).
arXiv Detail & Related papers (2021-07-20T13:30:23Z) - The Sequence-to-Sequence Baseline for the Voice Conversion Challenge
2020: Cascading ASR and TTS [66.06385966689965]
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
We consider a naive approach for voice conversion (VC), which is to first transcribe the input speech with an automatic speech recognition (ASR) model.
We revisit this method under a sequence-to-sequence (seq2seq) framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit.
arXiv Detail & Related papers (2020-10-06T02:27:38Z)