The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition
- URL: http://arxiv.org/abs/2505.13971v2
- Date: Tue, 27 May 2025 05:03:46 GMT
- Title: The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition
- Authors: Ming Gao, Shilong Wu, Hang Chen, Jun Du, Chin-Hui Lee, Shinji Watanabe, Jingdong Chen, Sabato Marco Siniscalchi, Odette Scharenborg, et al.
- Abstract summary: The MISP 2025 Challenge focuses on multi-modal, multi-device meeting transcription by incorporating video modality alongside audio. The best-performing systems achieved significant improvements over the baseline.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Meetings are a valuable yet challenging scenario for speech applications due to complex acoustic conditions. This paper summarizes the outcomes of the MISP 2025 Challenge, hosted at Interspeech 2025, which focuses on multi-modal, multi-device meeting transcription by incorporating video modality alongside audio. The tasks include Audio-Visual Speaker Diarization (AVSD), Audio-Visual Speech Recognition (AVSR), and Audio-Visual Diarization and Recognition (AVDR). We present the challenge's objectives, tasks, dataset, baseline systems, and solutions proposed by participants. The best-performing systems achieved significant improvements over the baseline: the top AVSD model achieved a Diarization Error Rate (DER) of 8.09%, improving by 7.43%; the top AVSR system achieved a Character Error Rate (CER) of 9.48%, improving by 10.62%; and the best AVDR system achieved a concatenated minimum-permutation Character Error Rate (cpCER) of 11.56%, improving by 72.49%.
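For readers reproducing these numbers, the metrics follow standard definitions: CER is the character-level edit distance divided by the reference length; DER is the fraction of scored time that is missed speech, false-alarm speech, or speaker confusion; cpCER concatenates each speaker's utterances and takes the minimum CER over all speaker permutations. A minimal Python sketch of CER and cpCER (illustrative only, not the challenge's official scoring tool; it assumes equal speaker counts in reference and hypothesis):

```python
from itertools import permutations

def edit_distance(ref, hyp):
    """Levenshtein distance via dynamic programming (one rolling row)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def cer(ref, hyp):
    """Character Error Rate: edit distance / reference length."""
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def cp_cer(refs_by_spk, hyps_by_spk):
    """Concatenated minimum-permutation CER: concatenate each speaker's
    utterances, then take the best speaker mapping (assumes the same
    number of speakers on both sides)."""
    refs = ["".join(u) for u in refs_by_spk]
    total_len = sum(len(r) for r in refs)
    best = float("inf")
    for perm in permutations(hyps_by_spk):
        hyps = ["".join(u) for u in perm]
        errs = sum(edit_distance(list(r), list(h)) for r, h in zip(refs, hyps))
        best = min(best, errs / max(total_len, 1))
    return best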
Related papers
- The Interspeech 2025 Speech Accessibility Project Challenge
The Interspeech 2025 Speech Accessibility Project Challenge launched with over 400 hours of SAP data collected and transcribed from more than 500 individuals with diverse speech disabilities. Twelve out of 22 valid teams outperformed the whisper-large-v2 baseline in terms of Word Error Rate and Semantic Score. The top team achieved both the lowest WER of 8.11% and the highest SemScore of 88.44%, setting new benchmarks for future ASR systems in recognizing impaired speech.
arXiv Detail & Related papers (2025-07-29T17:50:59Z)
- Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges
The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. This paper outlines the challenges' design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions.
arXiv Detail & Related papers (2025-07-24T07:56:24Z)
- Cocktail-Party Audio-Visual Speech Recognition
This study introduces a novel audio-visual cocktail-party dataset designed to benchmark current AVSR systems. We contribute a 1526-hour AVSR dataset comprising both talking-face and silent-face segments, enabling significant performance gains in cocktail-party environments. Our approach reduces WER by 67% relative to the state-of-the-art in extreme noise, from 119% to 39.2%, without relying on explicit segmentation cues.
arXiv Detail & Related papers (2025-06-02T19:07:51Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on the MISP2022-AVSR dataset.
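The core fusion operation can be illustrated with a single cross-attention block in PyTorch: audio features attend to visual features, and the same block can be repeated at several encoder depths. This is a hypothetical sketch of the general technique, not the authors' implementation; dimensions and module names are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """One cross-attention fusion layer: audio queries attend to visual
    keys/values; stacking it at several encoder depths gives multi-layer
    fusion (a sketch, not the paper's exact module)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio: (batch, T_audio, dim), visual: (batch, T_video, dim)
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)  # residual connection

# Usage: audio frames usually arrive at a higher rate than video frames.
layer = CrossModalFusion()
a = torch.randn(2, 100, 256)   # audio frames
v = torch.randn(2, 25, 256)    # video frames (lower frame rate)
out = layer(a, v)              # (2, 100, 256)
```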
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
The benefit of video input is consistently demonstrated in mask-based MVDR speech separation and in DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-ends.
Experiments were conducted on overlapped and reverberant speech data constructed by simulation or replay of the Oxford LRS2 dataset.
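For context, mask-based MVDR derives per-frequency beamforming weights from speech and noise spatial covariance matrices estimated with time-frequency masks. A minimal NumPy sketch for one frequency bin, using the standard Souden formulation (an illustration, not the paper's code):

```python
import numpy as np

def mvdr_weights(Y, speech_mask, noise_mask, ref_ch=0):
    """Mask-based MVDR for a single frequency bin.
    Y: (channels, frames) complex STFT; masks: (frames,) values in [0, 1]."""
    # Mask-weighted spatial covariance matrices
    Phi_s = (speech_mask * Y) @ Y.conj().T / speech_mask.sum()
    Phi_n = (noise_mask * Y) @ Y.conj().T / noise_mask.sum()
    # Souden MVDR: w = Phi_n^{-1} Phi_s u_ref / trace(Phi_n^{-1} Phi_s)
    num = np.linalg.solve(Phi_n, Phi_s)
    return num[:, ref_ch] / np.trace(num)

# Apply per bin: enhanced = w.conj() @ Y  ->  (frames,) enhanced spectrum
```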
arXiv Detail & Related papers (2023-07-06T10:50:46Z)
- On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition
The scarcity of speaker-level data limits the practical use of data-intensive, model-based speaker adaptation methods.
This paper proposes two novel forms of data-efficient, feature-based, on-the-fly speaker adaptation methods.
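As a simple illustration of the feature-based, on-the-fly idea (not the paper's specific methods), per-speaker feature statistics can be updated incrementally and applied as the speaker talks, e.g. streaming cepstral mean and variance normalization via Welford's algorithm:

```python
import numpy as np

class OnlineCMVN:
    """Streaming cepstral mean/variance normalization: a feature-level
    adaptation updated on the fly per speaker (an illustrative stand-in,
    not the paper's proposed methods)."""

    def __init__(self, dim: int, eps: float = 1e-8):
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)   # running sum of squared deviations
        self.eps = eps

    def update(self, frame: np.ndarray) -> np.ndarray:
        # Welford's online update of mean and variance
        self.n += 1
        delta = frame - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (frame - self.mean)
        var = self.m2 / max(self.n - 1, 1)
        return (frame - self.mean) / np.sqrt(var + self.eps)
```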
arXiv Detail & Related papers (2022-03-28T09:12:24Z)
- Robust Self-Supervised Audio-Visual Speech Recognition
We present a self-supervised audio-visual speech recognition framework built upon Audio-Visual HuBERT (AV-HuBERT).
On LRS3, the largest available AVSR benchmark dataset, our approach outperforms the prior state-of-the-art by 50% (28.0% vs. 14.1%) using less than 10% of labeled data.
Our approach reduces the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.
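The quoted relative gains follow directly from the absolute WERs:

```python
def rel_reduction(baseline: float, new: float) -> float:
    """Relative error-rate reduction between two WERs."""
    return (baseline - new) / baseline

print(f"{rel_reduction(28.0, 14.1):.1%}")  # 49.6% -> "by 50%"
print(f"{rel_reduction(25.8, 5.8):.1%}")   # 77.5% -> "over 75%"
```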
arXiv Detail & Related papers (2022-01-05T18:50:50Z)
- The HUAWEI Speaker Diarisation System for the VoxCeleb Speaker Diarisation Challenge
This paper describes the system setup of our submission to the speaker diarisation track (Track 4) of the VoxCeleb Speaker Recognition Challenge 2020.
Our diarisation system uses a well-trained neural-network-based speech enhancement model as a pre-processing front-end for the input speech signals.
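A typical diarisation back-end after such an enhancement front-end clusters per-segment speaker embeddings; a generic sketch (assuming scikit-learn >= 1.2, and not the paper's actual pipeline):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize(embeddings: np.ndarray, threshold: float = 0.5):
    """Cluster per-segment speaker embeddings (n_segments, dim) into
    speaker labels. A generic back-end that typically follows
    enhancement / VAD / embedding extraction; the paper's full
    system differs."""
    return AgglomerativeClustering(
        n_clusters=None,             # infer speaker count
        distance_threshold=threshold,
        metric="cosine",
        linkage="average",
    ).fit_predict(embeddings)
```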
arXiv Detail & Related papers (2020-10-22T12:42:07Z)