CNVSRC 2023: The First Chinese Continuous Visual Speech Recognition Challenge
- URL: http://arxiv.org/abs/2406.10313v1
- Date: Fri, 14 Jun 2024 12:49:38 GMT
- Title: CNVSRC 2023: The First Chinese Continuous Visual Speech Recognition Challenge
- Authors: Chen Chen, Zehua Liu, Xiaolou Li, Lantian Li, Dong Wang
 - Abstract summary: The challenge yielded highly successful results, with the best submission significantly outperforming the baseline.
This paper comprehensively reviews the challenge, encompassing the data profile, task specifications, and baseline system construction.
 - Score: 12.178918299455898
 - License: http://creativecommons.org/licenses/by/4.0/
 - Abstract:   The first Chinese Continuous Visual Speech Recognition Challenge aimed to probe the performance of Large Vocabulary Continuous Visual Speech Recognition (LVC-VSR) on two tasks: (1) Single-speaker VSR for a particular speaker and (2) Multi-speaker VSR for a set of registered speakers. The challenge yielded highly successful results, with the best submission significantly outperforming the baseline, particularly in the single-speaker task. This paper comprehensively reviews the challenge, encompassing the data profile, task specifications, and baseline system construction. It also summarises the representative techniques employed by the submitted systems, highlighting the most effective approaches. Additional information and resources about this challenge can be accessed through the official website at http://cnceleb.org/competition. 
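The abstract does not restate the evaluation metric here, but large-vocabulary Chinese VSR of this kind is conventionally scored by character error rate (CER): character-level edit distance divided by reference length. A minimal sketch in Python, assuming plain-string references and hypotheses (the metric itself, not the challenge's official scoring tool):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance over reference length."""
    ref, hyp = list(reference), list(hypothesis)
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            curr[j] = min(prev[j] + 1,             # deletion
                          curr[j - 1] + 1,         # insertion
                          prev[j - 1] + (r != h))  # substitution
        prev = curr
    return prev[len(hyp)] / max(len(ref), 1)

# One substituted character out of six -> CER of about 0.167
print(cer("今天天气很好", "今天天气不好"))
```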
Related papers
- Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges [58.80034860169605]
The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. This paper outlines the challenges' design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions.
arXiv Detail & Related papers (2025-07-24T07:56:24Z)
- Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge [24.966911190845817]
This paper describes our Triple X speech recognition system submitted to Task 1 of the Multi-Lingual Conversational Speech Language Modeling (MLC-SLM) Challenge. Our work focuses on optimizing speech recognition accuracy in multilingual conversational scenarios through an innovative encoder-adapter-LLM architecture.
arXiv Detail & Related papers (2025-07-23T07:48:33Z)
- Seewo's Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models [4.917936997225074]
This paper presents Seewo's systems for both tracks of the Multilingual Conversational Speech Language Model Challenge (MLC-SLM). We introduce a multi-stage training pipeline that explicitly enhances reasoning and self-correction in speech language models for ASR.
arXiv Detail & Related papers (2025-06-16T09:42:05Z)
- Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM [53.17360668423001]
Overlapping Speech Detection (OSD) aims to identify regions where multiple speakers overlap in a conversation. This work proposes a speaker-aware progressive OSD model that leverages a progressive training strategy to enhance the correlation between subtasks. Experimental results show that the proposed method achieves state-of-the-art performance, with an F1 score of 82.76% on the AMI test set.
arXiv Detail & Related papers (2025-05-29T07:47:48Z)
- SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering [0.0]
We introduce SViQA, a unified speech-vision model that processes spoken questions without text transcription.
Building upon the LLaVA architecture, our framework bridges auditory and visual modalities through two key innovations.
Extensive experimental results on the SBVQA benchmark demonstrate the proposed SViQA's state-of-the-art performance.
arXiv Detail & Related papers (2025-04-01T07:15:32Z)
- Investigation of Speaker Representation for Target-Speaker Speech Processing [49.110228525976794]
This paper aims to address a fundamental question: what is the preferred speaker embedding for target-speaker speech processing tasks?
For the TS-ASR, TSE, and p-VAD tasks, we compare pre-trained speaker encoders that compute speaker embeddings from pre-recorded enrollment speech of the target speaker with ideal speaker embeddings derived directly from the target speaker's identity in the form of a one-hot vector.
Our analysis reveals that speaker verification performance is largely unrelated to TS task performance, that the one-hot vector outperforms enrollment-based embeddings, and that the optimal embedding depends on the input mixture.
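As a rough illustration of the two conditioning signals being compared, the sketch below contrasts an enrollment-based embedding (averaging a pre-trained encoder's outputs over enrollment utterances) with the ideal one-hot identity vector. All names, dimensions, and the stand-in encoder are hypothetical, not the paper's actual models:

```python
import numpy as np

rng = np.random.default_rng(0)

def enrollment_embedding(encoder, enrollment_utts):
    """Average a pre-trained encoder's outputs over enrollment utterances."""
    embs = np.stack([encoder(u) for u in enrollment_utts])
    e = embs.mean(axis=0)
    return e / np.linalg.norm(e)  # length-normalise, as is common practice

def one_hot_embedding(speaker_index, num_speakers):
    """Ideal identity conditioning: a one-hot vector over known speakers."""
    v = np.zeros(num_speakers)
    v[speaker_index] = 1.0
    return v

# Hypothetical stand-in encoder: waveform -> fixed-size embedding.
fake_encoder = lambda wav: rng.standard_normal(8)
enroll = [rng.standard_normal(16000) for _ in range(3)]  # 3 enrollment utterances
print(enrollment_embedding(fake_encoder, enroll))  # shape (8,)
print(one_hot_embedding(2, num_speakers=4))        # [0. 0. 1. 0.]
```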
arXiv Detail & Related papers (2024-10-15T03:58:13Z)
- WavLLM: Towards Robust and Adaptive Speech Large Language Model [93.0773293897888]
We introduce WavLLM, a robust and adaptive speech large language model with dual encoders, and a prompt-aware LoRA weight adapter.
We validate the proposed model on universal speech benchmarks including tasks such as ASR, ST, SV, ER, and also apply it to specialized datasets like Gaokao English listening comprehension set for SQA, and speech Chain-of-Thought (CoT) evaluation set.
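The summary names a prompt-aware LoRA weight adapter; the prompt-conditioning details are not given here, but the underlying LoRA idea, a frozen weight plus a trainable low-rank update, looks roughly like this NumPy sketch (shapes and hyperparameters illustrative only):

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, d_in, d_out, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base weight
        self.A = rng.standard_normal((r, d_in)) * 0.02      # trainable down-projection
        self.B = np.zeros((d_out, r))                       # trainable, zero-initialised
        self.scale = alpha / r

    def __call__(self, x):
        # Base path plus scaled low-rank path; with B = 0 the layer initially
        # behaves exactly like the frozen model.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(d_in=16, d_out=16)
print(layer(np.ones((2, 16))).shape)  # (2, 16)
```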
arXiv Detail & Related papers (2024-03-31T12:01:32Z)
- Summary of the DISPLACE Challenge 2023 -- DIarization of SPeaker and LAnguage in Conversational Environments [28.618333018398122]
In multi-lingual societies, where multiple languages are spoken within a small geographic vicinity, informal conversations often involve a mix of languages. Existing speech technologies may be inefficient at extracting information from such conversations, where the speech data is highly diverse, spanning multiple languages and speakers. The DISPLACE challenge constitutes an open call for evaluating and benchmarking speaker and language diarization technologies under this challenging condition.
arXiv Detail & Related papers (2023-11-21T12:23:58Z)
- Best of Both Worlds: Multi-task Audio-Visual Automatic Speech Recognition and Active Speaker Detection [9.914246432182873]
In noisy conditions, automatic speech recognition can benefit from the addition of visual signals coming from a video of the speaker's face.
Active speaker detection involves selecting at each moment in time which of the visible faces corresponds to the audio.
Recent work has shown that both problems can be solved simultaneously by employing an attention mechanism over the competing video tracks of the speakers' faces, though at some cost in active speaker detection accuracy. This work closes that gap by presenting a single model that can be jointly trained with a multi-task loss.
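A minimal sketch of the kind of attention described above: audio features as the query, one visual feature per candidate face track as keys and values, where the attention weights double as a soft active-speaker assignment. This is an illustrative reconstruction, not the paper's architecture:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend_over_tracks(audio_feat, track_feats):
    """Weight K candidate face tracks by dot-product attention against audio.

    audio_feat:  (D,)   audio frame feature (query)
    track_feats: (K, D) visual features, one per visible face track (keys/values)
    Returns the fused visual feature and the per-track weights, which can be
    read as a soft active-speaker assignment.
    """
    scores = track_feats @ audio_feat / np.sqrt(audio_feat.shape[0])
    weights = softmax(scores)
    return weights @ track_feats, weights

rng = np.random.default_rng(0)
fused, w = attend_over_tracks(rng.standard_normal(64), rng.standard_normal((3, 64)))
print(fused.shape, w.round(2))  # (64,) and a distribution over 3 faces
```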
arXiv Detail & Related papers (2022-05-10T23:03:19Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM builds on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
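For orientation, HuBERT-style pre-training (which WavLM inherits) optimises a masked-prediction objective: discrete pseudo-labels are predicted only at masked frames. A toy version of that loss, with illustrative shapes and random stand-in data:

```python
import numpy as np

def masked_prediction_loss(logits, targets, mask):
    """HuBERT-style objective: cross-entropy over masked frames only.

    logits:  (T, V) frame scores over V discrete pseudo-label units
    targets: (T,)   pseudo-labels, e.g. from k-means over acoustic features
    mask:    (T,)   True where the input frame was masked before encoding
    """
    z = logits - logits.max(axis=-1, keepdims=True)          # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))  # log-softmax
    nll = -logp[np.arange(len(targets)), targets]
    return nll[mask].mean()

rng = np.random.default_rng(0)
T, V = 10, 5
loss = masked_prediction_loss(rng.standard_normal((T, V)),
                              rng.integers(0, V, size=T),
                              rng.random(T) < 0.5)
print(float(loss))
```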
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- Accented Speech Recognition: A Survey [0.0]
We present a survey of current promising approaches to accented speech recognition.
The bias in ASR performance across accents comes at a cost to both users and providers of ASR.
arXiv Detail & Related papers (2021-04-21T20:21:06Z)
- Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech [54.75722224061665]
In this work, we investigate different speaker representations and propose to integrate pretrained and learnable speaker representations.
The FastSpeech 2 model combined with both pretrained and learnable speaker representations shows great generalization ability on few-shot speakers.
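One simple way to combine the two kinds of representation, a frozen pre-trained speaker vector plus a learnable per-speaker lookup projected to a common size and summed, is sketched below. The paper studies several variants; this particular combination and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

class SpeakerConditioner:
    """Combine a frozen pre-trained speaker vector with a learnable lookup.

    Illustrative only: the two representations are projected to a shared
    size and summed before conditioning the acoustic model.
    """
    def __init__(self, num_speakers, pretrained_dim, model_dim=16):
        self.table = rng.standard_normal((num_speakers, model_dim)) * 0.02   # learnable
        self.proj = rng.standard_normal((pretrained_dim, model_dim)) * 0.02  # learnable

    def __call__(self, speaker_id, pretrained_vec):
        return self.table[speaker_id] + pretrained_vec @ self.proj

cond = SpeakerConditioner(num_speakers=10, pretrained_dim=256)
s = cond(speaker_id=3, pretrained_vec=rng.standard_normal(256))
print(s.shape)  # (16,) -- added to FastSpeech 2 encoder states in practice
```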
arXiv Detail & Related papers (2021-03-06T10:14:33Z)
- VoxSRC 2020: The Second VoxCeleb Speaker Recognition Challenge [99.82500204110015]
We held the second installment of the VoxCeleb Speaker Recognition Challenge in conjunction with Interspeech 2020.
The goal of this challenge was to assess how well current speaker recognition technology is able to diarise and recognize speakers in unconstrained or 'in the wild' data.
This paper outlines the challenge, and describes the baselines, methods used, and results.
arXiv Detail & Related papers (2020-12-12T17:20:57Z)
- CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings [87.37967358673252]
We organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6).
The challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition.
This paper provides a baseline description of the CHiME-6 challenge for both segmented multispeaker speech recognition and unsegmented multispeaker speech recognition.
arXiv Detail & Related papers (2020-04-20T12:59:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.