CNVSRC 2023: The First Chinese Continuous Visual Speech Recognition Challenge
- URL: http://arxiv.org/abs/2406.10313v1
- Date: Fri, 14 Jun 2024 12:49:38 GMT
- Title: CNVSRC 2023: The First Chinese Continuous Visual Speech Recognition Challenge
- Authors: Chen Chen, Zehua Liu, Xiaolou Li, Lantian Li, Dong Wang
 - Abstract summary: The challenge yielded highly successful results, with the best submission significantly outperforming the baseline.
This paper comprehensively reviews the challenge, encompassing the data profile, task specifications, and baseline system construction.
 - Score: 12.178918299455898
 - License: http://creativecommons.org/licenses/by/4.0/
 - Abstract:   The first Chinese Continuous Visual Speech Recognition Challenge aimed to probe the performance of Large Vocabulary Continuous Visual Speech Recognition (LVC-VSR) on two tasks: (1) Single-speaker VSR for a particular speaker and (2) Multi-speaker VSR for a set of registered speakers. The challenge yielded highly successful results, with the best submission significantly outperforming the baseline, particularly in the single-speaker task. This paper comprehensively reviews the challenge, encompassing the data profile, task specifications, and baseline system construction. It also summarises the representative techniques employed by the submitted systems, highlighting the most effective approaches. Additional information and resources about this challenge can be accessed through the official website at http://cnceleb.org/competition. 
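The abstract does not restate the evaluation metric here, but large-vocabulary Chinese VSR of this kind is conventionally scored by character error rate (CER): character-level edit distance divided by reference length. A minimal sketch in Python, assuming plain-string references and hypotheses (the metric itself, not the challenge's official scoring tool):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance over reference length."""
    ref, hyp = list(reference), list(hypothesis)
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            curr[j] = min(prev[j] + 1,             # deletion
                          curr[j - 1] + 1,         # insertion
                          prev[j - 1] + (r != h))  # substitution
        prev = curr
    return prev[len(hyp)] / max(len(ref), 1)

# One substituted character out of six -> CER of about 0.167
print(cer("今天天气很好", "今天天气不好"))
```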
Related papers
- Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges [58.80034860169605]
The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. This paper outlines the challenges' design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions.
arXiv Detail & Related papers (2025-07-24T07:56:24Z)
- Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge [24.966911190845817]
This paper describes our Triple X speech recognition system submitted to Task 1 of the Multi-Lingual Conversational Speech Language Modeling (MLC-SLM) Challenge. Our work focuses on optimizing speech recognition accuracy in multilingual conversational scenarios through an innovative encoder-adapter-LLM architecture.
arXiv Detail & Related papers (2025-07-23T07:48:33Z)
- Seewo's Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models [4.917936997225074]
This paper presents Seewo's systems for both tracks of the Multilingual Conversational Speech Language Model Challenge (MLC-SLM). We introduce a multi-stage training pipeline that explicitly enhances reasoning and self-correction in speech language models for ASR.
arXiv Detail & Related papers (2025-06-16T09:42:05Z)
- Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM [53.17360668423001]
Overlapping Speech Detection (OSD) aims to identify regions where multiple speakers overlap in a conversation. This work proposes a speaker-aware progressive OSD model that leverages a progressive training strategy to enhance the correlation between subtasks. Experimental results show that the proposed method achieves state-of-the-art performance, with an F1 score of 82.76% on the AMI test set.
arXiv Detail & Related papers (2025-05-29T07:47:48Z)
- SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering [0.0]
We introduce SViQA, a unified speech-vision model that processes spoken questions without text transcription.
Building upon the LLaVA architecture, our framework bridges auditory and visual modalities through two key innovations.
Extensive experimental results on the SBVQA benchmark demonstrate the proposed SViQA's state-of-the-art performance.
arXiv Detail & Related papers (2025-04-01T07:15:32Z)
- Investigation of Speaker Representation for Target-Speaker Speech Processing [49.110228525976794]
This paper aims to address a fundamental question: what is the preferred speaker embedding for target-speaker speech processing tasks?
For the TS-ASR, TSE, and p-VAD tasks, we compare pre-trained speaker encoders that compute speaker embeddings from pre-recorded enrollment speech of the target speaker with ideal speaker embeddings derived directly from the target speaker's identity in the form of a one-hot vector.
Our analysis reveals that speaker verification performance is largely unrelated to TS task performance, that the one-hot vector outperforms enrollment-based embeddings, and that the optimal embedding depends on the input mixture.
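As a rough illustration of the two conditioning signals being compared, the sketch below contrasts an enrollment-based embedding (averaging a pre-trained encoder's outputs over enrollment utterances) with the ideal one-hot identity vector. All names, dimensions, and the stand-in encoder are hypothetical, not the paper's actual models:

```python
import numpy as np

rng = np.random.default_rng(0)

def enrollment_embedding(encoder, enrollment_utts):
    """Average a pre-trained encoder's outputs over enrollment utterances."""
    embs = np.stack([encoder(u) for u in enrollment_utts])
    e = embs.mean(axis=0)
    return e / np.linalg.norm(e)  # length-normalise, as is common practice

def one_hot_embedding(speaker_index, num_speakers):
    """Ideal identity conditioning: a one-hot vector over known speakers."""
    v = np.zeros(num_speakers)
    v[speaker_index] = 1.0
    return v

# Hypothetical stand-in encoder: waveform -> fixed-size embedding.
fake_encoder = lambda wav: rng.standard_normal(8)
enroll = [rng.standard_normal(16000) for _ in range(3)]  # 3 enrollment utterances
print(enrollment_embedding(fake_encoder, enroll))  # shape (8,)
print(one_hot_embedding(2, num_speakers=4))        # [0. 0. 1. 0.]
```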
arXiv Detail & Related papers (2024-10-15T03:58:13Z)
- WavLLM: Towards Robust and Adaptive Speech Large Language Model [93.0773293897888]
We introduce WavLLM, a robust and adaptive speech large language model with dual encoders, and a prompt-aware LoRA weight adapter.
We validate the proposed model on universal speech benchmarks including tasks such as ASR, ST, SV, ER, and also apply it to specialized datasets like Gaokao English listening comprehension set for SQA, and speech Chain-of-Thought (CoT) evaluation set.
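The summary names a prompt-aware LoRA weight adapter; the prompt-conditioning details are not given here, but the underlying LoRA idea, a frozen weight plus a trainable low-rank update, looks roughly like this NumPy sketch (shapes and hyperparameters illustrative only):

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, d_in, d_out, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base weight
        self.A = rng.standard_normal((r, d_in)) * 0.02      # trainable down-projection
        self.B = np.zeros((d_out, r))                       # trainable, zero-initialised
        self.scale = alpha / r

    def __call__(self, x):
        # Base path plus scaled low-rank path; with B = 0 the layer initially
        # behaves exactly like the frozen model.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(d_in=16, d_out=16)
print(layer(np.ones((2, 16))).shape)  # (2, 16)
```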
arXiv Detail & Related papers (2024-03-31T12:01:32Z)
- Summary of the DISPLACE Challenge 2023 -- DIarization of SPeaker and LAnguage in Conversational Environments [28.618333018398122]
In multi-lingual societies, where multiple languages are spoken within a small geographic vicinity, informal conversations often involve a mix of languages. Existing speech technologies may be inefficient at extracting information from such conversations, where the speech data is highly diverse, spanning multiple languages and speakers. The DISPLACE challenge constitutes an open call for evaluating and benchmarking speaker and language diarization technologies under this challenging condition.
arXiv Detail & Related papers (2023-11-21T12:23:58Z)
- Best of Both Worlds: Multi-task Audio-Visual Automatic Speech Recognition and Active Speaker Detection [9.914246432182873]
In noisy conditions, automatic speech recognition can benefit from the addition of visual signals coming from a video of the speaker's face.
Active speaker detection involves selecting at each moment in time which of the visible faces corresponds to the audio.
Recent work has shown that both problems can be solved simultaneously by employing an attention mechanism over the competing video tracks of the speakers' faces, though at some cost in active speaker detection accuracy. This work closes that gap by presenting a single model that can be jointly trained with a multi-task loss.
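A minimal sketch of the kind of attention described above: audio features as the query, one visual feature per candidate face track as keys and values, where the attention weights double as a soft active-speaker assignment. This is an illustrative reconstruction, not the paper's architecture:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend_over_tracks(audio_feat, track_feats):
    """Weight K candidate face tracks by dot-product attention against audio.

    audio_feat:  (D,)   audio frame feature (query)
    track_feats: (K, D) visual features, one per visible face track (keys/values)
    Returns the fused visual feature and the per-track weights, which can be
    read as a soft active-speaker assignment.
    """
    scores = track_feats @ audio_feat / np.sqrt(audio_feat.shape[0])
    weights = softmax(scores)
    return weights @ track_feats, weights

rng = np.random.default_rng(0)
fused, w = attend_over_tracks(rng.standard_normal(64), rng.standard_normal((3, 64)))
print(fused.shape, w.round(2))  # (64,) and a distribution over 3 faces
```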
arXiv Detail & Related papers (2022-05-10T23:03:19Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM builds on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
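For orientation, HuBERT-style pre-training (which WavLM inherits) optimises a masked-prediction objective: discrete pseudo-labels are predicted only at masked frames. A toy version of that loss, with illustrative shapes and random stand-in data:

```python
import numpy as np

def masked_prediction_loss(logits, targets, mask):
    """HuBERT-style objective: cross-entropy over masked frames only.

    logits:  (T, V) frame scores over V discrete pseudo-label units
    targets: (T,)   pseudo-labels, e.g. from k-means over acoustic features
    mask:    (T,)   True where the input frame was masked before encoding
    """
    z = logits - logits.max(axis=-1, keepdims=True)          # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))  # log-softmax
    nll = -logp[np.arange(len(targets)), targets]
    return nll[mask].mean()

rng = np.random.default_rng(0)
T, V = 10, 5
loss = masked_prediction_loss(rng.standard_normal((T, V)),
                              rng.integers(0, V, size=T),
                              rng.random(T) < 0.5)
print(float(loss))
```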
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- Accented Speech Recognition: A Survey [0.0]
We present a survey of current promising approaches to accented speech recognition.
The bias in ASR performance across accents comes at a cost to both users and providers of ASR.
arXiv Detail & Related papers (2021-04-21T20:21:06Z)
- Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech [54.75722224061665]
In this work, we investigate different speaker representations and propose to integrate pretrained and learnable speaker representations.
The FastSpeech 2 model combined with both pretrained and learnable speaker representations shows great generalization ability on few-shot speakers.
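One simple way to combine the two kinds of representation, a frozen pre-trained speaker vector plus a learnable per-speaker lookup projected to a common size and summed, is sketched below. The paper studies several variants; this particular combination and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

class SpeakerConditioner:
    """Combine a frozen pre-trained speaker vector with a learnable lookup.

    Illustrative only: the two representations are projected to a shared
    size and summed before conditioning the acoustic model.
    """
    def __init__(self, num_speakers, pretrained_dim, model_dim=16):
        self.table = rng.standard_normal((num_speakers, model_dim)) * 0.02   # learnable
        self.proj = rng.standard_normal((pretrained_dim, model_dim)) * 0.02  # learnable

    def __call__(self, speaker_id, pretrained_vec):
        return self.table[speaker_id] + pretrained_vec @ self.proj

cond = SpeakerConditioner(num_speakers=10, pretrained_dim=256)
s = cond(speaker_id=3, pretrained_vec=rng.standard_normal(256))
print(s.shape)  # (16,) -- added to FastSpeech 2 encoder states in practice
```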
arXiv Detail & Related papers (2021-03-06T10:14:33Z)
- VoxSRC 2020: The Second VoxCeleb Speaker Recognition Challenge [99.82500204110015]
We held the second installment of the VoxCeleb Speaker Recognition Challenge in conjunction with Interspeech 2020.
The goal of this challenge was to assess how well current speaker recognition technology is able to diarise and recognize speakers in unconstrained or 'in the wild' data.
This paper outlines the challenge, and describes the baselines, methods used, and results.
arXiv Detail & Related papers (2020-12-12T17:20:57Z)
- CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings [87.37967358673252]
We organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6).
The challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition.
This paper provides a baseline description of the CHiME-6 challenge for both segmented multispeaker speech recognition and unsegmented multispeaker speech recognition.
arXiv Detail & Related papers (2020-04-20T12:59:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.