CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for
Unsegmented Recordings
- URL: http://arxiv.org/abs/2004.09249v2
- Date: Sat, 2 May 2020 11:04:49 GMT
- Title: CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for
Unsegmented Recordings
- Authors: Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish
Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh
Raj, David Snyder, Aswin Shanmugam Subramanian, Jan Trmal, Bar Ben Yair,
Christoph Boeddeker, Zhaoheng Ni, Yusuke Fujita, Shota Horiguchi, Naoyuki
Kanda, Takuya Yoshioka, Neville Ryant
- Abstract summary: We organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6).
The challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition.
This paper provides a baseline description of the CHiME-6 challenge for both segmented multispeaker speech recognition and unsegmented multispeaker speech recognition.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges, we
organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6).
The new challenge revisits the previous CHiME-5 challenge and further considers
the problem of distant multi-microphone conversational speech diarization and
recognition in everyday home environments. Speech material is the same as the
previous CHiME-5 recordings except for accurate array synchronization. The
material was elicited using a dinner party scenario with efforts taken to
capture data that is representative of natural conversational speech. This
paper provides a baseline description of the CHiME-6 challenge for both
segmented multispeaker speech recognition (Track 1) and unsegmented
multispeaker speech recognition (Track 2). Of note, Track 2 is the first
challenge activity in the community to tackle an unsegmented multispeaker
speech recognition scenario with a complete set of reproducible open-source
baselines providing speech enhancement, speaker diarization, and speech
recognition modules.
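To make the Track 2 structure concrete, below is a minimal Python sketch of how the three baseline stages compose: enhancement, then diarization, then ASR on the diarized segments. The function bodies, the Segment dataclass, and all names are hypothetical stand-ins for illustration, not the interfaces of the actual Kaldi-based baseline recipes (which implement these stages with, e.g., array beamforming, x-vector diarization, and a lattice-based decoder).

```python
# Hypothetical sketch of the Track 2 baseline flow; trivial stand-ins only.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds

def enhance(multichannel_wav: List[List[float]]) -> List[float]:
    """Stand-in for array-based enhancement (e.g. beamforming)."""
    return multichannel_wav[0]  # trivially pick the first channel

def diarize(wav: List[float], sr: int = 16000) -> List[Segment]:
    """Stand-in for speaker diarization (who spoke when)."""
    return [Segment("spk1", 0.0, len(wav) / sr)]  # one dummy segment

def transcribe(wav: List[float], seg: Segment) -> str:
    """Stand-in for ASR on one diarized segment."""
    return "<hypothesis>"

def track2_pipeline(multichannel_wav: List[List[float]]) -> List[Tuple]:
    wav = enhance(multichannel_wav)
    return [(s.speaker, s.start, s.end, transcribe(wav, s))
            for s in diarize(wav)]

if __name__ == "__main__":
    two_mics = [[0.0] * 16000, [0.0] * 16000]  # 1 s of silence, 2 channels
    print(track2_pipeline(two_mics))
```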
Related papers
- Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System
We propose a pioneering approach to tackle joint multi-talker and target-talker speech recognition tasks.
Specifically, we freeze Whisper and plug a Sidecar separator into its encoder to separate the mixed embeddings for multiple talkers.
We deliver acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.
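As a rough illustration of the idea summarized above, the sketch below freezes a stand-in encoder and makes only a small "sidecar" separator trainable, mapping the mixed embedding to per-talker streams. The GRU encoder, head design, and shapes are assumptions for illustration; the actual work plugs its separator into Whisper's transformer encoder.

```python
import torch
import torch.nn as nn

class Sidecar(nn.Module):
    """Toy separator: maps one mixed embedding stream to per-talker streams.
    The design is an illustrative assumption, not the paper's exact module."""
    def __init__(self, dim: int, num_talkers: int = 2):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv1d(dim, dim, kernel_size=1),
            )
            for _ in range(num_talkers)
        ])

    def forward(self, mixed: torch.Tensor):  # mixed: (batch, time, dim)
        x = mixed.transpose(1, 2)            # (batch, dim, time) for Conv1d
        return [h(x).transpose(1, 2) for h in self.heads]

# Stand-in for a frozen Whisper encoder (the real one is a transformer).
encoder = nn.GRU(input_size=80, hidden_size=256, batch_first=True)
for p in encoder.parameters():
    p.requires_grad = False                  # frozen: only the Sidecar trains

sidecar = Sidecar(dim=256, num_talkers=2)
feats = torch.randn(1, 300, 80)              # (batch, frames, mel bins)
mixed_emb, _ = encoder(feats)
streams = sidecar(mixed_emb)                 # two (1, 300, 256) talker streams
print([s.shape for s in streams])
```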
arXiv Detail & Related papers (2024-07-13T09:28:24Z)
- Double Mixture: Towards Continual Event Detection from Speech
Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events.
This paper tackles two primary challenges in speech event detection: the continual integration of new events without forgetting previous ones, and the disentanglement of semantic from acoustic events.
We propose a novel method, 'Double Mixture,' which merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting.
arXiv Detail & Related papers (2024-04-20T06:32:00Z)
- Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition
We introduce speaker labels into an autoregressive transformer-based speech recognition model.
We then propose a novel speaker mask branch to detect the speech segments of individual speakers.
With the proposed model, we can perform both speech recognition and speaker diarization tasks simultaneously.
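For intuition, here is a toy serialization of a multi-speaker conversation into a single autoregressive target sequence carrying speaker label tokens; the token format is an assumption for illustration, not the paper's exact scheme, and the proposed speaker mask branch is not reproduced here.

```python
# Assumed serialization format for illustration only.
utterances = [
    ("spk1", "hello there"),
    ("spk2", "hi how are you"),
    ("spk1", "fine thanks"),
]

def serialize(utts):
    """Prefix each utterance with a speaker token so one autoregressive
    decoder emits both the transcript and the speaker attribution."""
    tokens = []
    for speaker, text in utts:
        tokens.append(f"<{speaker}>")  # speaker label token
        tokens.extend(text.split())    # word tokens
    return tokens + ["<eos>"]

print(serialize(utterances))
# ['<spk1>', 'hello', 'there', '<spk2>', 'hi', 'how', 'are', 'you', ...]
```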
arXiv Detail & Related papers (2023-12-18T06:29:53Z)
- VoxtLM: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks
We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation.
VoxtLM integrates text vocabulary with discrete speech tokens from self-supervised speech features and uses special tokens to enable multitask learning.
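A minimal sketch of the shared-vocabulary idea: text tokens, discrete speech tokens, and special task tokens live in one token space, so a single decoder-only LM can be steered between tasks by the leading token. The token names and task markers below are illustrative assumptions, not VoxtLM's actual inventory.

```python
# Illustrative joint token space; names are assumptions, not VoxtLM's.
TEXT_TOKENS = ["hello", "world"]                    # stand-in text vocab
SPEECH_TOKENS = [f"<unit_{i}>" for i in range(50)]  # discrete speech units
SPECIAL = ["<asr>", "<tts>", "<textlm>", "<speechlm>", "<eos>"]

vocab = {tok: i for i, tok in enumerate(SPECIAL + TEXT_TOKENS + SPEECH_TOKENS)}

def make_asr_example(speech_units, text):
    """Speech units in, text out; the leading token selects the task."""
    return ["<asr>"] + speech_units + text.split() + ["<eos>"]

seq = make_asr_example(["<unit_3>", "<unit_17>"], "hello world")
print([vocab[t] for t in seq])
```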
arXiv Detail & Related papers (2023-09-14T03:13:18Z)
- Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection
Active speaker detection in videos addresses associating a source face, visible in the video frames, with the underlying speech in the audio modality.
We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection.
arXiv Detail & Related papers (2022-12-01T14:46:00Z)
- MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition
We propose a novel multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR).
We employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data.
Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
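A hedged sketch of how such multi-task pre-training objectives are commonly combined: a weighted sum of per-task losses. The task names and unit weights below are illustrative assumptions, not MMSpeech's exact objectives.

```python
import torch

def multitask_loss(losses: dict, weights: dict) -> torch.Tensor:
    """Weighted sum over per-task losses."""
    return sum(weights[name] * losses[name] for name in losses)

losses = {  # stand-in scalars, one per pre-training task (names assumed)
    "phoneme_to_text": torch.tensor(1.2),
    "speech_to_code": torch.tensor(0.8),
    "masked_speech": torch.tensor(0.5),
    "text_mlm": torch.tensor(0.9),
    "supervised_asr": torch.tensor(1.1),
}
weights = {name: 1.0 for name in losses}
print(float(multitask_loss(losses, weights)))  # 4.5 with unit weights
```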
arXiv Detail & Related papers (2022-11-29T13:16:09Z)
- The Right to Talk: An Audio-Visual Transformer Approach
This work introduces a new Audio-Visual Transformer approach to the problem of localizing and highlighting the main speaker in both the audio and visual channels of a multi-speaker conversation video in the wild.
To the best of our knowledge, it is one of the first studies able to automatically localize and highlight the main speaker in both the visual and audio channels of multi-speaker conversation videos.
arXiv Detail & Related papers (2021-08-06T18:04:24Z)
- Streaming Multi-talker Speech Recognition with Joint Speaker Identification
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on the LibrispeechMix dataset, a multi-talker dataset derived from Librispeech, and present encouraging results.
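As a rough sketch of joint modeling with a shared backbone, the code below emits per-frame token logits and speaker logits from one encoder. It is simplified to an LSTM encoder with two linear heads; SURIT's actual RNN-T prediction and joint networks are omitted, and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class JointASRSpeakerID(nn.Module):
    """Shared streaming encoder with two heads: per-frame token logits for
    ASR and per-frame speaker logits for identification."""
    def __init__(self, feat_dim=80, hidden=256, vocab=100, num_speakers=8):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.asr_head = nn.Linear(hidden, vocab)
        self.spk_head = nn.Linear(hidden, num_speakers)

    def forward(self, feats):              # feats: (batch, time, feat_dim)
        enc, _ = self.encoder(feats)
        return self.asr_head(enc), self.spk_head(enc)

model = JointASRSpeakerID()
asr_logits, spk_logits = model(torch.randn(2, 100, 80))
print(asr_logits.shape, spk_logits.shape)  # (2, 100, 100) and (2, 100, 8)
```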
arXiv Detail & Related papers (2021-04-05T18:37:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.