NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant
Meeting Transcription
- URL: http://arxiv.org/abs/2401.08887v1
- Date: Tue, 16 Jan 2024 23:50:26 GMT
- Title: NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant
Meeting Transcription
- Authors: Alon Vinnikov, Amir Ivry, Aviv Hurvitz, Igor Abramovski, Sharon Koubi,
Ilya Gurvich, Shai Pe'er, Xiong Xiao, Benjamin Martinez Elizalde, Naoyuki
Kanda, Xiaofei Wang, Shalev Shaer, Stav Yagev, Yossi Asher, Sunit
Sivasankaran, Yifan Gong, Min Tang, Huaming Wang, Eyal Krupka
- Abstract summary: We introduce the first Natural Office Talkers in Settings of Far-field Audio Recordings ("NOTSOFAR-1") Challenge alongside datasets and baseline system.
The challenge focuses on distant speaker diarization and automatic speech recognition (DASR) in far-field meeting scenarios.
- Score: 21.236634241186458
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the first Natural Office Talkers in Settings of Far-field Audio
Recordings ("NOTSOFAR-1") Challenge alongside datasets and baseline system.
The challenge focuses on distant speaker diarization and automatic speech
recognition (DASR) in far-field meeting scenarios, with single-channel and
known-geometry multi-channel tracks, and serves as a launch platform for two
new datasets: First, a benchmarking dataset of 315 meetings, averaging 6
minutes each, capturing a broad spectrum of real-world acoustic conditions and
conversational dynamics. It is recorded across 30 conference rooms, featuring
4-8 attendees and a total of 35 unique speakers. Second, a 1000-hour simulated
training dataset, synthesized with enhanced authenticity for real-world
generalization, incorporating 15,000 real acoustic transfer functions. The
tasks focus on single-device DASR, where multi-channel devices always share the
same known geometry. This is aligned with common setups in actual conference
rooms, and avoids technical complexities associated with multi-device tasks. It
also allows for the development of geometry-specific solutions. The NOTSOFAR-1
Challenge aims to advance research in the field of distant conversational
speech recognition, providing key resources to unlock the potential of
data-driven methods, which we believe are currently constrained by the absence
of comprehensive high-quality training and benchmarking datasets.
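As a rough sanity check on the scale of the benchmarking dataset, the figures quoted in the abstract (315 meetings, averaging 6 minutes each) can be combined as below. This is a minimal sketch: the variable names are illustrative, and the result is only an approximation because the abstract gives an average duration, not exact per-meeting lengths.

```python
# Figures taken from the abstract; the total is an estimate, not an official number.
NUM_MEETINGS = 315            # benchmarking meetings
AVG_MEETING_MINUTES = 6       # average duration per meeting, as stated
SIMULATED_TRAINING_HOURS = 1000

# Approximate total duration of the benchmarking dataset.
benchmark_minutes = NUM_MEETINGS * AVG_MEETING_MINUTES
benchmark_hours = benchmark_minutes / 60

# How much larger the simulated training set is than the recorded benchmark.
train_to_benchmark_ratio = SIMULATED_TRAINING_HOURS / benchmark_hours

print(f"Benchmark: ~{benchmark_hours:.1f} hours across {NUM_MEETINGS} meetings")
print(f"Simulated training data is ~{train_to_benchmark_ratio:.0f}x larger")
```

Under these assumptions the recorded benchmark amounts to roughly 31.5 hours, which is small next to the 1000-hour simulated training set.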
Related papers
- LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization [31.01716151301142]
We present a large-scale far-field overlapping speech dataset to advance research in speech separation, recognition, and speaker diarization.
This dataset is a critical resource for decoding "Who said What and When" in multi-talker, reverberant environments.
arXiv Detail & Related papers (2024-09-01T19:23:08Z)
- The NeurIPS 2023 Machine Learning for Audio Workshop: Affective Audio Benchmarks and Novel Data [28.23517306589778]
The NeurIPS 2023 Machine Learning for Audio Workshop brings together machine learning (ML) experts from various audio domains.
There are several valuable audio-driven ML tasks, from speech emotion recognition to audio event detection, but the community is sparse compared to other ML areas.
High-quality data collection is time-consuming and costly, making it challenging for academic groups to apply their often state-of-the-art strategies to a larger, more generalizable dataset.
arXiv Detail & Related papers (2024-03-21T00:13:59Z)
- The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023 [67.11294606070278]
This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP-LiAuto (Team 237) in the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023.
In terms of data processing, we leverage the lip motion extractor from the baseline1 to produce multi-scale video data.
Various augmentation techniques are applied during training, encompassing speed perturbation, random rotation, horizontal flipping, and color transformation.
arXiv Detail & Related papers (2024-01-07T14:20:52Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios [61.74042680711718]
We introduce the CHiME-7 distant ASR (DASR) task, within the 7th CHiME challenge.
This task comprises joint ASR and diarization in far-field settings with multiple, and possibly heterogeneous, recording devices.
The goal is for participants to devise a single system that can generalize across different array geometries.
arXiv Detail & Related papers (2023-06-23T18:49:20Z)
- ConfLab: A Rich Multimodal Multisensor Dataset of Free-Standing Social Interactions In-the-Wild [10.686716372324096]
We describe an instantiation of a new concept for multimodal multisensor data collection of real life in-the-wild free standing social interactions.
ConfLab contains high fidelity data of 49 people during a real-life professional networking event.
arXiv Detail & Related papers (2022-05-10T21:30:10Z)
- Training speaker recognition systems with limited data [2.3148470932285665]
This work considers training neural networks for speaker recognition with a much smaller dataset size compared to contemporary work.
We artificially restrict the amount of data by proposing three subsets of the popular VoxCeleb2 dataset.
We show that the self-supervised, pre-trained weights of wav2vec2 substantially improve performance when training data is limited.
arXiv Detail & Related papers (2022-03-28T12:41:41Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone [43.77139614544301]
Transcribing meetings containing overlapped speech with only a single distant microphone (SDM) has been one of the most challenging problems for automatic speech recognition (ASR).
In this paper, we extensively investigate a two-step approach where we first pre-train a serialized output training (SOT)-based multi-talker ASR.
With fine-tuning on the 70 hours of the AMI-SDM training data, our SOT ASR model achieves a word error rate (WER) of 21.2% for the AMI-SDM evaluation set.
arXiv Detail & Related papers (2021-03-31T02:43:32Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)