MultiSV: Dataset for Far-Field Multi-Channel Speaker Verification
- URL: http://arxiv.org/abs/2111.06458v1
- Date: Thu, 11 Nov 2021 20:55:58 GMT
- Title: MultiSV: Dataset for Far-Field Multi-Channel Speaker Verification
- Authors: Ladislav Mošner, Oldřich Plchot, Lukáš Burget, Jan Černocký
- Abstract summary: We present a comprehensive corpus designed for training and evaluating text-independent multi-channel speaker verification systems.
It can also be readily used for experiments with dereverberation, denoising, and speech enhancement.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by the unconsolidated data situation and the lack of a
standard benchmark in the field, we complement our previous efforts and present
a comprehensive corpus designed for training and evaluating text-independent
multi-channel speaker verification systems. It can also be readily used for
experiments with dereverberation, denoising, and speech enhancement. We tackled
the ever-present problem of the lack of multi-channel training data by
utilizing data simulation on top of clean parts of the VoxCeleb dataset. The
development and evaluation trials are based on a retransmitted Voices Obscured
in Complex Environmental Settings (VOiCES) corpus, which we modified to provide
multi-channel trials. We publish full recipes that create the dataset from
public sources as the MultiSV corpus, and we provide results with two of our
multi-channel speaker verification systems featuring neural network-based
beamforming, based either on predicting ideal binary masks or on the more
recent Conv-TasNet.
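To make the mask-driven beamforming mentioned in the abstract concrete, below is a minimal sketch of the general pattern such systems follow: a network-predicted time-frequency mask weights the spatial covariance estimates, and an MVDR beamformer built from them enhances the signal. This is an illustrative sketch under assumed STFT shapes; the function names and diagonal-loading constants are hypothetical, not the actual MultiSV recipe.

```python
import numpy as np

def spatial_covariance(X, mask):
    """Mask-weighted spatial covariance per frequency bin.

    X:    (C, F, T) complex multi-channel STFT
    mask: (F, T) values in [0, 1], e.g. a predicted ideal-binary-mask estimate
    Returns (F, C, C) covariance matrices.
    """
    Xw = X * mask[None]                                   # weight TF bins; broadcast over channels
    norm = np.maximum(mask.sum(-1), 1e-8)[:, None, None]  # per-bin mask mass
    return np.einsum("cft,dft->fcd", Xw, X.conj()) / norm

def mvdr_weights(R_speech, R_noise, ref_ch=0):
    """MVDR weights per frequency bin (reference-channel formulation)."""
    n_freq, C, _ = R_speech.shape
    w = np.zeros((n_freq, C), dtype=complex)
    for f in range(n_freq):
        # R_noise^{-1} R_speech with diagonal loading for numerical stability
        num = np.linalg.solve(R_noise[f] + 1e-6 * np.eye(C), R_speech[f])
        w[f] = num[:, ref_ch] / (np.trace(num) + 1e-8)
    return w

def apply_beamformer(w, X):
    """y[f, t] = w[f]^H x[f, t] -> single-channel enhanced STFT."""
    return np.einsum("fc,cft->ft", w.conj(), X)
```

In practice the speech covariance would use the predicted mask and the noise covariance its complement, e.g. mvdr_weights(spatial_covariance(X, m), spatial_covariance(X, 1 - m)).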
Related papers
- LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization [31.01716151301142]
We present a large-scale far-field overlapping speech dataset to advance research in speech separation, recognition, and speaker diarization.
This dataset is a critical resource for decoding "Who said What and When" in multi-talker, reverberant environments.
arXiv Detail & Related papers (2024-09-01T19:23:08Z)
- A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
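As a toy illustration of the unification step described in the entry above, the sketch below folds two hypothetical feedback schemas (numeric ratings and direct pairwise preferences) into one (prompt, chosen, rejected) format usable by SFT or RLHF-style pipelines; all field names are assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str

def unify(record: dict) -> PreferencePair:
    """Map two hypothetical feedback schemas onto one preference format."""
    if "rating_a" in record:
        # numeric ratings of two candidate responses imply a preference
        a_wins = record["rating_a"] >= record["rating_b"]
        return PreferencePair(
            prompt=record["prompt"],
            chosen=record["response_a"] if a_wins else record["response_b"],
            rejected=record["response_b"] if a_wins else record["response_a"],
        )
    # already a direct pairwise preference annotation
    return PreferencePair(record["prompt"], record["preferred"], record["other"])
```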
- BASEN: Time-Domain Brain-Assisted Speech Enhancement Network with Convolutional Cross Attention in Multi-talker Conditions [36.15815562576836]
In time-domain single-channel speech enhancement (SE), extracting the target speaker without prior information remains challenging under multi-talker conditions.
We propose a novel time-domain brain-assisted SE network (BASEN) incorporating electroencephalography (EEG) signals recorded from the listener for extracting the target speaker from monaural speech mixtures.
arXiv Detail & Related papers (2023-05-17T06:40:31Z)
- MedleyVox: An Evaluation Dataset for Multiple Singing Voices Separation [10.456845656569444]
Separating a mixture of multiple singing voices into individual voices is rarely studied in music source separation research.
We introduce MedleyVox, an evaluation dataset for multiple singing voices separation.
We present a strategy for constructing multiple-singing mixtures from various single-singing datasets.
arXiv Detail & Related papers (2022-11-14T12:27:35Z)
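A toy version of such a mixture-construction strategy, overlaying RMS-normalized single-singer recordings at randomly jittered relative levels, might look like the sketch below; the normalization and gain range are illustrative assumptions, not MedleyVox's actual procedure.

```python
import numpy as np

def mix_singers(tracks, level_jitter_db=3.0, rng=None):
    """Mix single-singer waveforms into one multi-voice mixture.

    tracks: list of 1-D float arrays at the same sample rate, one per singer
    Returns (mixture, aligned_sources), trimmed to the shortest track.
    """
    rng = rng or np.random.default_rng()
    n = min(len(t) for t in tracks)
    srcs = []
    for t in tracks:
        t = t[:n] / (np.sqrt(np.mean(t[:n] ** 2)) + 1e-8)  # RMS-normalize
        gain_db = rng.uniform(-level_jitter_db, level_jitter_db)
        srcs.append(t * 10 ** (gain_db / 20))              # random relative level
    mix = np.sum(srcs, axis=0)
    peak = max(np.max(np.abs(mix)), 1.0)                   # avoid clipping above full scale
    return mix / peak, [s / peak for s in srcs]
```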
- End-to-End Multi-speaker ASR with Independent Vector Analysis [80.83577165608607]
We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition.
We propose an approach to joint source separation and dereverberation based on the independent vector analysis (IVA) paradigm.
arXiv Detail & Related papers (2022-04-01T05:45:33Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- PILOT: Introducing Transformers for Probabilistic Sound Event Localization [107.78964411642401]
This paper introduces a novel transformer-based sound event localization framework, where temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms.
The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy.
arXiv Detail & Related papers (2021-06-07T18:29:19Z)
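To make the self-attention idea in the PILOT entry concrete, here is a small PyTorch sketch that applies a transformer encoder over frame-level multi-channel spectral features and predicts per-frame event activity plus a direction vector per event; the dimensions, head counts, and output heads are placeholder assumptions, not PILOT's actual configuration.

```python
import torch
import torch.nn as nn

class TinySELD(nn.Module):
    """Frame-wise sound event detection + DOA regression via self-attention."""

    def __init__(self, in_dim=4 * 64, d_model=128, n_heads=4, n_layers=2, n_events=12):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)        # flattened channels x mel bins
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.activity = nn.Linear(d_model, n_events)  # event activity logits
        self.doa = nn.Linear(d_model, 3 * n_events)   # xyz direction per event

    def forward(self, feats):                         # feats: (B, T, C*F)
        h = self.encoder(self.proj(feats))            # temporal self-attention
        return self.activity(h), self.doa(h).view(*h.shape[:2], -1, 3)
```

A feature tensor of shape (batch, frames, channels x mel bins), e.g. torch.randn(2, 100, 256), yields per-frame activity logits and per-event direction vectors.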
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant, cross-entropy-based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
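The permutation-invariant loss mentioned in the entry above can be illustrated with a minimal PyTorch sketch that scores every output-slot-to-speaker assignment and keeps the cheapest one; the fixed number of padded slots and exhaustive permutation search are simplifying assumptions, not the paper's exact formulation.

```python
import itertools
import torch
import torch.nn.functional as F

def pit_cross_entropy(logits, targets):
    """Permutation-invariant cross-entropy over output slots.

    logits:  (B, S, T, K) frame-wise class logits for S output slots
    targets: (B, S, T)    integer labels per reference speaker (padded to S)
    Returns the mean loss under the best slot permutation per batch item.
    """
    B, S, T, K = logits.shape
    losses = []
    for b in range(B):
        best = None
        for perm in itertools.permutations(range(S)):
            # assign output slot perm[s] to reference speaker s
            loss = sum(
                F.cross_entropy(logits[b, p].reshape(-1, K), targets[b, s].reshape(-1))
                for s, p in enumerate(perm)
            ) / S
            best = loss if best is None else torch.minimum(best, loss)
        losses.append(best)
    return torch.stack(losses).mean()
```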
- Exploring Deep Learning for Joint Audio-Visual Lip Biometrics [54.32039064193566]
Audio-visual (AV) lip biometrics is a promising authentication technique that leverages the benefits of both the audio and visual modalities in speech communication.
The lack of a sizeable AV database hinders the exploration of deep-learning-based audio-visual lip biometrics.
We establish the DeepLip AV lip biometrics system realized with a convolutional neural network (CNN) based video module, a time-delay neural network (TDNN) based audio module, and a multimodal fusion module.
arXiv Detail & Related papers (2021-04-17T10:51:55Z)
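As a sketch of the kind of multimodal fusion module the DeepLip entry describes, the toy PyTorch code below concatenates utterance-level audio and video embeddings and projects them into a joint speaker embedding; the dimensions and the simple concatenation strategy are assumptions for illustration, not DeepLip's actual design.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Fuse audio (e.g., TDNN) and video (e.g., CNN) embeddings into one vector."""

    def __init__(self, a_dim=512, v_dim=512, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(a_dim + v_dim, emb_dim),
            nn.ReLU(),
            nn.Linear(emb_dim, emb_dim),
        )

    def forward(self, a_emb, v_emb):
        # L2-normalize each modality so neither dominates the concatenation
        a = nn.functional.normalize(a_emb, dim=-1)
        v = nn.functional.normalize(v_emb, dim=-1)
        return self.net(torch.cat([a, v], dim=-1))  # joint speaker embedding
```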