The Newsbridge - Telecom SudParis VoxCeleb Speaker Recognition Challenge 2022 System Description
- URL: http://arxiv.org/abs/2301.07491v1
- Date: Tue, 17 Jan 2023 15:52:39 GMT
- Title: The Newsbridge - Telecom SudParis VoxCeleb Speaker Recognition Challenge 2022 System Description
- Authors: Yannis Tevissen (ARMEDIA-SAMOVAR), Jérôme Boudy (ARMEDIA-SAMOVAR), Frédéric Petitpont
- Abstract summary: We describe the system used by our team for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC 2022) in the speaker diarization track.
Our solution was designed around a new combination of voice activity detection algorithms that uses the strengths of several systems.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe the system used by our team for the VoxCeleb Speaker Recognition
Challenge 2022 (VoxSRC 2022) in the speaker diarization track. Our solution was
designed around a new combination of voice activity detection algorithms that
uses the strengths of several systems. We introduce a novel multi-stream
approach with a decision protocol based on classifier entropy. We called this
method multi-stream voice activity detection and used it with standard
baseline diarization embeddings, clustering and resegmentation. With this work,
we successfully demonstrated that, starting from a strong baseline and working
only on voice activity detection, one can achieve results close to the state
of the art.
Related papers
- Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024 [8.940008511570207]
The rapid advancement of generative AI models presents significant challenges for detecting AI-generated deepfake singing voices. The Singing Voice Deepfake Detection (SVDD) Challenge 2024 aims to address this complex task. This work details our approach to achieving a leading system with a 1.79% pooled equal error rate (EER).
arXiv Detail & Related papers (2024-09-03T21:28:45Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Multi-microphone Automatic Speech Segmentation in Meetings Based on Circular Harmonics Features [0.0]
We propose a new set of spatial features based on direction-of-arrival estimations in the circular harmonic domain (CH-DOA).
Experiments on the AMI meeting corpus show that CH-DOA can improve the segmentation while being robust in the case of deactivated microphones.
arXiv Detail & Related papers (2023-06-07T09:09:00Z)
- Speaker Recognition in Realistic Scenario Using Multimodal Data [4.373374186532439]
We propose a two-branch network to learn joint representations of faces and voices in a multimodal system.
We evaluate our proposed framework on a large-scale audio-visual dataset named VoxCeleb1.
arXiv Detail & Related papers (2023-02-25T09:11:09Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on the LibriSpeechMix dataset -- a multi-talker dataset derived from LibriSpeech -- and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z)
- The HUAWEI Speaker Diarisation System for the VoxCeleb Speaker Diarisation Challenge [6.6238321827660345]
This paper describes the system setup of our submission to the speaker diarisation track (Track 4) of the VoxCeleb Speaker Recognition Challenge 2020.
Our diarisation system uses a well-trained neural-network-based speech enhancement model as a pre-processing front-end for the input speech signals.
arXiv Detail & Related papers (2020-10-22T12:42:07Z)
- The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS [66.06385966689965]
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
We consider a naive approach for voice conversion (VC), which is to first transcribe the input speech with an automatic speech recognition (ASR) model.
We revisit this method under a sequence-to-sequence (seq2seq) framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit.
arXiv Detail & Related papers (2020-10-06T02:27:38Z)
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts the activity of each speaker on each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
arXiv Detail & Related papers (2020-05-14T21:24:56Z)
- Voice Separation with an Unknown Number of Multiple Speakers [113.91855071999298]
We present a new method for separating a mixed audio sequence, in which multiple voices speak simultaneously.
The new method employs gated neural networks that are trained to separate the voices over multiple processing steps, while keeping the speaker in each output channel fixed.
arXiv Detail & Related papers (2020-02-29T20:02:54Z)