Improved Relation Networks for End-to-End Speaker Verification and Identification
- URL: http://arxiv.org/abs/2203.17218v1
- Date: Thu, 31 Mar 2022 17:44:04 GMT
- Title: Improved Relation Networks for End-to-End Speaker Verification and Identification
- Authors: Ashutosh Chaubey, Sparsh Sinha, Susmita Ghose
- Abstract summary: Speaker identification systems are tasked to identify a speaker amongst a set of enrolled speakers given just a few samples.
We propose improved relation networks for speaker verification and few-shot (unseen) speaker identification.
Inspired by the use of prototypical networks in speaker verification, we train the model to classify samples in the current episode amongst all speakers present in the training set.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Speaker identification systems in a real-world scenario are tasked to
identify a speaker amongst a set of enrolled speakers given just a few samples
for each enrolled speaker. This paper demonstrates the effectiveness of
meta-learning and relation networks for this use case. We propose improved
relation networks for speaker verification and few-shot (unseen) speaker
identification. The use of relation networks facilitates joint training of the
frontend speaker encoder and the backend model. Inspired by the use of
prototypical networks in speaker verification and to increase the
discriminability of the speaker embeddings, we train the model to classify
samples in the current episode amongst all speakers present in the training
set. Furthermore, we propose a new training regime for faster model convergence
by extracting more information from a given meta-learning episode with
negligible extra computation. We evaluate the proposed techniques on VoxCeleb,
SITW and VCTK datasets on the tasks of speaker verification and unseen speaker
identification. The proposed approach consistently outperforms existing
approaches on both tasks.
Related papers
- Unsupervised Speaker Diarization in Distributed IoT Networks Using Federated Learning [2.3076690318595676]
This paper presents a computationally efficient and distributed speaker diarization framework for networked IoT-style audio devices.
A Federated Learning model can identify the participants in a conversation without the requirement of a large audio database for training.
An unsupervised online update mechanism is proposed for the Federated Learning model which depends on cosine similarity of speaker embeddings.
arXiv Detail & Related papers (2024-04-16T18:40:28Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization [7.673971221635779]
We propose methods to extract speaker-related information from semantic content in multi-party meetings.
Experiments on both AISHELL-4 and AliMeeting datasets show that our method achieves consistent improvements over acoustic-only speaker diarization systems.
arXiv Detail & Related papers (2023-05-22T11:14:19Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on LibriSpeechMix, a multi-talker dataset derived from Librispeech, and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z)
- Augmentation adversarial training for self-supervised speaker recognition [49.47756927090593]
We train robust speaker recognition models without speaker labels.
Experiments on VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision.
arXiv Detail & Related papers (2020-07-23T15:49:52Z)
- Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers [38.3469744871394]
We propose an end-to-end speaker-attributed automatic speech recognition model.
It unifies speaker counting, speech recognition, and speaker identification on overlapped speech.
arXiv Detail & Related papers (2020-06-19T02:05:18Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
- Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach to speaker diarization that leverages lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
arXiv Detail & Related papers (2020-04-13T17:16:56Z)
- Multi-task Learning for Speaker Verification and Voice Trigger Detection [18.51531434428444]
We investigate training a single network to perform both tasks jointly.
We present a large-scale empirical study where the model is trained using several thousand hours of labelled training data.
Results demonstrate that the network is able to encode both phonetic and speaker information in its learnt representations.
arXiv Detail & Related papers (2020-01-26T21:19:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.