Segment Aggregation for short utterances speaker verification using raw
waveforms
- URL: http://arxiv.org/abs/2005.03329v3
- Date: Tue, 4 Aug 2020 05:40:15 GMT
- Title: Segment Aggregation for short utterances speaker verification using raw
waveforms
- Authors: Seung-bin Kim, Jee-weon Jung, Hye-jin Shim, Ju-ho Kim and Ha-Jin Yu
- Abstract summary: We propose a method that compensates for the performance degradation of speaker verification for short utterances.
The proposed method adopts an ensemble-based design to improve the stability and accuracy of speaker verification systems.
- Score: 47.41124427552161
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most studies on speaker verification systems focus on long-duration
utterances, which contain sufficient phonetic information. However, the
performance of these systems is known to degrade when short-duration
utterances are input, because they carry less phonetic information than
long utterances. In this paper, we propose a method that compensates for
the performance degradation of speaker verification on short utterances,
referred to as "segment aggregation". The proposed method adopts an
ensemble-based design to improve the stability and accuracy of speaker
verification systems. It segments an input utterance into
several short utterances and then aggregates the segment embeddings extracted
from the segmented inputs to compose a speaker embedding. The segment
embeddings and the aggregated speaker embedding are trained simultaneously.
In addition, we modify the teacher-student learning method to fit the
proposed framework. Experimental results on different input durations using
the VoxCeleb1 test set demonstrate that the proposed technique improves speaker
verification performance by a relative 45.37% over the baseline
system under the 1-second test utterance condition.
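As a rough illustration of the segment-and-aggregate idea described in the abstract, the sketch below slices a raw waveform into overlapping fixed-length segments, extracts one embedding per segment, and averages the segment embeddings into a single speaker embedding. The `toy_embed` extractor and the mean aggregation are stand-in assumptions for illustration only; the paper's actual system uses a trained raw-waveform encoder and jointly trains the segment and aggregated embeddings.

```python
import numpy as np

def segment_waveform(wave, seg_len, hop):
    """Slice a 1-D waveform into overlapping fixed-length segments."""
    starts = range(0, max(len(wave) - seg_len, 0) + 1, hop)
    return np.stack([wave[s:s + seg_len] for s in starts])

def aggregate_speaker_embedding(wave, embed_fn, seg_len, hop):
    """Extract one embedding per segment, then average them into a
    single speaker embedding (mean aggregation is an assumption here)."""
    segments = segment_waveform(wave, seg_len, hop)
    seg_embs = np.stack([embed_fn(seg) for seg in segments])
    return seg_embs.mean(axis=0), seg_embs

def toy_embed(segment):
    """Stand-in embedding extractor: maps a segment to simple
    statistics. A real system would use a trained neural encoder."""
    return np.array([segment.mean(), segment.std()])

# Dummy 3-second waveform at 16 kHz; 1-second segments, 0.5-second hop.
wave = np.arange(16000 * 3, dtype=np.float32)
spk_emb, seg_embs = aggregate_speaker_embedding(
    wave, toy_embed, seg_len=16000, hop=8000)
# seg_embs holds one embedding per segment; spk_emb is their mean.
```

At test time the same aggregation compensates for short inputs: even a 1-second utterance yields several (heavily overlapped) segment embeddings whose aggregate is more stable than a single-pass embedding.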
Related papers
- Leveraging Speaker Embeddings in End-to-End Neural Diarization for Two-Speaker Scenarios [0.9094127664014627]
End-to-end neural speaker diarization systems are able to address the speaker diarization task while effectively handling speech overlap.
This work explores the incorporation of speaker information embeddings into the end-to-end systems to enhance the speaker discriminative capabilities.
arXiv Detail & Related papers (2024-07-01T14:26:28Z)
- Unsupervised Speaker Diarization in Distributed IoT Networks Using Federated Learning [2.3076690318595676]
This paper presents a computationally efficient and distributed speaker diarization framework for networked IoT-style audio devices.
A Federated Learning model can identify the participants in a conversation without the requirement of a large audio database for training.
An unsupervised online update mechanism is proposed for the Federated Learning model which depends on cosine similarity of speaker embeddings.
arXiv Detail & Related papers (2024-04-16T18:40:28Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization [7.673971221635779]
We propose methods to extract speaker-related information from semantic content in multi-party meetings.
Experiments on both AISHELL-4 and AliMeeting datasets show that our method achieves consistent improvements over acoustic-only speaker diarization systems.
arXiv Detail & Related papers (2023-05-22T11:14:19Z)
- In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting embedding extractors (EEs) for speaker diarisation.
First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.
We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance.
Second, embedding extractors are typically not exposed to overlapped speech or speaker changes during training; we propose two data augmentation techniques to alleviate this problem.
arXiv Detail & Related papers (2022-10-26T13:00:29Z)
- End-to-End Speaker Diarization as Post-Processing [64.12519350944572]
Clustering-based diarization methods partition frames into as many clusters as there are speakers.
Some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification.
We propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method.
arXiv Detail & Related papers (2020-12-18T05:31:07Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
- Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
arXiv Detail & Related papers (2020-04-13T17:16:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.