High-resolution embedding extractor for speaker diarisation
- URL: http://arxiv.org/abs/2211.04060v1
- Date: Tue, 8 Nov 2022 07:41:18 GMT
- Title: High-resolution embedding extractor for speaker diarisation
- Authors: Hee-Soo Heo, Youngki Kwon, Bong-Jin Lee, You Jin Kim, Jee-weon Jung
- Abstract summary: This study proposes a novel embedding extractor architecture, referred to as a high-resolution embedding extractor (HEE).
HEE consists of a feature-map extractor and an enhancer, where the enhancer with the self-attention mechanism is the key to success.
Through experiments on five evaluation sets, including four public datasets, the proposed HEE demonstrates at least a 10% improvement on all but one evaluation set.
- Score: 15.392429990363492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker embedding extractors significantly influence the performance of
clustering-based speaker diarisation systems. Conventionally, only one
embedding is extracted from each speech segment. However, because of the
sliding window approach, a segment easily includes two or more speakers owing
to speaker change points. This study proposes a novel embedding extractor
architecture, referred to as a high-resolution embedding extractor (HEE), which
extracts multiple high-resolution embeddings from each speech segment. HEE
consists of a feature-map extractor and an enhancer, where the enhancer with
the self-attention mechanism is the key to success. The enhancer of HEE
replaces the aggregation process; instead of a global pooling layer, the
enhancer enriches each frame with relevant information via attention that leverages
the global context. Extracted dense frame-level embeddings can each represent a
speaker. Thus, multiple speakers can be represented by different frame-level
features in each segment. We also propose a training framework that
artificially generates mixture data to train the proposed HEE. Through
experiments on five evaluation sets, including four public datasets, the
proposed HEE demonstrates at least a 10% improvement on each evaluation set,
except for one dataset, which our analysis attributes to a scarcity of rapid
speaker changes.
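The abstract specifies only the two components and the role of self-attention, so the PyTorch sketch below is a minimal illustration under assumed layer sizes; FeatureMapExtractor, Enhancer, and the per-frame projection head are illustrative names, not the authors' implementation.

```python
# Minimal sketch of the HEE idea: keep frame-level resolution and let a
# self-attention "enhancer" replace global pooling. Layer sizes and names
# are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class FeatureMapExtractor(nn.Module):
    """Stand-in trunk: 1-D convolutions over log-mel features -> frame maps."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, x):            # x: (batch, n_mels, frames)
        return self.net(x)           # (batch, d_model, frames)

class Enhancer(nn.Module):
    """Self-attention over frames: each frame gathers global context
    instead of being collapsed by a pooling layer."""
    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):            # x: (batch, frames, d_model)
        return self.encoder(x)

class HEE(nn.Module):
    def __init__(self, n_mels=80, d_model=256, emb_dim=192):
        super().__init__()
        self.trunk = FeatureMapExtractor(n_mels, d_model)
        self.enhancer = Enhancer(d_model)
        self.proj = nn.Linear(d_model, emb_dim)  # per-frame embedding head

    def forward(self, mel):          # mel: (batch, n_mels, frames)
        h = self.trunk(mel).transpose(1, 2)      # (batch, frames, d_model)
        h = self.enhancer(h)
        return self.proj(h)          # one embedding per frame

emb = HEE()(torch.randn(2, 80, 200))
print(emb.shape)                     # torch.Size([2, 200, 192])
```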
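The abstract mentions the mixture-data training framework without giving details; the snippet below is one plausible simulation, with the change point, frame hop, and the make_mixture helper all assumed for illustration rather than taken from the paper.

```python
# Hedged sketch of mixture-data generation for training a frame-level
# extractor: splice two single-speaker clips so a segment contains a
# speaker change. Offsets and label rates are assumed, not the paper's.
import numpy as np

def make_mixture(wav_a, wav_b, sr=16000, change_at=1.0, frame_hop=0.01):
    """Concatenate speaker A then speaker B at `change_at` seconds and
    return (waveform, per-frame speaker labels)."""
    n_change = int(change_at * sr)
    mix = np.concatenate([wav_a[:n_change], wav_b[: len(wav_a) - n_change]])
    n_frames = int(len(mix) / (frame_hop * sr))
    labels = np.zeros(n_frames, dtype=np.int64)        # 0 = speaker A
    labels[int(change_at / frame_hop):] = 1            # 1 = speaker B
    return mix, labels

sr = 16000
a, b = np.random.randn(2 * sr), np.random.randn(2 * sr)  # stand-in audio
mix, labels = make_mixture(a, b, sr)
print(mix.shape, labels.shape, labels[95:105])  # change near frame 100
```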
Related papers
- Overlap-aware End-to-End Supervised Hierarchical Graph Clustering for Speaker Diarization [41.24045486520547]
We propose an end-to-end supervised hierarchical clustering algorithm based on graph neural networks (GNNs).
The proposed E-SHARC framework improves significantly over state-of-the-art diarization systems.
arXiv Detail & Related papers (2024-01-23T15:35:44Z)
- Generation of Speaker Representations Using Heterogeneous Training Batch Assembly [16.534380339042087]
We propose a new CNN-based speaker modeling scheme.
We randomly and synthetically augment the training data into a set of segments.
A soft label is imposed on each segment based on its speaker occupation ratio.
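A minimal reading of that soft-label rule follows; the soft_label helper and its inputs are assumptions for illustration, not the paper's code.

```python
# Assumed reading of the soft-label rule: the target for a segment is the
# fraction of its frames occupied by each speaker.
import numpy as np

def soft_label(frame_speakers, n_speakers):
    """frame_speakers: per-frame speaker indices inside one segment."""
    counts = np.bincount(frame_speakers, minlength=n_speakers)
    return counts / counts.sum()     # occupation ratio per speaker

print(soft_label(np.array([0, 0, 0, 1, 1]), n_speakers=2))  # [0.6 0.4]
```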
arXiv Detail & Related papers (2022-03-30T19:59:05Z)
- Multi-scale Speaker Diarization with Dynamic Scale Weighting [14.473173007997751]
We propose a more advanced multi-scale diarization system based on a multi-scale diarization decoder.
Our proposed system achieves state-of-the-art performance on the CALLHOME and AMI MixHeadset datasets, with 3.92% and 1.05% diarization error rates, respectively.
arXiv Detail & Related papers (2022-03-30T01:26:31Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- End-to-End Speaker Diarization as Post-Processing [64.12519350944572]
Clustering-based diarization methods partition frames into as many clusters as there are speakers.
Some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification.
We propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method.
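As a rough illustration of that pipeline, the sketch below refines a first-pass clustering by running a two-speaker model over each speaker pair; eend_two_speaker is a placeholder callable and the aggregation rule is an assumption, not the paper's exact procedure.

```python
# Sketch of the described pipeline: a clustering-based first pass, then a
# two-speaker end-to-end model re-decodes each pair of speakers to recover
# overlap. `eend_two_speaker` is a placeholder, not a real API.
from itertools import combinations
import numpy as np

def refine_with_eend(features, cluster_labels, eend_two_speaker):
    n_frames = len(cluster_labels)
    speakers = np.unique(cluster_labels)
    activity = np.zeros((n_frames, len(speakers)))
    for s1, s2 in combinations(range(len(speakers)), 2):
        # frames the first pass assigned to either speaker of the pair
        mask = np.isin(cluster_labels, [speakers[s1], speakers[s2]])
        # (frames_in_pair, 2) posteriors from the two-speaker model
        post = eend_two_speaker(features[mask])
        activity[mask, s1] = np.maximum(activity[mask, s1], post[:, 0])
        activity[mask, s2] = np.maximum(activity[mask, s2], post[:, 1])
    return activity > 0.5            # overlap-aware multi-label decision
```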
arXiv Detail & Related papers (2020-12-18T05:31:07Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
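A compact MoCo-style version of such a momentum contrastive objective might look as follows; the momentum value, temperature, and function names are assumptions rather than the paper's recipe.

```python
# MoCo-style momentum update and InfoNCE loss for speaker embeddings.
# Encoder architecture, momentum, and temperature are assumed values.
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(q_enc, k_enc, m=0.999):
    for pq, pk in zip(q_enc.parameters(), k_enc.parameters()):
        pk.data.mul_(m).add_(pq.data, alpha=1 - m)

def info_nce(q, k, queue, t=0.07):
    """q, k: (batch, dim) L2-normalised embeddings of two augmented views;
    queue: (queue_size, dim) negatives from past batches."""
    pos = (q * k).sum(dim=1, keepdim=True)           # (batch, 1)
    neg = q @ queue.t()                              # (batch, queue_size)
    logits = torch.cat([pos, neg], dim=1) / t
    targets = torch.zeros(len(q), dtype=torch.long)  # positives at index 0
    return F.cross_entropy(logits, targets)
```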
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
- Single channel voice separation for unknown number of speakers under reverberant and noisy settings [106.48335929548875]
We present a unified network for voice separation of an unknown number of speakers.
The proposed approach is composed of several separation heads optimized together with a speaker classification branch.
We present a new noisy and reverberant dataset of up to five different speakers speaking simultaneously.
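One plausible wiring of several separation heads trained jointly with a classification branch is sketched below; the trunk, head shapes, and the MultiHeadSeparator name are assumptions, since the entry gives no architectural details.

```python
# Assumed wiring: shared trunk, one separation head per speaker count,
# and a classification branch that picks the head at inference.
import torch
import torch.nn as nn

class MultiHeadSeparator(nn.Module):
    def __init__(self, dim=128, max_speakers=5):
        super().__init__()
        self.trunk = nn.LSTM(dim, dim, num_layers=2, batch_first=True)
        # head for k speakers emits k mask vectors per frame (k = 2..max)
        self.heads = nn.ModuleList(
            nn.Linear(dim, dim * k) for k in range(2, max_speakers + 1))
        self.classifier = nn.Linear(dim, max_speakers - 1)  # predicts count

    def forward(self, x):            # x: (batch, frames, dim)
        h, _ = self.trunk(x)
        count_logits = self.classifier(h.mean(dim=1))
        outputs = [head(h) for head in self.heads]  # heads train jointly
        return outputs, count_logits
```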
arXiv Detail & Related papers (2020-11-04T14:59:14Z)
- FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance.
This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information.
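The fragment-fusing step can be pictured as cross-attention from source content features to target-speaker frames; the dimensions below are assumptions and the snippet is not FragmentVC's actual decoder.

```python
# Minimal cross-attention in the spirit of the description: source content
# features query target-speaker frames so fine-grained fragments can be
# fused. Dimensions are assumptions.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
src_content = torch.randn(1, 120, 256)   # e.g. Wav2Vec 2.0 features (source)
tgt_frames = torch.randn(1, 300, 256)    # target-speaker utterance features
fused, weights = attn(query=src_content, key=tgt_frames, value=tgt_frames)
print(fused.shape)                        # torch.Size([1, 120, 256])
```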
arXiv Detail & Related papers (2020-10-27T09:21:03Z)
- Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification [37.33388614967888]
A hierarchical attention network is proposed to solve a weakly labelled speaker identification problem.
The use of a hierarchical structure, consisting of a frame-level encoder and a segment-level encoder, aims to learn speaker related information locally and globally.
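A minimal sketch of that two-level structure, with attentive pooling at each level, could look as follows; all layer choices are assumptions.

```python
# Sketch of the two-level structure: a frame-level encoder with attentive
# pooling inside each segment, then a segment-level encoder over the
# pooled vectors. All layer choices are assumptions.
import torch
import torch.nn as nn

class AttentivePool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                    # (batch, steps, dim)
        w = torch.softmax(self.score(x), dim=1)
        return (w * x).sum(dim=1)            # (batch, dim)

class HierarchicalAttention(nn.Module):
    def __init__(self, dim=128, n_speakers=100):
        super().__init__()
        self.frame_enc = nn.GRU(dim, dim, batch_first=True)
        self.frame_pool = AttentivePool(dim)
        self.seg_enc = nn.GRU(dim, dim, batch_first=True)
        self.seg_pool = AttentivePool(dim)
        self.out = nn.Linear(dim, n_speakers)

    def forward(self, x):    # x: (batch, segments, frames, dim)
        b, s, f, d = x.shape
        h, _ = self.frame_enc(x.reshape(b * s, f, d))
        seg = self.frame_pool(h).reshape(b, s, d)   # one vector per segment
        h2, _ = self.seg_enc(seg)
        return self.out(self.seg_pool(h2))          # utterance-level logits
```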
arXiv Detail & Related papers (2020-05-15T22:57:53Z)
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts the activity of each speaker at each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
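A stripped-down sketch of that idea follows: a recurrent network conditioned on a target-speaker embedding, emitting per-frame activity for each speaker. The sizes and the per-speaker looping are simplifying assumptions; the original conditions on i-vectors and combines speakers jointly.

```python
# Sketch of the TS-VAD idea: condition a frame-level network on an
# embedding for each target speaker and emit one activity stream per
# speaker. Layer sizes are assumptions.
import torch
import torch.nn as nn

class TSVAD(nn.Module):
    def __init__(self, feat_dim=40, spk_dim=100, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim + spk_dim, hidden, batch_first=True,
                           bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, feats, spk_embs):
        # feats: (batch, frames, feat_dim); spk_embs: (batch, n_spk, spk_dim)
        b, t, _ = feats.shape
        acts = []
        for s in range(spk_embs.shape[1]):           # one pass per speaker
            e = spk_embs[:, s].unsqueeze(1).expand(b, t, -1)
            h, _ = self.rnn(torch.cat([feats, e], dim=-1))
            acts.append(torch.sigmoid(self.out(h)))  # (batch, frames, 1)
        return torch.cat(acts, dim=-1)               # (batch, frames, n_spk)

act = TSVAD()(torch.randn(2, 200, 40), torch.randn(2, 4, 100))
print(act.shape)                                     # torch.Size([2, 200, 4])
```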
arXiv Detail & Related papers (2020-05-14T21:24:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.