Graph Attention Networks for Speaker Verification
- URL: http://arxiv.org/abs/2010.11543v2
- Date: Mon, 8 Feb 2021 08:12:17 GMT
- Title: Graph Attention Networks for Speaker Verification
- Authors: Jee-weon Jung, Hee-Soo Heo, Ha-Jin Yu, Joon Son Chung
- Abstract summary: This work presents a novel back-end framework for speaker verification using graph attention networks.
We first construct a graph using segment-wise speaker embeddings and then input these to graph attention networks.
After a few graph attention layers with residual connections, each node is projected into a one-dimensional space using an affine transform.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work presents a novel back-end framework for speaker verification using
graph attention networks. Segment-wise speaker embeddings extracted from
multiple crops within an utterance are interpreted as node representations of a
graph. The proposed framework inputs segment-wise speaker embeddings from an
enrollment and a test utterance and directly outputs a similarity score. We
first construct a graph using segment-wise speaker embeddings and then input
these to graph attention networks. After a few graph attention layers with
residual connections, each node is projected into a one-dimensional space using
an affine transform, followed by a readout operation resulting in a scalar
similarity score. To enable successful adaptation for speaker verification, we
propose techniques such as separating trainable weights for attention map
calculations between segment-wise speaker embeddings from different utterances.
The effectiveness of the proposed framework is validated using three different
speaker embedding extractors trained with different architectures and objective
functions. Experimental results demonstrate consistent improvement over various
baseline back-end classifiers, with an average equal error rate improvement of
20% over the cosine similarity back-end without test time augmentation.
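The scoring pipeline described in the abstract can be sketched in NumPy. This is an illustrative sketch, not the authors' implementation: the layer sizes, activations, mean-pool readout, and the single shared attention vector are assumptions (the paper in fact separates trainable attention weights between embeddings from different utterances), and all parameters here are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def graph_attention_layer(H, W, a):
    """One graph-attention layer over a fully connected graph (illustrative).

    H: (N, d) node features; W: (d, d) shared projection; a: (2*d,) attention
    vector. Returns updated node features with a residual connection, mirroring
    the paper's "graph attention layers with residual connections".
    """
    Z = H @ W                                        # project node features
    N = Z.shape[0]
    # Pairwise attention logits e_ij = LeakyReLU(a^T [z_i || z_j])
    logits = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            e = a @ np.concatenate([Z[i], Z[j]])
            logits[i, j] = e if e > 0 else 0.2 * e   # LeakyReLU
    logits -= logits.max(axis=1, keepdims=True)      # row-wise softmax
    alpha = np.exp(logits)
    alpha /= alpha.sum(axis=1, keepdims=True)
    return H + np.tanh(alpha @ Z)                    # residual connection

def score(enroll_emb, test_emb, W, a, w_out):
    """Scalar similarity: nodes are segment-wise embeddings from the enrollment
    and test utterance; after attention, each node is projected to 1-D
    (affine transform, bias omitted for brevity), then mean-pooled (readout)."""
    H = np.vstack([enroll_emb, test_emb])
    H = graph_attention_layer(H, W, a)
    node_scores = H @ w_out                          # projection to 1-D
    return node_scores.mean()                        # readout -> scalar score

d = 8
W = rng.standard_normal((d, d)) * 0.1
a = rng.standard_normal(2 * d) * 0.1
w_out = rng.standard_normal(d) * 0.1
enroll = rng.standard_normal((4, d))   # 4 crops from the enrollment utterance
test = rng.standard_normal((3, d))     # 3 crops from the test utterance
s = score(enroll, test, W, a, w_out)
print(float(s))
```

Because the graph is fully connected and the readout is a mean over nodes, the score is invariant to the order in which segment embeddings are fed in.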
Related papers
- SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding [56.079013202051094]
We present SegVG, a novel method that transfers box-level annotations into signals providing additional pixel-level supervision for Visual Grounding.
This approach allows us to iteratively exploit the annotation as signals for both box-level regression and pixel-level segmentation.
arXiv Detail & Related papers (2024-07-03T15:30:45Z)
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Learning Spatial-Temporal Graphs for Active Speaker Detection [26.45877018368872]
SPELL is a framework that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data.
We first construct a graph from a video so that each node corresponds to one person.
We demonstrate that learning graph-based representation, owing to its explicit spatial and temporal structure, significantly improves the overall performance.
arXiv Detail & Related papers (2021-12-02T18:29:07Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- Joint Graph Learning and Matching for Semantic Feature Correspondence [69.71998282148762]
We propose a joint graph learning and matching network, named GLAM, to explore reliable graph structures for boosting graph matching.
The proposed method is evaluated on three popular visual matching benchmarks (Pascal VOC, Willow Object and SPair-71k)
It outperforms previous state-of-the-art graph matching methods by significant margins on all benchmarks.
arXiv Detail & Related papers (2021-09-01T08:24:02Z)
- Graph-based Label Propagation for Semi-Supervised Speaker Identification [10.87690067963342]
We propose a graph-based semi-supervised learning approach for speaker identification in the household scenario.
We show that this approach makes effective use of unlabeled data and improves speaker identification accuracy compared to two state-of-the-art scoring methods.
arXiv Detail & Related papers (2021-06-15T15:10:33Z)
- Speaker attribution with voice profiles by graph-based semi-supervised learning [29.042995008709916]
We propose to solve the speaker attribution problem by using graph-based semi-supervised learning methods.
A graph of speech segments is built for each session, on which segments from voice profiles are represented by labeled nodes and segments from test utterances are unlabeled nodes.
Speaker attribution then becomes a semi-supervised learning problem on graphs, on which two graph-based methods are applied: label propagation (LP) and graph neural networks (GNNs).
arXiv Detail & Related papers (2021-02-06T18:35:56Z)
- Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification [37.33388614967888]
A hierarchical attention network is proposed to solve a weakly labelled speaker identification problem.
The use of a hierarchical structure, consisting of a frame-level encoder and a segment-level encoder, aims to learn speaker related information locally and globally.
arXiv Detail & Related papers (2020-05-15T22:57:53Z)
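The label propagation used by the graph-based semi-supervised entries above (voice-profile segments as labeled nodes, test-utterance segments as unlabeled nodes) can be sketched as follows. This is a toy illustration: the affinity matrix, iteration count, and hard-clamping scheme are assumptions, not the papers' implementations.

```python
import numpy as np

def label_propagation(W, labels, n_iter=50):
    """Toy label propagation on a segment graph (illustrative sketch).

    W: (N, N) nonnegative affinity matrix between speech segments.
    labels: length-N sequence; a speaker id for labeled nodes (voice-profile
    segments), -1 for unlabeled nodes (test-utterance segments).
    Returns a predicted speaker id per node.
    """
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()) - {-1})
    N = len(labels)
    # One-hot label matrix; unlabeled rows start at zero
    Y = np.zeros((N, len(classes)))
    for i, y in enumerate(labels):
        if y != -1:
            Y[i, classes.index(y)] = 1.0
    P = W / W.sum(axis=1, keepdims=True)     # row-normalized transitions
    F = Y.copy()
    for _ in range(n_iter):
        F = P @ F                            # propagate labels to neighbours
        F[labels != -1] = Y[labels != -1]    # clamp labeled nodes
    return np.array([classes[k] for k in F.argmax(axis=1)])

# Four segments: node 0 is a profile segment of speaker 0, node 3 of speaker 1;
# nodes 1 and 2 are unlabeled test segments, strongly connected to node 0.
W = np.array([[1.0, 0.9, 0.8, 0.1],
              [0.9, 1.0, 0.7, 0.1],
              [0.8, 0.7, 1.0, 0.1],
              [0.1, 0.1, 0.1, 1.0]])
pred = label_propagation(W, [0, -1, -1, 1])
print(pred)  # nodes 1 and 2 inherit speaker 0 through the graph
```

GNN-based variants replace the fixed propagation matrix with learned message-passing layers over the same labeled/unlabeled node structure.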
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.