Improving Multi-Scale Aggregation Using Feature Pyramid Module for
Robust Speaker Verification of Variable-Duration Utterances
- URL: http://arxiv.org/abs/2004.03194v4
- Date: Thu, 6 Aug 2020 06:09:29 GMT
- Title: Improving Multi-Scale Aggregation Using Feature Pyramid Module for
Robust Speaker Verification of Variable-Duration Utterances
- Authors: Youngmoon Jung, Seong Min Kye, Yeunju Choi, Myunghun Jung, Hoirin Kim
- Abstract summary: We propose a module that enhances speaker-discriminative information of features from multiple layers via a top-down pathway and lateral connections.
It achieves better performance than state-of-the-art approaches for both short and long utterances.
- Score: 15.887661651035712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Currently, the most widely used approach for speaker verification is the deep
speaker embedding learning. In this approach, we obtain a speaker embedding
vector by pooling single-scale features that are extracted from the last layer
of a speaker feature extractor. Multi-scale aggregation (MSA), which utilizes
multi-scale features from different layers of the feature extractor, has
recently been introduced and shows superior performance for variable-duration
utterances. To increase the robustness dealing with utterances of arbitrary
duration, this paper improves the MSA by using a feature pyramid module. The
module enhances speaker-discriminative information of features from multiple
layers via a top-down pathway and lateral connections. We extract speaker
embeddings using the enhanced features that contain rich speaker information
with different time scales. Experiments on the VoxCeleb dataset show that the
proposed module improves previous MSA methods with a smaller number of
parameters. It also achieves better performance than state-of-the-art
approaches for both short and long utterances.
Related papers
- Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification [14.58145497173618]
We present Layer Attentive Pooling (LAP), a novel strategy for aggregating inter-layer representations from pre-trained speech models for speaker verification.<n>LAP assesses the significance of each layer from multiple perspectives time-dynamically, and employs max pooling instead of averaging.
arXiv Detail & Related papers (2025-12-15T07:39:56Z) - Exploring Speaker Diarization with Mixture of Experts [39.02603646215667]
We propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture.<n>The proposed methods achieve state-of-the-art results, showcasing their effectiveness in challenging real-world scenarios.
arXiv Detail & Related papers (2025-06-17T17:42:54Z) - Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM [53.17360668423001]
Overlapping Speech Detection (OSD) aims to identify regions where multiple speakers overlap in a conversation.<n>This work proposes a speaker-aware progressive OSD model that leverages a progressive training strategy to enhance the correlation between subtasks.<n> Experimental results show that the proposed method achieves state-of-the-art performance, with an F1 score of 82.76% on the AMI test set.
arXiv Detail & Related papers (2025-05-29T07:47:48Z) - MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models [59.80042864360884]
Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately.
This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions.
arXiv Detail & Related papers (2024-11-27T09:01:08Z) - Echotune: A Modular Extractor Leveraging the Variable-Length Nature of Speech in ASR Tasks [4.132793413136553]
We introduce Echo-MSA, a nimble module equipped with a variable-length attention mechanism.
The proposed design captures the variable length feature of speech and addresses the limitations of fixed-length attention.
arXiv Detail & Related papers (2023-09-14T14:51:51Z) - Multistep feature aggregation framework for salient object detection [0.0]
We introduce a multistep feature aggregation framework for salient object detection.
It is composed of three modules, including the Diverse Reception (DR) module, multiscale interaction (MSI) module and Feature Enhancement (FE) module.
Experimental results on six benchmark datasets demonstrate that MSFA achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-11-12T16:13:16Z) - MAST: Multiscale Audio Spectrogram Transformers [53.06337011259031]
We present Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST)
In practice, MAST significantly outperforms AST by an average accuracy of 3.4% across 8 speech and non-speech tasks from the LAPE Benchmark.
arXiv Detail & Related papers (2022-11-02T23:34:12Z) - TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding [60.292702363839716]
Current SOTA backbone networks for speaker embedding are designed to aggregate multi-scale features from an utterance with multi-branch network architectures for speaker representation.
We propose an effective temporal multi-scale (TMS) model where multi-scale branches could be efficiently designed in a speaker embedding network almost without increasing computational costs.
arXiv Detail & Related papers (2022-03-17T05:49:35Z) - Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding [93.16866430882204]
In prior works, frame-level features from one layer are aggregated to form an utterance-level representation.
Inspired by the Transformer network, our proposed method utilizes the hierarchical architecture of stacked self-attention mechanisms.
With more layers stacked, the neural network can learn more discriminative speaker embeddings.
arXiv Detail & Related papers (2021-07-14T05:38:48Z) - MAAS: Multi-modal Assignation for Active Speaker Detection [59.08836580733918]
We present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem.
Our experiments show that, an small graph data structure built from a single frame, allows to approximate an instantaneous audio-visual assignment problem.
arXiv Detail & Related papers (2021-01-11T02:57:25Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z) - Self-Attentive Multi-Layer Aggregation with Feature Recalibration and
Normalization for End-to-End Speaker Verification System [8.942112181408158]
We propose a self-attentive multi-layer aggregation with feature recalibration and normalization for end-to-end speaker verification system.
Experimental results using the VoxCeleb1 evaluation dataset showed that the performance of the proposed methods was comparable to that of state-of-the-art models.
arXiv Detail & Related papers (2020-07-27T08:10:46Z) - Target-Speaker Voice Activity Detection: a Novel Approach for
Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts an activity of each speaker on each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
arXiv Detail & Related papers (2020-05-14T21:24:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.