Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification
- URL: http://arxiv.org/abs/2512.22148v1
- Date: Mon, 15 Dec 2025 07:39:56 GMT
- Title: Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification
- Authors: Jin Sob Kim, Hyun Joon Park, Wooseok Shin, Sung Won Han
- Abstract summary: We present Layer Attentive Pooling (LAP), a novel strategy for aggregating inter-layer representations from pre-trained speech models for speaker verification. LAP assesses the significance of each layer from multiple perspectives time-dynamically, and employs max pooling instead of averaging.
- Score: 14.58145497173618
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent speaker verification studies have achieved notable success by leveraging layer-wise output from pre-trained Transformer models. However, few have explored the advancements in aggregating these multi-level features beyond the static weighted average. We present Layer Attentive Pooling (LAP), a novel strategy for aggregating inter-layer representations from pre-trained speech models for speaker verification. LAP assesses the significance of each layer from multiple perspectives time-dynamically, and employs max pooling instead of averaging. Additionally, we propose a lightweight backend speaker model comprising LAP and Attentive Statistical Temporal Pooling (ASTP) to extract speaker embeddings from pre-trained model output. Experiments on the VoxCeleb benchmark reveal that our compact architecture achieves state-of-the-art performance while greatly reducing the training time. We further analyzed LAP design and its dynamic weighting mechanism for capturing speaker characteristics.
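The layer aggregation and temporal pooling described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the projection matrices `W` and `V` stand in for trained parameters, and all shapes (number of layers, frames, and feature dimensions) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: L pre-trained layers, T frames, D feature dims.
L, T, D = 4, 50, 16
layer_feats = rng.standard_normal((L, T, D))

# --- Layer-attentive pooling (sketch) ---
# Per-layer, per-frame significance scores from a projection;
# a random matrix stands in for trained parameters here.
W = rng.standard_normal((D, 1))
scores = layer_feats @ W                                   # (L, T, 1)
alpha = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)  # softmax over layers

# Max pooling across layers of the attention-weighted features,
# in place of a static weighted average.
pooled = (alpha * layer_feats).max(axis=0)                 # (T, D)

# --- Attentive statistical temporal pooling (sketch) ---
# Frame-level attention weights, then weighted mean and std over time.
V = rng.standard_normal((D, 1))
t_scores = pooled @ V                                      # (T, 1)
beta = np.exp(t_scores) / np.exp(t_scores).sum(axis=0, keepdims=True)
mu = (beta * pooled).sum(axis=0)                           # weighted mean, (D,)
var = (beta * pooled ** 2).sum(axis=0) - mu ** 2
sigma = np.sqrt(np.maximum(var, 1e-8))                     # weighted std, (D,)
embedding = np.concatenate([mu, sigma])                    # (2D,) speaker embedding
```

In a trained system, `alpha` and `beta` would be produced by learned attention modules rather than fixed random projections; the sketch only shows the data flow from layer-wise features to a fixed-size speaker embedding.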
Related papers
- Language Ranker: A Lightweight Ranking framework for LLM Decoding [70.01564145836129]
This paper conceptualizes the decoding process as analogous to the ranking stage in recommendation pipelines. Motivated by this insight, we propose Language Ranker, a novel framework that introduces a lightweight module to rerank candidate responses. Experiments show that Language Ranker achieves performance comparable to large-scale reward models, while requiring only 0.5M additional parameters.
arXiv Detail & Related papers (2025-10-23T17:56:46Z) - Learning Robust Spatial Representations from Binaural Audio through Feature Distillation [64.36563387033921]
We investigate the use of a pretraining stage based on feature distillation to learn a robust spatial representation of speech without the need for data labels. Our experiments demonstrate that the pretrained models show improved performance in noisy and reverberant environments.
arXiv Detail & Related papers (2025-08-28T15:43:15Z) - Lla-VAP: LSTM Ensemble of Llama and VAP for Turn-Taking Prediction [0.0]
This project expands on existing strategies for turn-taking prediction by employing a multi-modal ensemble approach. We aim to improve the accuracy and efficiency of identifying TRPs in both scripted and unscripted conversational scenarios.
arXiv Detail & Related papers (2024-12-24T00:20:38Z) - Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification [7.005068872406135]
Recent advancements in automatic speaker verification (ASV) studies have been achieved by leveraging large-scale pretrained networks.
We present a novel approach for exploiting the multilayered nature of pretrained models for ASV.
We show how the proposed interlayer processing aids in maximizing the advantage of utilizing pretrained models.
arXiv Detail & Related papers (2024-09-12T05:55:32Z) - Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z) - DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in EER metric on CN-Celeb evaluation set.
arXiv Detail & Related papers (2023-10-18T17:07:05Z) - Online Continual Learning in Keyword Spotting for Low-Resource Devices via Pooling High-Order Temporal Statistics [22.129910930772]
Keyword Spotting (KWS) models on embedded devices should adapt fast to new user-defined words without forgetting previous ones.
We consider the setup of embedded online continual learning (EOCL), where KWS models with frozen backbone are trained to incrementally recognize new words from a non-repeated stream of samples.
We propose Temporal Aware Pooling (TAP) which constructs an enriched feature space computing high-order moments of speech features extracted by a pre-trained backbone.
arXiv Detail & Related papers (2023-07-24T10:04:27Z) - Comparative layer-wise analysis of self-supervised speech models [29.258085176788097]
We measure acoustic, phonetic, and word-level properties encoded in individual layers, using a lightweight analysis tool based on canonical correlation analysis (CCA).
We find that these properties evolve across layers differently depending on the model, and the variations relate to the choice of pre-training objective.
We discover that CCA trends provide reliable guidance to choose layers of interest for downstream tasks and that single-layer performance often matches or improves upon using all layers, suggesting implications for more efficient use of pre-trained models.
arXiv Detail & Related papers (2022-11-08T00:59:05Z) - Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z) - Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts an activity of each speaker on each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
arXiv Detail & Related papers (2020-05-14T21:24:56Z) - Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances [15.887661651035712]
We propose a module that enhances speaker-discriminative information of features from multiple layers via a top-down pathway and lateral connections.
It achieves better performance than state-of-the-art approaches for both short and long utterances.
arXiv Detail & Related papers (2020-04-07T08:35:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.