Identifying Speaker Information in Feed-Forward Layers of Self-Supervised Speech Transformers
- URL: http://arxiv.org/abs/2506.21712v1
- Date: Thu, 26 Jun 2025 18:54:26 GMT
- Title: Identifying Speaker Information in Feed-Forward Layers of Self-Supervised Speech Transformers
- Authors: Tzu-Quan Lin, Hsi-Chun Cheng, Hung-yi Lee, Hao Tang,
- Abstract summary: We analyze neurons associated with k-means clusters of self-supervised features and i-vectors.<n>Our analysis reveals that these clusters correspond to broad phonetic and gender classes.<n>By protecting these neurons during pruning, we can significantly preserve performance on speaker-related task.
- Score: 50.9040167152168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, the impact of self-supervised speech Transformers has extended to speaker-related applications. However, little research has explored how these models encode speaker information. In this work, we address this gap by identifying neurons in the feed-forward layers that are correlated with speaker information. Specifically, we analyze neurons associated with k-means clusters of self-supervised features and i-vectors. Our analysis reveals that these clusters correspond to broad phonetic and gender classes, making them suitable for identifying neurons that represent speakers. By protecting these neurons during pruning, we can significantly preserve performance on speaker-related task, demonstrating their crucial role in encoding speaker information.
Related papers
- Improving Speaker Diarization using Semantic Information: Joint Pairwise
Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z) - Residual Information in Deep Speaker Embedding Architectures [4.619541348328938]
This paper introduces an analysis over six sets of speaker embeddings extracted with some of the most recent and high-performing DNN architectures.
The dataset includes 46 speakers uttering the same set of prompts, recorded in either a professional studio or their home environments.
The results show that the discriminative power of the analyzed embeddings is very high, yet across all the analyzed architectures, residual information is still present in the representations.
arXiv Detail & Related papers (2023-02-06T12:37:57Z) - Leveraging Speaker Embeddings with Adversarial Multi-task Learning for
Age Group Classification [0.0]
We consider the use of speaker-discriminative embeddings derived from adversarial multi-task learning to align features and reduce the domain discrepancy in age subgroups.
Experimental results on the VoxCeleb Enrichment dataset verify the effectiveness of our proposed adaptive adversarial network in multi-objective scenarios.
arXiv Detail & Related papers (2023-01-22T05:01:13Z) - Self-supervised Speaker Diarization [19.111219197011355]
This study proposes an entirely unsupervised deep-learning model for speaker diarization.
Speaker embeddings are represented by an encoder trained in a self-supervised fashion using pairs of adjacent segments assumed to be of the same speaker.
arXiv Detail & Related papers (2022-04-08T16:27:14Z) - What do End-to-End Speech Models Learn about Speaker, Language and
Channel Information? A Layer-wise and Neuron-level Analysis [16.850888973106706]
We conduct a post-hoc functional interpretability analysis of pretrained speech models using the probing framework.
We analyze utterance-level representations of speech models trained for various tasks such as speaker recognition and dialect identification.
Our results reveal several novel findings, including: i) channel and gender information are distributed across the network, ii) the information is redundantly available in neurons with respect to a task, and iv) complex properties such as dialectal information are encoded only in the task-oriented pretrained network.
arXiv Detail & Related papers (2021-07-01T13:32:55Z) - A Review of Speaker Diarization: Recent Advances with Deep Learning [78.20151731627958]
Speaker diarization is a task to label audio or video recordings with classes corresponding to speaker identity.
With the rise of deep learning technology, more rapid advancements have been made for speaker diarization.
We discuss how speaker diarization systems have been integrated with speech recognition applications.
arXiv Detail & Related papers (2021-01-24T01:28:05Z) - Augmentation adversarial training for self-supervised speaker
recognition [49.47756927090593]
We train robust speaker recognition models without speaker labels.
Experiments on VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision.
arXiv Detail & Related papers (2020-07-23T15:49:52Z) - Active Speakers in Context [88.22935329360618]
Current methods for active speak er detection focus on modeling short-term audiovisual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already benefits the active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z) - Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.