VARAN: Variational Inference for Self-Supervised Speech Models Fine-Tuning on Downstream Tasks
- URL: http://arxiv.org/abs/2508.12061v1
- Date: Sat, 16 Aug 2025 14:26:59 GMT
- Title: VARAN: Variational Inference for Self-Supervised Speech Models Fine-Tuning on Downstream Tasks
- Authors: Daria Diatlova, Nikita Balagansky, Alexander Varlamov, Egor Spirin
- Abstract summary: We propose VARAN, a framework that dynamically tailors layer aggregation to individual inputs. VARAN adaptively prioritizes each layer's features based on the input.
- Score: 43.690582061831954
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conventional methods for aggregating layers in fine-tuned self-supervised speech models, such as using the final layer or a weighted sum, suffer from information bottlenecks and static feature weighting across all dataset examples. We propose VARAN, a framework that dynamically tailors layer aggregation to individual inputs. By employing layer-specialized probing heads and data-dependent weighting, VARAN adaptively prioritizes each layer's features based on the input. Evaluations on automatic speech recognition and speech emotion recognition tasks demonstrate VARAN's superior performance, particularly when using the LoRA fine-tuning technique. The framework resolves the trade-off between preserving layer-specific information and enabling flexible feature utilization, advancing efficient adaptation of self-supervised speech representations.
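The contrast between a static weighted sum and data-dependent layer aggregation can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the mean-pooled gating projection `gate_W` is an assumed parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def static_weighted_sum(layer_feats, layer_logits):
    """Conventional weighted sum: one weight per layer, shared by all inputs.

    layer_feats: (L, T, D) hidden states from L layers; layer_logits: (L,).
    """
    w = softmax(layer_logits)
    return np.tensordot(w, layer_feats, axes=1)  # (T, D)

def dynamic_weighted_sum(layer_feats, gate_W):
    """Data-dependent weighting: each layer is scored from its own
    mean-pooled features, so the mixture varies per input utterance.

    gate_W: (D,) gating projection (an assumed, simplified parameterization).
    """
    pooled = layer_feats.mean(axis=1)            # (L, D) one summary per layer
    w = softmax(pooled @ gate_W)                 # (L,) input-specific weights
    return np.tensordot(w, layer_feats, axes=1)  # (T, D)
```

The static variant applies the same mixture to every utterance; the dynamic variant recomputes the weights from the features of each input, which is the behavior the abstract describes.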
Related papers
- RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via Hierarchical Model Merging [33.22889542330089]
Internal representations in large language models (LLMs) serve as reliable proxies of learned knowledge. We propose RECALL, a representation-aware model merging framework for continual learning without access to historical data.
arXiv Detail & Related papers (2025-10-23T12:17:37Z)
- Self-supervised Latent Space Optimization with Nebula Variational Coding [87.20343320266215]
This paper proposes a variational inference model which leads to a clustered embedding. We introduce additional variables in the latent space, called nebula anchors, that guide the latent variables to form clusters during training. Since each latent feature can be labeled with the closest anchor, we also propose to apply metric learning in a self-supervised way to make the separation between clusters more explicit.
arXiv Detail & Related papers (2025-06-02T08:13:32Z)
- Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed based on static images.
We propose to understand human attributes using video frames that can fully use temporal information.
arXiv Detail & Related papers (2023-11-07T06:19:37Z)
- Unified Low-Resource Sequence Labeling by Sample-Aware Dynamic Sparse Finetuning [24.765911297156855]
FISH-DIP is a sample-aware dynamic sparse finetuning strategy that selectively focuses on a fraction of parameters.
We demonstrate that FISH-DIP can smoothly optimize the model in low-resource settings, offering up to 40% performance improvements.
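The idea of updating only a small, dynamically selected fraction of parameters can be sketched as below. The gradient-magnitude criterion and the `frac` threshold here are illustrative assumptions, not FISH-DIP's actual selection rule.

```python
import numpy as np

def sparse_update_mask(grads, frac=0.05):
    """Boolean mask keeping the top `frac` of parameters by |gradient|.

    Only masked-in parameters receive the optimizer update for this step;
    the rest of the pre-trained model stays frozen.
    """
    flat = np.abs(grads).ravel()
    k = max(1, int(frac * flat.size))
    thresh = np.partition(flat, -k)[-k]
    return np.abs(grads) >= thresh

def masked_sgd_step(params, grads, lr=1e-3, frac=0.05):
    # Zero out updates everywhere except the selected parameters.
    mask = sparse_update_mask(grads, frac)
    return params - lr * grads * mask
```

Because the mask is recomputed from the current gradients, the set of tuned parameters can differ from sample to sample, which is the "sample-aware" aspect the summary refers to.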
arXiv Detail & Related papers (2023-02-09T18:56:11Z)
- Trading Information between Latents in Hierarchical Variational Autoencoders [8.122270502556374]
Variational Autoencoders (VAEs) were originally motivated as probabilistic generative models in which one performs approximate Bayesian inference.
The proposal of β-VAEs breaks this interpretation and generalizes VAEs to application domains beyond generative modeling.
We identify a general class of inference models for which one can split the rate into contributions from each layer, which can then be tuned independently.
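In standard notation (assumed here, not taken from the paper), the β-VAE objective and the layer-wise rate split the summary describes read:

```latex
\mathcal{L}_\beta
  = \mathbb{E}_{q(z \mid x)}\!\left[-\log p(x \mid z)\right]
  + \beta \, \underbrace{D_{\mathrm{KL}}\!\left(q(z \mid x) \,\|\, p(z)\right)}_{\text{rate } R},
```

and for a hierarchy of latents $z = (z_1, \dots, z_L)$,

```latex
R = \sum_{l=1}^{L} R_l, \qquad
R_l = \mathbb{E}\!\left[ D_{\mathrm{KL}}\!\left(
        q(z_l \mid z_{<l}, x) \,\|\, p(z_l \mid z_{<l})
      \right) \right],
```

so that each layer's rate contribution $R_l$ can be weighted and tuned independently.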
arXiv Detail & Related papers (2023-02-09T18:56:11Z)
- Comparative layer-wise analysis of self-supervised speech models [29.258085176788097]
We measure acoustic, phonetic, and word-level properties encoded in individual layers, using a lightweight analysis tool based on canonical correlation analysis (CCA).
We find that these properties evolve across layers differently depending on the model, and the variations relate to the choice of pre-training objective.
We discover that CCA trends provide reliable guidance to choose layers of interest for downstream tasks and that single-layer performance often matches or improves upon using all layers, suggesting implications for more efficient use of pre-trained models.
arXiv Detail & Related papers (2022-11-08T00:59:05Z)
- Attention-based conditioning methods using variable frame rate for style-robust speaker verification [21.607777746331998]
We propose an approach to extract speaker embeddings robust to speaking style variations in text-independent speaker verification.
An entropy-based variable frame rate vector is proposed as an external conditioning vector for the self-attention layer.
arXiv Detail & Related papers (2022-06-28T01:14:09Z)
- Hierarchical Variational Memory for Few-shot Learning Across Domains [120.87679627651153]
We introduce a hierarchical prototype model, where each level of the prototype fetches corresponding information from the hierarchical memory.
The model is endowed with the ability to flexibly rely on features at different semantic levels if the domain shift circumstances so demand.
We conduct thorough ablation studies to demonstrate the effectiveness of each component in our model.
arXiv Detail & Related papers (2021-12-15T15:01:29Z)
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces during training, we are able to obtain a consistent latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations.
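Nearest-neighbour vector quantization of latent frames can be sketched as below; the Euclidean metric and codebook shape are generic choices, not the paper's exact configuration.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Replace each continuous latent frame with its nearest codebook entry.

    z: (T, D) latent vectors; codebook: (K, D) learned entries.
    Returns the quantized vectors and the chosen codebook indices.
    """
    # Squared Euclidean distance from every frame to every codebook entry.
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (T, K)
    idx = d.argmin(axis=1)
    return codebook[idx], idx
```

Because the output is restricted to K discrete entries, the quantized space discards speaker- and style-specific detail, which is one motivation for using it as a linguistic embedding.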
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
- Self-Attention Generative Adversarial Network for Speech Enhancement [37.14341228976058]
Existing generative adversarial networks (GANs) for speech enhancement solely rely on the convolution operation.
We propose a self-attention layer adapted from non-local attention, coupled with the convolutional and deconvolutional layers of a speech enhancement GAN.
Experiments show that introducing self-attention to SEGAN leads to consistent improvement across the objective evaluation metrics of enhancement performance.
arXiv Detail & Related papers (2020-10-18T22:59:07Z)
- Self-Supervised Tuning for Few-Shot Segmentation [82.32143982269892]
Few-shot segmentation aims at assigning a category label to each image pixel with few annotated samples.
Existing meta-learning methods tend to fail to generate category-specific discriminative descriptors when the visual features extracted from support images are marginalized in the embedding space.
This paper presents an adaptive tuning framework in which the distribution of latent features across different episodes is dynamically adjusted based on a self-segmentation scheme.
arXiv Detail & Related papers (2020-04-12T03:53:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.