AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching
- URL: http://arxiv.org/abs/2603.01006v1
- Date: Sun, 01 Mar 2026 09:16:46 GMT
- Title: AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching
- Authors: Pengfei Zhang, Tianxin Xie, Minghao Yang, Li Liu
- Abstract summary: We introduce AG-REPA, a novel causal layer selection strategy for representation alignment in audio Flow Matching. We find that the layers that best store semantic/acoustic information are not necessarily the layers that contribute most to the velocity field that drives generation. To turn this insight into actionable training guidance, we propose a forward-only gate ablation (FoG-A) that quantifies each layer's causal contribution.
- Score: 14.922065513695294
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: REPresentation Alignment (REPA) improves the training of generative flow models by aligning intermediate hidden states with pretrained teacher features, but its effectiveness in token-conditioned audio Flow Matching depends critically on the choice of supervised layers, which is typically made heuristically based on depth. In this work, we introduce Attribution-Guided REPresentation Alignment (AG-REPA), a novel causal layer selection strategy for representation alignment in audio Flow Matching. First, we find that the layers that best store semantic/acoustic information (high teacher-space similarity) are not necessarily the layers that contribute most to the velocity field that drives generation, a phenomenon we call Store-Contribute Dissociation (SCD). To turn this insight into actionable training guidance, we propose a forward-only gate ablation (FoG-A) that quantifies each layer's causal contribution via the induced change in the predicted velocity field, enabling sparse layer selection and adaptive weighting for alignment. Across unified speech and general-audio training (LibriSpeech + AudioSet) under different token-conditioning topologies, AG-REPA consistently outperforms REPA baselines. Overall, our results show that alignment is most effective when applied to the causally dominant layers that drive the velocity field, rather than to layers that are representationally rich but functionally passive.
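The FoG-A idea lends itself to a short sketch. Below is a minimal, hypothetical PyTorch illustration, not the paper's exact procedure: the model signature `model(x_t, t, cond)`, residual-style blocks, and the gate value are all assumptions. Each block's residual contribution is damped in a forward pass, and the layer is scored by the relative change this induces in the predicted velocity field; no gradients are needed, which is what makes the probe forward-only.

```python
# Hypothetical sketch of forward-only gate ablation (FoG-A); the gating and
# normalization details are assumptions, not the paper's specification.
import torch

@torch.no_grad()
def fog_a_scores(model, blocks, x_t, t, cond, gate=0.0):
    """Score each block by how much damping its residual branch changes
    the predicted velocity field, using forward passes only."""
    v_full = model(x_t, t, cond)                      # reference velocity
    scores = []
    for block in blocks:
        handle = block.register_forward_hook(
            # scale the block's residual update: out = in + g * (out - in)
            lambda mod, inp, out, g=gate: inp[0] + g * (out - inp[0])
        )
        v_gated = model(x_t, t, cond)                 # ablated velocity
        handle.remove()
        scores.append((v_full - v_gated).norm() / v_full.norm())
    return torch.stack(scores)                        # higher = more causal
```

Under this reading, the highest-scoring layers would receive alignment supervision, with the scores reused as adaptive per-layer weights.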
Related papers
- General and Efficient Steering of Unconditional Diffusion [25.225845714398364]
We present a recipe for efficiently steering unconditional diffusion without gradient guidance during inference. Our approach is built on two observations about diffusion model structure. Experiments on CIFAR-10, ImageNet, and CelebA demonstrate improved accuracy/quality over gradient-based guidance.
arXiv Detail & Related papers (2026-02-11T21:58:26Z) - Representation-Regularized Convolutional Audio Transformer for Audio Understanding [53.092757178419355]
Bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. We propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges.
arXiv Detail & Related papers (2026-01-29T12:16:19Z) - CORD: Bridging the Audio-Text Reasoning Gap via Weighted On-policy Cross-modal Distillation [32.72685791637924]
We propose CORD, a unified alignment framework that performs online cross-modal self-distillation. Specifically, it aligns audio-conditioned reasoning with its text-conditioned counterpart within a unified model. Empirical results across multiple benchmarks demonstrate that CORD consistently enhances audio-conditioned reasoning.
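At the loss level, such on-policy cross-modal self-distillation could look like the sketch below; the token weighting scheme and the detached text-conditioned teacher branch are assumptions, and CORD's actual objective may differ.

```python
# Minimal sketch: token-weighted KL from the text-conditioned branch
# (teacher, detached) to the audio-conditioned branch of the same model.
import torch.nn.functional as F

def cord_loss(logits_audio, logits_text, token_weights):
    """logits_*: (B, T, V); token_weights: (B, T)."""
    kl = F.kl_div(
        F.log_softmax(logits_audio, dim=-1),          # student log-probs
        F.log_softmax(logits_text.detach(), dim=-1),  # teacher log-probs
        reduction="none", log_target=True,
    ).sum(-1)                                         # per-token KL, (B, T)
    return (token_weights * kl).mean()
```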
arXiv Detail & Related papers (2026-01-23T08:31:24Z) - Distilling to Hybrid Attention Models via KL-Guided Layer Selection [66.06591032073744]
This paper describes a simple and efficient recipe for layer selection that uses layer importance scores derived from a small amount of training on generic text data. We find that this approach is more effective than existing approaches for layer selection, including approaches that uniformly interleave linear attention layers at a fixed ratio.
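A hedged sketch of this kind of importance scoring follows; the `swap_to_linear` context manager and the `.logits` interface are hypothetical stand-ins. Each layer is scored by the KL divergence induced when that single attention layer is linearized.

```python
# Sketch: KL(full model || model with layer i linearized) on generic text.
# High KL means the layer is important to keep as softmax attention.
import torch
import torch.nn.functional as F

@torch.no_grad()
def layer_importance(model, batch, swap_to_linear, n_layers):
    ref = F.log_softmax(model(batch).logits, dim=-1)   # full-model log-probs
    scores = []
    for i in range(n_layers):
        with swap_to_linear(model, i):                 # hypothetical helper
            alt = F.log_softmax(model(batch).logits, dim=-1)
        scores.append(
            F.kl_div(alt, ref, reduction="batchmean", log_target=True)
        )
    return torch.stack(scores)
```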
arXiv Detail & Related papers (2025-12-23T18:12:22Z) - What matters for Representation Alignment: Global Information or Spatial Structure? [64.67092609921816]
Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its global semantic information or its spatial structure? We replace the standard projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation.
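A minimal sketch of that architectural change, assuming 2-D feature maps and a 3x3 kernel (both assumptions), might look like:

```python
# Sketch: convolutional projection preserves spatial layout, and the teacher
# features are normalized per spatial location before a cosine-style loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialREPAHead(nn.Module):
    def __init__(self, dit_dim, teacher_dim):
        super().__init__()
        self.proj = nn.Conv2d(dit_dim, teacher_dim, kernel_size=3, padding=1)

    def forward(self, h, teacher):               # both (B, C, H, W)
        z = F.normalize(self.proj(h), dim=1)     # per-location unit vectors
        teacher = F.normalize(teacher, dim=1)    # spatial normalization
        return 1 - (z * teacher).sum(1).mean()   # mean negative cosine
```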
arXiv Detail & Related papers (2025-12-11T16:39:53Z) - Hierarchical Alignment: Surgical Fine-Tuning via Functional Layer Specialization in Large Language Models [4.935224714809964]
We introduce Hierarchical Alignment, a novel method that applies targeted DPO to distinct functional blocks of a model's layers. Specifically, aligning the local layers (Local-Align) enhances grammatical fluency. Aligning the global layers (Global-Align) improves factual consistency as hypothesized, but also proves to be the most effective strategy for enhancing logical coherence.
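In the simplest reading, the "surgical" part reduces to restricting which layers the DPO optimizer may update; a sketch follows, with the block boundaries as assumptions.

```python
# Sketch: enable gradients only for layers[lo:hi]; a standard DPO training
# loop then updates just that functional block.
def freeze_outside_block(model, layers, lo, hi):
    for p in model.parameters():
        p.requires_grad = False
    for layer in layers[lo:hi]:
        for p in layer.parameters():
            p.requires_grad = True
```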
arXiv Detail & Related papers (2025-10-14T00:58:34Z) - Imitate Optimal Policy: Prevail and Induce Action Collapse in Policy Gradient [61.440209025381016]
Policy gradient reinforcement learning methods frequently utilize deep neural networks (DNNs) to learn a shared backbone of feature representations used to compute likelihoods in an action selection layer. We show that under certain constraints, a structure resembling neural collapse, which we refer to as Action Collapse (AC), emerges. We propose the Action Collapse Policy Gradient (ACPG) method, which accordingly affixes a synthetic ETF as the action selection layer.
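The simplex equiangular tight frame (ETF) from the neural-collapse literature can be constructed directly. The sketch below builds one and affixes it as a frozen action-selection head; pairing it with a specific policy-gradient update is omitted, and the construction is the standard one rather than necessarily the paper's.

```python
# Sketch: fixed simplex-ETF head whose columns are equiangular unit vectors.
import torch
import torch.nn as nn

def simplex_etf(dim, n_actions):
    assert dim >= n_actions
    u, _ = torch.linalg.qr(torch.randn(dim, n_actions))   # orthonormal basis
    center = torch.eye(n_actions) - torch.ones(n_actions, n_actions) / n_actions
    return u @ center * (n_actions / (n_actions - 1)) ** 0.5  # (dim, C)

class ETFPolicyHead(nn.Module):
    def __init__(self, dim, n_actions):
        super().__init__()
        self.register_buffer("w", simplex_etf(dim, n_actions))  # frozen head

    def forward(self, features):              # (B, dim) -> action logits
        return features @ self.w
```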
arXiv Detail & Related papers (2025-09-02T18:33:11Z) - Dynamic Context-oriented Decomposition for Task-aware Low-rank Adaptation with Less Forgetting and Faster Convergence [131.41894248194995]
We propose context-oriented decomposition adaptation (CorDA), a novel method that initializes adapters in a task-aware manner. Thanks to this task awareness, our method enables two optional adaptation modes: knowledge-preserved mode (KPM) and instruction-previewed mode (IPM).
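A deliberately simplified sketch of task-aware, covariance-weighted adapter initialization is given below; the paper's exact decomposition and its KPM/IPM modes are not reproduced, so treat this as an illustration of the general idea only.

```python
# Sketch: weight the SVD of a linear layer by task-activation statistics,
# then seed a low-rank adapter from the leading components.
import torch

def context_oriented_init(weight, task_inputs, rank):
    """weight: (out, in); task_inputs: (N, in) activations on task data."""
    cov = task_inputs.T @ task_inputs / task_inputs.shape[0]   # (in, in)
    u, s, vh = torch.linalg.svd(weight @ cov, full_matrices=False)
    a = u[:, :rank] * s[:rank].sqrt()                 # (out, rank)
    b = vh[:rank, :] * s[:rank].sqrt().unsqueeze(1)   # (rank, in)
    return a, b                                       # adapter factors
```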
arXiv Detail & Related papers (2025-06-16T07:55:14Z) - EmoSphere-SER: Enhancing Speech Emotion Recognition Through Spherical Representation with Auxiliary Classification [49.128847336227636]
We propose EmoSphere-SER, a joint model that integrates spherical VAD region classification to guide VAD regression. In our framework, VAD values are transformed into spherical coordinates that are divided into multiple spherical regions, and an auxiliary classification task predicts which spherical region each point belongs to.
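The coordinate transform itself is straightforward; in the sketch below, the neutral center point and the region granularity are assumptions, and the paper's exact partitioning may differ.

```python
# Sketch: map VAD to spherical coordinates, then quantize the angles into
# region labels for the auxiliary classification task.
import numpy as np

def vad_to_spherical(v, a, d, center=0.5):
    x, y, z = v - center, a - center, d - center
    r = np.sqrt(x**2 + y**2 + z**2)
    azimuth = np.arctan2(y, x)                          # angle in V-A plane
    elevation = np.arctan2(z, np.sqrt(x**2 + y**2))
    return r, azimuth, elevation

def spherical_region(azimuth, elevation, n_az=4, n_el=2):
    az_bin = int((azimuth + np.pi) / (2 * np.pi) * n_az) % n_az
    el_bin = int((elevation + np.pi / 2) / np.pi * n_el) % n_el
    return az_bin * n_el + el_bin                       # class label
```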
arXiv Detail & Related papers (2025-05-26T08:50:23Z) - Self-Attention Generative Adversarial Network for Speech Enhancement [37.14341228976058]
Existing generative adversarial networks (GANs) for speech enhancement rely solely on the convolution operation.
We propose a self-attention layer adapted from non-local attention, coupled with the convolutional and deconvolutional layers of a speech enhancement GAN.
Experiments show that introducing self-attention to SEGAN leads to consistent improvement across the objective evaluation metrics of enhancement performance.
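A non-local self-attention layer for 1-D convolutional feature maps, in the spirit described above, could look like the sketch below; the channel-reduction factor and the dot-product scaling are assumptions.

```python
# Sketch: non-local self-attention over time for (B, C, T) feature maps,
# added as a gated residual so training starts from the identity map.
import torch
import torch.nn as nn

class SelfAttention1d(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.q = nn.Conv1d(channels, channels // reduction, 1)
        self.k = nn.Conv1d(channels, channels // reduction, 1)
        self.v = nn.Conv1d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))    # residual gate

    def forward(self, x):                            # x: (B, C, T)
        q = self.q(x).transpose(1, 2)                # (B, T, C/r)
        k = self.k(x)                                # (B, C/r, T)
        attn = torch.softmax(q @ k / k.shape[1] ** 0.5, dim=-1)  # (B, T, T)
        out = self.v(x) @ attn.transpose(1, 2)       # (B, C, T)
        return x + self.gamma * out
```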
arXiv Detail & Related papers (2020-10-18T22:59:07Z) - Speaker-change Aware CRF for Dialogue Act Classification [0.0]
Recent work in Dialogue Act (DA) classification approaches the task as a sequence labeling problem.
This paper proposes a simple modification of the CRF layer that takes speaker-change into account.
arXiv Detail & Related papers (2020-04-06T18:03:06Z)