MoLT: Mixture of Layer-Wise Tokens for Efficient Audio-Visual Learning
- URL: http://arxiv.org/abs/2512.00115v1
- Date: Thu, 27 Nov 2025 14:32:55 GMT
- Title: MoLT: Mixture of Layer-Wise Tokens for Efficient Audio-Visual Learning
- Authors: Kyeongha Rho, Hyeongkeun Lee, Jae Won Cho, Joon Son Chung
- Abstract summary: Mixture of Layer-Wise Tokens (MoLT) is a parameter- and memory-efficient adaptation framework for audio-visual learning. We adopt two types of adapters to distill modality-specific information and cross-modal interaction into compact latent tokens in a layer-wise manner. A token fusion module then dynamically fuses these layer-wise tokens by taking into account their relative significance.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper, we propose Mixture of Layer-Wise Tokens (MoLT), a parameter- and memory-efficient adaptation framework for audio-visual learning. The key idea of MoLT is to replace conventional, computationally heavy sequential adaptation at every transformer layer with a parallel, lightweight scheme that extracts and fuses layer-wise tokens only from the late layers. We adopt two types of adapters to distill modality-specific information and cross-modal interaction into compact latent tokens in a layer-wise manner. A token fusion module then dynamically fuses these layer-wise tokens according to their relative significance. To prevent redundancy among latent tokens, we apply an orthogonality regularization between them during training. Guided by a systematic analysis of where adaptation is most effective in pre-trained transformers, we extract latent tokens only from the late layers. This strategic placement avoids error propagation from volatile early-layer features, maximizing adaptation performance while maintaining parameter and memory efficiency. Through extensive experiments, we demonstrate that MoLT outperforms existing methods on diverse audio-visual benchmarks, including Audio-Visual Question Answering, Audio-Visual Segmentation, and Audio-Visual Event Localization.
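The abstract's two trainable components lend themselves to a compact sketch. Below is a minimal, hedged PyTorch rendering of (a) a token fusion module that weights layer-wise latent tokens by a learned significance score and (b) the orthogonality regularization between latent tokens; module names, shapes, and the pooling/scoring choices are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of two MoLT components described in the abstract:
# (1) significance-weighted fusion of layer-wise latent tokens, and
# (2) an orthogonality penalty discouraging redundant tokens.
# All names and shapes are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenFusion(nn.Module):
    """Fuses latent tokens from several late layers with learned weights."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one significance score per layer

    def forward(self, layer_tokens: torch.Tensor) -> torch.Tensor:
        # layer_tokens: (num_layers, batch, num_tokens, dim)
        pooled = layer_tokens.mean(dim=2)             # (L, B, D)
        weights = self.score(pooled).softmax(dim=0)   # (L, B, 1), over layers
        return (weights.unsqueeze(2) * layer_tokens).sum(dim=0)  # (B, N, D)

def orthogonality_loss(tokens: torch.Tensor) -> torch.Tensor:
    # tokens: (batch, num_tokens, dim); penalize off-diagonal entries of the
    # Gram matrix so distinct latent tokens stay close to orthogonal.
    t = F.normalize(tokens, dim=-1)
    gram = t @ t.transpose(1, 2)                      # (B, N, N)
    eye = torch.eye(gram.size(-1), device=gram.device)
    return (gram - eye).pow(2).mean()
```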
Related papers
- LayerCake: Token-Aware Contrastive Decoding within Large Language Model Layers [53.43862310647276]
Large language models (LLMs) excel at natural language understanding and generation but remain vulnerable to factual errors. We introduce a token-aware, layer-localized contrastive decoding method that aligns specific token types with their most influential transformer layers to improve factual generation. Our method requires no additional training or model modification, and experiments demonstrate that it consistently improves factuality across multiple LLMs and various benchmarks.
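As a rough illustration of layer-localized contrastive decoding, the sketch below contrasts late-layer next-token logits against logits read out from an earlier layer; the subtraction form and the `alpha` scale are assumptions in the general spirit of contrastive decoding, not LayerCake's exact rule.

```python
# Hedged sketch: demote tokens the model already "believed" at an early
# layer, sharpening the late layer's factual choices. The early-layer
# choice and alpha are illustrative assumptions.
import torch

def contrastive_logits(late_logits: torch.Tensor,
                       early_logits: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    # Both tensors: (batch, vocab_size) next-token logits.
    late = torch.log_softmax(late_logits, dim=-1)
    early = torch.log_softmax(early_logits, dim=-1)
    return late - alpha * early
```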
arXiv Detail & Related papers (2025-07-06T14:35:43Z)
- Mettle: Meta-Token Learning for Memory-Efficient Audio-Visual Adaptation [44.98679295002702]
We present Meta-Token Learning (Mettle), a memory-efficient method for adapting pretrained transformer models to audio-visual tasks. Mettle utilizes a lightweight Layer-Centric Distillation (LCD) module to distill in parallel the intact audio or visual features embedded by each transformer layer into compact meta-tokens.
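The layer-centric idea can be pictured as a small bank of learnable meta-tokens cross-attending to one frozen layer's features, compressing them in parallel with the backbone. The sketch below is an assumption-laden rendering (class name, token count, and the attention choice are illustrative), not Mettle's released code.

```python
# Hedged sketch of per-layer distillation into meta-tokens; all names,
# sizes, and the cross-attention formulation are assumptions.
import torch
import torch.nn as nn

class LayerDistiller(nn.Module):
    def __init__(self, dim: int, num_meta: int = 8, heads: int = 4):
        super().__init__()
        self.meta = nn.Parameter(torch.randn(1, num_meta, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, layer_feats: torch.Tensor) -> torch.Tensor:
        # layer_feats: (batch, seq, dim) from one frozen transformer layer.
        q = self.meta.expand(layer_feats.size(0), -1, -1)
        out, _ = self.attn(q, layer_feats, layer_feats)
        return out  # (batch, num_meta, dim) compact meta-tokens
```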
arXiv Detail & Related papers (2025-06-29T14:52:01Z)
- Enhancing Latent Computation in Transformers with Latent Tokens [48.371764897314]
Augmenting large language models with auxiliary tokens has emerged as a promising strategy for enhancing model performance. We introduce a lightweight method termed latent tokens; these are dummy tokens that may be non-interpretable in natural language. The proposed latent tokens can be seamlessly integrated with a pre-trained Transformer, trained in a parameter-efficient manner, and applied flexibly at inference time.
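A minimal sketch of the idea, assuming the backbone consumes token embeddings directly: learnable latent tokens are prepended to the input of a frozen transformer and are the only trained parameters. The wrapper name, token count, and zero initialization are illustrative assumptions.

```python
# Hedged sketch: trainable dummy tokens on a frozen backbone.
import torch
import torch.nn as nn

class LatentTokenWrapper(nn.Module):
    def __init__(self, backbone: nn.Module, dim: int, num_latent: int = 4):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False            # parameter-efficient: freeze
        self.latent = nn.Parameter(torch.zeros(1, num_latent, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) token embeddings; latent tokens are prepended.
        lat = self.latent.expand(x.size(0), -1, -1)
        return self.backbone(torch.cat([lat, x], dim=1))
```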
arXiv Detail & Related papers (2025-05-19T02:35:53Z)
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding [54.532578213126065]
Most document understanding methods preserve all tokens within sub-images and treat them equally.
This ignores differences in their informativeness and leads to a significant increase in the number of image tokens.
We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
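One parameter-free way to realize correlation-guided compression is a greedy filter that drops tokens highly correlated with tokens already kept. The sketch below is such a filter under an assumed cosine-similarity score and threshold; it is not the paper's exact procedure.

```python
# Hedged sketch: keep a token only if it is not redundant with the
# already-kept set. Threshold and greedy ordering are assumptions.
import torch
import torch.nn.functional as F

def compress_tokens(tokens: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    # tokens: (num_tokens, dim) for one image; returns the kept subset.
    t = F.normalize(tokens, dim=-1)
    kept = [0]                                  # always keep the first token
    for i in range(1, t.size(0)):
        sim = t[i] @ t[kept].T                  # correlation with kept tokens
        if sim.max() < threshold:               # keep only if not redundant
            kept.append(i)
    return tokens[kept]
```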
arXiv Detail & Related papers (2024-07-19T16:11:15Z)
- MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers [41.54004590821323]
We propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignment for multimodal semantic features.
Specifically, we introduce joint unimodal and multimodal token learning for aligning the two modalities with a frozen modality-shared transformer.
Unlike prior work that only aligns coarse features from the output of unimodal encoders, we introduce blockwise contrastive learning to align coarse-to-fine-grained hierarchical features.
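Blockwise contrastive alignment can be sketched as a symmetric InfoNCE loss applied to pooled audio and visual features at every transformer block; the pooling, temperature, and loss symmetrization below are assumptions, not MA-AVT's exact objective.

```python
# Hedged sketch: per-block symmetric InfoNCE between pooled audio and
# visual features of the same clip. Temperature tau is an assumption.
import torch
import torch.nn.functional as F

def blockwise_contrastive_loss(audio_feats, visual_feats, tau: float = 0.07):
    # Both arguments: lists over blocks of (batch, dim) pooled features.
    total = 0.0
    for a, v in zip(audio_feats, visual_feats):
        a, v = F.normalize(a, dim=-1), F.normalize(v, dim=-1)
        logits = a @ v.T / tau                          # (B, B) similarities
        labels = torch.arange(a.size(0), device=a.device)
        total += F.cross_entropy(logits, labels) + \
                 F.cross_entropy(logits.T, labels)
    return total / (2 * len(audio_feats))
```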
arXiv Detail & Related papers (2024-06-07T13:35:44Z)
- Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs).
For vision transformers, however, we identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies.
We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.
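The core bookkeeping can be sketched as per-token label mixing: each token receives the label of the source image its patch actually came from, given a binary token mask derived from the CutMix box. The mask construction, names, and shapes below are assumptions, not TL-Align's full tracing procedure.

```python
# Hedged sketch: per-token labels after CutMix, instead of one mixed
# label for the whole image. token_mask construction is assumed given.
import torch

def token_labels(labels_a: torch.Tensor, labels_b: torch.Tensor,
                 token_mask: torch.Tensor) -> torch.Tensor:
    # labels_*: (batch, num_classes) one-hot labels of the two mixed images.
    # token_mask: (batch, num_tokens), 1 where a patch was pasted from image b.
    m = token_mask.unsqueeze(-1).float()          # (B, N, 1)
    return (1 - m) * labels_a.unsqueeze(1) + m * labels_b.unsqueeze(1)
```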
arXiv Detail & Related papers (2022-10-12T17:54:32Z)