Mettle: Meta-Token Learning for Memory-Efficient Audio-Visual Adaptation
- URL: http://arxiv.org/abs/2506.23271v1
- Date: Sun, 29 Jun 2025 14:52:01 GMT
- Title: Mettle: Meta-Token Learning for Memory-Efficient Audio-Visual Adaptation
- Authors: Jinxing Zhou, Zhihui Li, Yongqiang Yu, Yanghao Zhou, Ruohao Guo, Guangyao Li, Yuxin Mao, Mingfei Han, Xiaojun Chang, Meng Wang
- Abstract summary: We present Meta-Token Learning (Mettle), a memory-efficient method for adapting pretrained transformer models to audio-visual tasks. Mettle utilizes a lightweight Layer-Centric Distillation (LCD) module to distill in parallel the intact audio or visual features embedded by each transformer layer into compact meta-tokens.
- Score: 44.98679295002702
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present \textbf{Met}a-\textbf{T}oken \textbf{Le}arning (Mettle), a simple and memory-efficient method for adapting large-scale pretrained transformer models to downstream audio-visual tasks. Instead of sequentially modifying the output feature distribution of the transformer backbone, Mettle utilizes a lightweight \textit{Layer-Centric Distillation (LCD)} module to distill in parallel the intact audio or visual features embedded by each transformer layer into compact meta-tokens. This distillation process considers both pretrained knowledge preservation and task-specific adaptation. The obtained meta-tokens can be directly applied to classification tasks, such as audio-visual event localization and audio-visual video parsing. To further support fine-grained segmentation tasks, such as audio-visual segmentation, we introduce a \textit{Meta-Token Injection (MTI)} module, which utilizes the audio and visual meta-tokens distilled from the top transformer layer to guide feature adaptation in earlier layers. Extensive experiments on multiple audiovisual benchmarks demonstrate that our method significantly reduces memory usage and training time while maintaining parameter efficiency and competitive accuracy.
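As a rough illustration of the layer-centric distillation idea described above, the sketch below (hypothetical code, not the authors' implementation) pools each frozen transformer layer's token sequence into a few learnable meta-tokens via cross-attention; the per-layer meta-tokens are then averaged for a simple classification head.

```python
import torch
import torch.nn as nn

class LayerCentricDistiller(nn.Module):
    """Hypothetical sketch: distill one frozen layer's tokens into K meta-tokens."""
    def __init__(self, dim: int, num_meta_tokens: int = 4, num_heads: int = 8):
        super().__init__()
        self.meta_tokens = nn.Parameter(torch.randn(1, num_meta_tokens, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, layer_feats: torch.Tensor) -> torch.Tensor:
        # layer_feats: (B, N, D) token sequence from one frozen transformer layer.
        queries = self.meta_tokens.expand(layer_feats.size(0), -1, -1)
        distilled, _ = self.cross_attn(queries, layer_feats, layer_feats)
        return self.norm(distilled)  # (B, K, D) compact meta-tokens

# One distiller per layer; they run in parallel on cached layer outputs.
layers_out = [torch.randn(2, 196, 768) for _ in range(12)]   # dummy ViT layer outputs
distillers = nn.ModuleList(LayerCentricDistiller(768) for _ in layers_out)
meta = torch.stack([d(f) for d, f in zip(distillers, layers_out)], dim=1)  # (B, L, K, D)
logits = nn.Linear(768, 10)(meta.mean(dim=(1, 2)))  # simple pooled classifier head
```

Because the distillers only read cached layer outputs, no activations need to be stored for the frozen backbone, which is where the memory savings would come from under this reading of the abstract.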
Related papers
- Fast SAM2 with Text-Driven Token Pruning [52.8350457627401]
Segment Anything Model 2 (SAM2), a vision computation model, has significantly advanced prompt-driven video object segmentation.
SAM2 pipelines propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object.
We introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation.
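A generic sketch of the token-pruning step, assuming cosine similarity between a pooled text-prompt embedding and the encoder's visual tokens as the relevance score; the `prune_tokens` helper and the scoring rule are assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

def prune_tokens(visual_tokens: torch.Tensor,
                 text_embed: torch.Tensor,
                 keep_ratio: float = 0.3) -> torch.Tensor:
    """Hypothetical sketch: keep only the visual tokens most relevant to a text prompt.

    visual_tokens: (B, N, D) tokens from the image encoder
    text_embed:    (B, D)    pooled text-prompt embedding
    """
    scores = F.cosine_similarity(visual_tokens, text_embed.unsqueeze(1), dim=-1)  # (B, N)
    k = max(1, int(visual_tokens.size(1) * keep_ratio))
    idx = scores.topk(k, dim=1).indices                      # indices of the top-k tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1))
    return visual_tokens.gather(1, idx)                      # (B, k, D) pruned token set

pruned = prune_tokens(torch.randn(2, 4096, 256), torch.randn(2, 256))
```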
arXiv Detail & Related papers (2025-12-24T18:59:05Z) - MoLT: Mixture of Layer-Wise Tokens for Efficient Audio-Visual Learning [38.95630141905818]
Mixture of Layer-Wise Tokens (MoLT) is a parameter- and memory-efficient adaptation framework for audio-visual learning.
We adopt two types of adapters to distill modality-specific information and cross-modal interaction into compact latent tokens in a layer-wise manner.
A token fusion module then dynamically fuses these layer-wise tokens by taking into account their relative significance.
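A minimal sketch of fusing layer-wise latent tokens by learned relative significance; the gating design and names below are assumptions, not MoLT's actual module.

```python
import torch
import torch.nn as nn

class LayerTokenFusion(nn.Module):
    """Hypothetical sketch: weight per-layer latent tokens by learned significance."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # scores each layer's pooled token

    def forward(self, layer_tokens: torch.Tensor) -> torch.Tensor:
        # layer_tokens: (B, L, K, D) latent tokens collected from L layers.
        pooled = layer_tokens.mean(dim=2)                  # (B, L, D)
        weights = self.score(pooled).softmax(dim=1)        # (B, L, 1) relative significance
        return (pooled * weights).sum(dim=1)               # (B, D) fused representation

fused = LayerTokenFusion(768)(torch.randn(2, 12, 4, 768))
```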
arXiv Detail & Related papers (2025-11-27T14:32:55Z) - Complementary and Contrastive Learning for Audio-Visual Segmentation [74.11434759171199]
We present Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information.
Our method sets new state-of-the-art benchmarks across the S4, MS3 and AVSS datasets.
arXiv Detail & Related papers (2025-10-11T06:36:59Z) - Revisiting Audio-Visual Segmentation with Vision-Centric Transformer [60.83798235788669]
Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal.
We propose a new Vision-Centric Transformer framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information.
Our framework achieves new state-of-the-art performance on three subsets of the AVSBench dataset.
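A rough sketch of the vision-centric querying pattern, assuming a pooled vision-derived query that cross-attends to audio and then visual tokens; this is an interpretation of the abstract, not the paper's architecture.

```python
import torch
import torch.nn as nn

class VisionCentricQuerying(nn.Module):
    """Hypothetical sketch: vision-derived queries fetch audio, then visual, information."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.to_query = nn.Linear(dim, dim)
        self.audio_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vis_feats: torch.Tensor, aud_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (B, Nv, D) visual tokens; aud_feats: (B, Na, D) audio tokens.
        q = self.to_query(vis_feats.mean(dim=1, keepdim=True))   # (B, 1, D) vision-derived query
        q, _ = self.audio_attn(q, aud_feats, aud_feats)          # fetch audio cues
        q, _ = self.visual_attn(q, vis_feats, vis_feats)         # refine with visual cues
        return q                                                 # (B, 1, D) fused query

out = VisionCentricQuerying(256)(torch.randn(2, 196, 256), torch.randn(2, 10, 256))
```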
arXiv Detail & Related papers (2025-06-30T08:40:36Z) - AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation [33.315479764894086]
We introduce AV-DiT, a novel and efficient audio-visual diffusion transformer.
A shared DiT backbone pre-trained on image-only data facilitates both audio and video generation.
Experiments on the AIST++ and Landscape datasets demonstrate that AV-DiT achieves state-of-the-art performance in joint audio-visual generation.
arXiv Detail & Related papers (2024-06-11T20:05:58Z) - MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers [41.54004590821323]
We propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignment for multimodal semantic features.
Specifically, we introduce joint unimodal and multimodal token learning for aligning the two modalities with a frozen modality-shared transformer.
Unlike prior work that only aligns coarse features from the output of unimodal encoders, we introduce blockwise contrastive learning to align coarse-to-fine-grain hierarchical features.
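A minimal InfoNCE-style sketch of aligning pooled audio and visual features at a single block; applying it at every block would give the coarse-to-fine alignment the abstract describes. The pooling, temperature, and function name are assumptions.

```python
import torch
import torch.nn.functional as F

def blockwise_contrastive_loss(audio_feats: torch.Tensor,
                               visual_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical sketch: symmetric InfoNCE between pooled audio/visual block features.

    audio_feats, visual_feats: (B, N, D) token features from the same transformer block.
    """
    a = F.normalize(audio_feats.mean(dim=1), dim=-1)     # (B, D) pooled, unit-norm
    v = F.normalize(visual_feats.mean(dim=1), dim=-1)
    logits = a @ v.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = blockwise_contrastive_loss(torch.randn(4, 10, 768), torch.randn(4, 196, 768))
```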
arXiv Detail & Related papers (2024-06-07T13:35:44Z) - Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference [14.030836300221756]
Sparse-Tuning is a novel PEFT method that accounts for the information redundancy in images and videos.
Sparse-Tuning minimizes the quantity of tokens processed at each layer, leading to a quadratic reduction in computational and memory overhead.
Our results show that Sparse-Tuning reduces GFLOPs to 62%-70% of the original ViT-B while achieving state-of-the-art performance.
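A rough sketch of per-layer token reduction, assuming tokens are ranked by their attention from the [CLS] token, the top fraction kept, and the remainder merged into one summary token; the actual Sparse-Tuning criterion may differ.

```python
import torch

def reduce_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor,
                  keep_ratio: float = 0.7) -> torch.Tensor:
    """Hypothetical sketch: keep the most CLS-attended tokens, merge the rest.

    tokens:   (B, N, D) patch tokens of one layer (CLS token excluded)
    cls_attn: (B, N)    attention weights from the CLS token to each patch token
    """
    k = max(1, int(tokens.size(1) * keep_ratio))           # assumes keep_ratio < 1
    order = cls_attn.argsort(dim=1, descending=True)
    keep_idx = order[:, :k].unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    drop_idx = order[:, k:].unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    kept = tokens.gather(1, keep_idx)                                # (B, k, D)
    merged = tokens.gather(1, drop_idx).mean(dim=1, keepdim=True)    # (B, 1, D) summary token
    return torch.cat([kept, merged], dim=1)                          # (B, k+1, D)

out = reduce_tokens(torch.randn(2, 196, 768), torch.rand(2, 196))
```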
arXiv Detail & Related papers (2024-05-23T15:34:53Z) - Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
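A minimal sketch of the co-prediction idea: each modality's representation of the same sound source is trained to predict the other through a small predictor head. The predictor design, stop-gradient, and MSE loss below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoPredictor(nn.Module):
    """Hypothetical sketch: each modality's representation predicts the other's."""
    def __init__(self, dim: int):
        super().__init__()
        self.a2v = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.v2a = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, audio_repr: torch.Tensor, visual_repr: torch.Tensor) -> torch.Tensor:
        # Stop-gradient on the prediction targets, as in many co-prediction setups.
        loss_a2v = F.mse_loss(self.a2v(audio_repr), visual_repr.detach())
        loss_v2a = F.mse_loss(self.v2a(visual_repr), audio_repr.detach())
        return loss_a2v + loss_v2a

loss = CoPredictor(512)(torch.randn(4, 512), torch.randn(4, 512))
```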
arXiv Detail & Related papers (2023-06-19T03:10:57Z) - Vision Transformers are Parameter-Efficient Audio-Visual Learners [95.59258503297195]
We propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks.
Our approach achieves competitive or even better performance on various audio-visual tasks.
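A rough sketch of a latent-token adapter in the LAVISH spirit: a few learnable latent tokens summarize one modality and inject that summary into the other via a residual update. The module below is an assumption, not the paper's adapter.

```python
import torch
import torch.nn as nn

class LatentAVAdapter(nn.Module):
    """Hypothetical sketch: a few latent tokens mediate cross-modal information flow."""
    def __init__(self, dim: int, num_latents: int = 8, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(1, num_latents, dim) * 0.02)
        self.compress = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.inject = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, src_tokens: torch.Tensor, tgt_tokens: torch.Tensor) -> torch.Tensor:
        # src_tokens: (B, Ns, D) tokens of the source modality (e.g., audio)
        # tgt_tokens: (B, Nt, D) tokens of the target modality (e.g., visual)
        lat = self.latents.expand(src_tokens.size(0), -1, -1)
        lat, _ = self.compress(lat, src_tokens, src_tokens)      # summarize source modality
        fused, _ = self.inject(tgt_tokens, lat, lat)             # inject summary into target
        return tgt_tokens + fused                                # residual adapter update

out = LatentAVAdapter(768)(torch.randn(2, 64, 768), torch.randn(2, 196, 768))
```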
arXiv Detail & Related papers (2022-12-15T17:31:54Z) - ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification [42.95038619688867]
ASiT is a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation.
We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification.
arXiv Detail & Related papers (2022-11-23T18:21:09Z) - Audiomer: A Convolutional Transformer for Keyword Spotting [0.0]
We introduce Audiomer, where we combine 1D Residual Networks with Performer Attention to achieve state-of-the-art performance in Keyword Spotting.
Audiomer allows for deployment in compute-constrained devices and training on smaller datasets.
arXiv Detail & Related papers (2021-09-21T15:28:41Z) - End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented Transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
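A minimal sketch of a hybrid CTC/attention objective, interpolating the two losses with a weighting factor; the tensor shapes and the 0.3 weight are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(log_probs, attn_logits, targets,
                              input_lens, target_lens, ctc_weight: float = 0.3):
    """Hypothetical sketch of a hybrid CTC/attention objective.

    log_probs:   (T, B, V) log-softmax outputs for the CTC branch
    attn_logits: (B, L, V) decoder logits for the attention branch
    targets:     (B, L)    token ids (0 reserved for the CTC blank)
    """
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens, blank=0, zero_infinity=True)
    att = F.cross_entropy(attn_logits.transpose(1, 2), targets)   # (B, V, L) vs (B, L)
    return ctc_weight * ctc + (1.0 - ctc_weight) * att

T, B, V, L = 50, 2, 30, 10
loss = hybrid_ctc_attention_loss(
    torch.randn(T, B, V).log_softmax(-1), torch.randn(B, L, V),
    torch.randint(1, V, (B, L)), torch.full((B,), T), torch.full((B,), L))
```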
arXiv Detail & Related papers (2021-02-12T18:00:08Z) - Parameter Efficient Multimodal Transformers for Video Representation Learning [108.8517364784009]
This work focuses on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning.
We show that our approach reduces parameters by up to 80%, allowing us to train our model end-to-end from scratch.
To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
arXiv Detail & Related papers (2020-12-08T00:16:13Z) - Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory [23.022723184325017]
Transformer-based acoustic modeling has achieved great success for both hybrid and sequence-to-sequence speech recognition.
We propose a novel augmented-memory self-attention, which attends on a short segment of the input sequence and a bank of memories.
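A rough sketch of augmented-memory self-attention: the current short segment attends over itself concatenated with a bank of memory vectors, and the segment is pooled into a new memory slot. The pooling-based memory update below is an assumption, not the paper's exact rule.

```python
import torch
import torch.nn as nn

class AugmentedMemoryAttention(nn.Module):
    """Hypothetical sketch: attend over [memory bank; current segment]."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, segment: torch.Tensor, memory_bank: torch.Tensor):
        # segment:     (B, S, D) frames of the current short segment
        # memory_bank: (B, M, D) summaries of previously processed segments
        context = torch.cat([memory_bank, segment], dim=1)
        out, _ = self.attn(segment, context, context)          # queries see segment + memories
        new_memory = segment.mean(dim=1, keepdim=True)         # assumed: pool segment into one slot
        return out, torch.cat([memory_bank, new_memory], dim=1)

layer = AugmentedMemoryAttention(256)
out, bank = layer(torch.randn(2, 20, 256), torch.zeros(2, 1, 256))
```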
arXiv Detail & Related papers (2020-05-16T16:54:52Z)