Masked Autoencoders with Multi-Window Local-Global Attention Are Better
Audio Learners
- URL: http://arxiv.org/abs/2306.00561v2
- Date: Sun, 1 Oct 2023 21:53:36 GMT
- Title: Masked Autoencoders with Multi-Window Local-Global Attention Are Better
Audio Learners
- Authors: Sarthak Yadav, Sergios Theodoridis, Lars Kai Hansen and Zheng-Hua Tan
- Abstract summary: A Multi-Window Masked Autoencoder (MW-MAE) is fitted with a novel Multi-Window Multi-Head Attention (MW-MHA) module.
MW-MAEs consistently outperform standard MAEs in overall performance and learn better general-purpose audio representations.
- Score: 17.747301325787618
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we propose a Multi-Window Masked Autoencoder (MW-MAE) fitted
with a novel Multi-Window Multi-Head Attention (MW-MHA) module that facilitates
the modelling of local-global interactions in every decoder transformer block
through attention heads of several distinct local and global windows. Empirical
results on ten downstream audio tasks show that MW-MAEs consistently outperform
standard MAEs in overall performance and learn better general-purpose audio
representations, along with demonstrating considerably better scaling
characteristics. Investigating attention distances and entropies reveals that
MW-MAE encoders learn heads with broader local and global attention. Analyzing
attention head feature representations through Projection Weighted Canonical
Correlation Analysis (PWCCA) shows that attention heads with the same window
sizes across the decoder layers of the MW-MAE learn correlated feature
representations, which enables each block to independently capture local and
global information, leading to a decoupled decoder feature hierarchy. Code for
feature extraction and downstream experiments, along with pre-trained models,
will be released publicly.
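To make the decoder mechanism concrete, below is a minimal PyTorch sketch of how a multi-window multi-head attention block could look: each head attends either within a fixed local window or globally over the whole patch sequence. The class name, window assignment, and band-mask construction are illustrative assumptions for a 1D sequence of audio patch embeddings, not the authors' released implementation.

```python
# Illustrative sketch of multi-window multi-head attention (MW-MHA-style):
# each head gets its own local window size, with None marking a global head.
# This is an assumption-based example, not the paper's official code.
from typing import List, Optional

import torch
import torch.nn as nn


class MultiWindowMHA(nn.Module):
    def __init__(self, dim: int, num_heads: int, window_sizes: List[Optional[int]]):
        super().__init__()
        assert dim % num_heads == 0 and len(window_sizes) == num_heads
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.window_sizes = window_sizes
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def _head_masks(self, seq_len: int, device: torch.device) -> torch.Tensor:
        # Boolean mask of shape (heads, N, N); True = attention allowed.
        idx = torch.arange(seq_len, device=device)
        dist = (idx[None, :] - idx[:, None]).abs()
        masks = []
        for w in self.window_sizes:
            if w is None:  # global head: attend everywhere
                masks.append(torch.ones(seq_len, seq_len, dtype=torch.bool, device=device))
            else:          # local head: band mask of width +/- w // 2
                masks.append(dist <= w // 2)
        return torch.stack(masks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        allowed = self._head_masks(N, x.device)          # (heads, N, N)
        attn = attn.masked_fill(~allowed[None], float("-inf"))
        out = attn.softmax(dim=-1) @ v                   # (B, heads, N, head_dim)
        out = out.transpose(1, 2).reshape(B, N, D)
        return self.proj(out)


if __name__ == "__main__":
    # Toy usage: six local heads with growing windows plus two global heads.
    mha = MultiWindowMHA(dim=64, num_heads=8,
                         window_sizes=[3, 3, 7, 7, 15, 31, None, None])
    tokens = torch.randn(2, 100, 64)                     # (batch, patches, dim)
    print(mha(tokens).shape)                             # torch.Size([2, 100, 64])
```

In the paper this kind of module sits in every decoder transformer block, so that local and global interactions are modelled at each block rather than only through depth; the specific window sizes above are placeholders.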
Related papers
- MTCAE-DFER: Multi-Task Cascaded Autoencoder for Dynamic Facial Expression Recognition [0.19285000127136376]
This paper expands the cascaded network branch of the autoencoder-based multi-task learning (MTL) framework for dynamic facial expression recognition.
We utilize autoencoder-based multi-task cascaded learning approach to explore the impact of dynamic face detection and dynamic face landmark on dynamic facial expression recognition.
arXiv Detail & Related papers (2024-12-25T21:52:31Z) - MU-MAE: Multimodal Masked Autoencoders-Based One-Shot Learning [3.520960737058199]
We introduce Multimodal Masked Autoencoders-Based One-Shot Learning (Mu-MAE).
Mu-MAE integrates a multimodal masked autoencoder with a synchronized masking strategy tailored for wearable sensors.
It achieves up to 80.17% accuracy in five-way one-shot multimodal classification without the use of additional data.
arXiv Detail & Related papers (2024-08-08T06:16:00Z) - INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model [71.50973774576431]
We propose a novel MLLM, INF-LLaVA, designed for effective high-resolution image perception.
We introduce a Dual-perspective Cropping Module (DCM), which ensures that each sub-image contains continuous details from a local perspective.
Second, we introduce Dual-perspective Enhancement Module (DEM) to enable the mutual enhancement of global and local features.
arXiv Detail & Related papers (2024-07-23T06:02:30Z) - Local-to-Global Cross-Modal Attention-Aware Fusion for HSI-X Semantic Segmentation [19.461033552684576]
We propose a Local-to-Global Cross-modal Attention-aware Fusion (LoGoCAF) framework for HSI-X classification.
LoGoCAF adopts a pixel-to-pixel two-branch semantic segmentation architecture to learn information from HSI and X modalities.
arXiv Detail & Related papers (2024-06-25T16:12:20Z) - Multi-view Aggregation Network for Dichotomous Image Segmentation [76.75904424539543]
Dichotomous Image Segmentation (DIS) has recently emerged for high-precision object segmentation from high-resolution natural images.
Existing methods rely on tedious multiple encoder-decoder streams and stages to gradually complete the global localization and local refinement.
We model DIS as a multi-view object perception problem and provide a parsimonious multi-view aggregation network (MVANet).
Experiments on the popular DIS-5K dataset show that our MVANet significantly outperforms state-of-the-art methods in both accuracy and speed.
arXiv Detail & Related papers (2024-04-11T03:00:00Z) - Generalizable Entity Grounding via Assistance of Large Language Model [77.07759442298666]
We propose a novel approach to densely ground visual entities from a long caption.
We leverage a large multimodal model to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and a multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask.
arXiv Detail & Related papers (2024-02-04T16:06:05Z) - Learning to Fuse Monocular and Multi-view Cues for Multi-frame Depth
Estimation in Dynamic Scenes [51.20150148066458]
We propose a novel method to learn to fuse the multi-view and monocular cues encoded as volumes without needing heuristically crafted masks.
Experiments on real-world datasets demonstrate the significant effectiveness and generalization ability of the proposed method.
arXiv Detail & Related papers (2023-04-18T13:55:24Z) - MAFormer: A Transformer Network with Multi-scale Attention Fusion for
Visual Recognition [45.68567088645708]
We introduce Multi-scale Attention Fusion into a transformer (MAFormer).
MAFormer explores local aggregation and global feature extraction in a dual-stream framework for visual recognition.
Our MAFormer achieves state-of-the-art performance on common vision tasks.
arXiv Detail & Related papers (2022-08-31T06:29:27Z) - Encoder Fusion Network with Co-Attention Embedding for Referring Image
Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
The experiment results on four benchmark datasets demonstrate that the proposed approach achieves the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z)