AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio
Visual Event Localization
- URL: http://arxiv.org/abs/2210.05060v1
- Date: Tue, 11 Oct 2022 00:15:45 GMT
- Title: AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio
Visual Event Localization
- Authors: Tanvir Mahmud, Diana Marculescu
- Abstract summary: We introduce AVE-CLIP, a novel framework that integrates the AudioCLIP pre-trained on large-scale audio-visual data with a multi-window temporal transformer.
Our method achieves state-of-the-art performance on the publicly available AVE dataset with a 5.9% mean accuracy improvement.
- Score: 14.103742565510387
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: An audio-visual event (AVE) is denoted by the correspondence of the visual
and auditory signals in a video segment. Precise localization of AVEs is
very challenging since it demands effective multi-modal feature correspondence
to ground short- and long-range temporal interactions. Existing approaches
struggle to capture the different scales of multi-modal interaction due to
ineffective multi-modal training strategies. To overcome this limitation, we
introduce AVE-CLIP, a novel framework that integrates the AudioCLIP pre-trained
on large-scale audio-visual data with a multi-window temporal transformer to
effectively operate on different temporal scales of video frames. Our
contributions are three-fold: (1) We introduce a multi-stage training framework
to incorporate AudioCLIP pre-trained with audio-image pairs into the AVE
localization task on video frames through contrastive fine-tuning, effective
mean video feature extraction, and multi-scale training phases. (2) We propose
a multi-domain attention mechanism that operates on both temporal and feature
domains over varying timescales to fuse the local and global feature
variations. (3) We introduce a temporal refining scheme with event-guided
attention followed by a simple yet effective post-processing step to handle
significant variations of the background over diverse events. Our method
achieves state-of-the-art performance on the publicly available AVE dataset
with a 5.9% mean accuracy improvement, demonstrating its superiority over existing
approaches.
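Illustrative sketch (not taken from the paper's code): the multi-window temporal transformer described in the abstract can be pictured as self-attention applied within temporal windows of several sizes, with the per-scale outputs then fused. The PyTorch snippet below is a minimal sketch under that reading; the class name MultiWindowTemporalAttention, the window sizes (2, 4, 10), and the average-based fusion are assumptions for illustration, and the paper's feature-domain attention and event-guided refinement are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiWindowTemporalAttention(nn.Module):
    # Self-attention applied within non-overlapping temporal windows of
    # several sizes, followed by a simple fusion of the per-scale outputs.
    def __init__(self, dim=256, num_heads=4, window_sizes=(2, 4, 10)):
        super().__init__()
        self.window_sizes = window_sizes
        # One transformer encoder layer per temporal window size (assumed).
        self.encoders = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                       batch_first=True)
            for _ in window_sizes
        ])
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, time, dim) fused audio-visual segment features,
        # e.g. extracted per one-second segment with AudioCLIP encoders.
        b, t, d = x.shape
        outputs = []
        for w, encoder in zip(self.window_sizes, self.encoders):
            pad = (-t) % w                     # pad so time divides by w
            xp = F.pad(x, (0, 0, 0, pad))
            # Attend within windows of size w: small windows capture local
            # interactions, the full-length window captures global context.
            xw = xp.reshape(b * (xp.shape[1] // w), w, d)
            y = encoder(xw).reshape(b, -1, d)[:, :t]
            outputs.append(y)
        # Fuse per-scale outputs by averaging (an assumption, not the
        # paper's exact fusion).
        return self.norm(torch.stack(outputs, dim=0).mean(dim=0))

if __name__ == "__main__":
    feats = torch.randn(2, 10, 256)  # 2 clips x 10 one-second segments
    print(MultiWindowTemporalAttention()(feats).shape)  # (2, 10, 256)

Small windows restrict attention to neighbouring one-second segments and capture local audio-visual correspondence, while the full-length window models global context over the clip; in AVE-CLIP the multi-domain attention additionally operates over the feature dimension and is followed by event-guided temporal refinement, which this sketch does not reproduce.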
Related papers
- Integrating Audio Narrations to Strengthen Domain Generalization in Multimodal First-Person Action Recognition [28.49695567630899]
First-person activity recognition is rapidly growing due to the widespread use of wearable cameras.
We propose a framework that improves domain generalization by integrating motion, audio, and appearance features.
Our approach achieves state-of-the-art performance on the ARGO1M dataset, effectively generalizing across unseen scenarios and locations.
arXiv Detail & Related papers (2024-09-15T04:43:00Z)
- Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization [50.122441710500055]
Dense-localization Audio-Visual Events (DAVE) aims to identify time boundaries and corresponding categories for events that can be heard and seen concurrently in an untrimmed video.
Existing methods typically encode audio and visual representation separately without any explicit cross-modal alignment constraint.
We present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE.
arXiv Detail & Related papers (2024-09-12T11:54:25Z)
- CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization [11.525177542345215]
We introduce CACE-Net, which differs from most existing methods that solely use audio signals to guide visual information.
We propose an audio-visual co-guidance attention mechanism that allows for adaptive bi-directional cross-modal attentional guidance.
Experiments on the AVE dataset demonstrate that CACE-Net sets a new benchmark in the audio-visual event localization task.
arXiv Detail & Related papers (2024-08-04T07:48:12Z)
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- Multi-Resolution Audio-Visual Feature Fusion for Temporal Action Localization [8.633822294082943]
This paper introduces Multi-Resolution Audio-Visual Feature Fusion (MRAV-FF), an innovative method for merging audio-visual data across different temporal resolutions.
arXiv Detail & Related papers (2023-10-05T10:54:33Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Accommodating Audio Modality in CLIP for Multimodal Processing [48.83906067348211]
We extend the Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.
Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and other modalities.
Our proposed CLIP4VLA model is validated in different downstream tasks including video retrieval and video captioning.
arXiv Detail & Related papers (2023-03-12T06:57:01Z)
- End-to-end Multi-modal Video Temporal Grounding [105.36814858748285]
We propose a multi-modal framework to extract complementary information from videos.
We adopt RGB images for appearance, optical flow for motion, and depth maps for image structure.
We conduct experiments on the Charades-STA and ActivityNet Captions datasets, and show that the proposed method performs favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2021-07-12T17:58:10Z)
- Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead [88.17413955380262]
We introduce a novel architecture for early exiting based on the vision transformer architecture.
We show that our method works for both classification and regression problems.
We also introduce a novel method for integrating audio and visual modalities within early exits in audiovisual data analysis.
arXiv Detail & Related papers (2021-05-19T13:30:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.