AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization
- URL: http://arxiv.org/abs/2210.05060v1
- Date: Tue, 11 Oct 2022 00:15:45 GMT
- Title: AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization
- Authors: Tanvir Mahmud, Diana Marculescu
- Abstract summary: We introduce AVE-CLIP, a novel framework that integrates the AudioCLIP pre-trained on large-scale audio-visual data with a multi-window temporal transformer.
Our method achieves state-of-the-art performance on the publicly available AVE dataset with a 5.9% mean accuracy improvement.
- Score: 14.103742565510387
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: An audio-visual event (AVE) is denoted by the correspondence of the visual
and auditory signals in a video segment. Precise localization of AVEs is
very challenging since it demands effective multi-modal feature correspondence
to ground the short- and long-range temporal interactions. Existing approaches
struggle to capture the different scales of multi-modal interaction due to
ineffective multi-modal training strategies. To overcome this limitation, we
introduce AVE-CLIP, a novel framework that integrates AudioCLIP, pre-trained
on large-scale audio-visual data, with a multi-window temporal transformer to
effectively operate on different temporal scales of video frames. Our
contributions are three-fold: (1) We introduce a multi-stage training framework
to incorporate AudioCLIP, pre-trained on audio-image pairs, into the AVE
localization task on video frames through contrastive fine-tuning, effective
mean video feature extraction, and multi-scale training phases. (2) We propose
a multi-domain attention mechanism that operates on both temporal and feature
domains over varying timescales to fuse the local and global feature
variations. (3) We introduce a temporal refining scheme with event-guided
attention, followed by a simple-yet-effective post-processing step to handle
significant variations of the background over diverse events. Our method
achieves state-of-the-art performance on the publicly available AVE dataset
with a 5.9% mean accuracy improvement, demonstrating its superiority over
existing approaches.
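The multi-window idea above can be sketched as self-attention applied within temporal windows of several sizes, with the per-scale outputs fused so that both local and global interactions contribute. The window sizes and the simple averaging fusion below are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def windowed_self_attention(x, window):
    """Self-attention restricted to non-overlapping temporal windows.
    x: (T, d) fused audio-visual features for T video segments."""
    T, d = x.shape
    out = np.zeros_like(x)
    for start in range(0, T, window):
        chunk = x[start:start + window]         # (w, d) one temporal window
        scores = chunk @ chunk.T / np.sqrt(d)   # (w, w) pairwise similarity
        out[start:start + window] = softmax(scores) @ chunk
    return out

def multi_window_fusion(x, windows=(2, 4, 10)):
    """Average attention outputs computed at several temporal scales,
    so both short- and long-range interactions shape each segment."""
    return np.mean([windowed_self_attention(x, w) for w in windows], axis=0)
```

A window of 10 spans a full 10-second AVE clip (global context), while a window of 2 captures only adjacent-segment interactions (local context).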
Related papers
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- Multi-Resolution Audio-Visual Feature Fusion for Temporal Action Localization [8.633822294082943]
This paper introduces the Multi-Resolution Audio-Visual Feature Fusion (MRAV-FF)
MRAV-FF is an innovative method to merge audio-visual data across different temporal resolutions.
arXiv Detail & Related papers (2023-10-05T10:54:33Z)
- RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation [18.93255531121519]
We present a novel time-frequency domain audio-visual speech separation method.
RTFS-Net applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform.
This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
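The complex time-frequency bins that RTFS-Net operates on come from a short-time Fourier transform; a minimal version can be written directly with numpy (the frame length, hop, and Hann window below are common defaults, assumed here for illustration):

```python
import numpy as np

def stft(signal, frame_len=512, hop=128):
    """Complex time-frequency bins via a Hann-windowed short-time FFT.
    Returns an array of shape (num_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)  # complex bins carry magnitude + phase

# A separation model would predict a mask per speaker over these bins
# and reconstruct each voice with the inverse transform.
audio = np.random.randn(16000)  # e.g. one second of audio at 16 kHz
bins = stft(audio)
```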
arXiv Detail & Related papers (2023-09-29T12:38:00Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Accommodating Audio Modality in CLIP for Multimodal Processing [48.83906067348211]
We extend the Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.
Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and other modalities.
Our proposed CLIP4VLA model is validated in different downstream tasks including video retrieval and video captioning.
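The inter-modal contrastive objective used in CLIP-style training is typically a symmetric InfoNCE loss over matched embedding pairs; a sketch under that assumption (the temperature value and batch layout are illustrative, not CLIP4VLA's exact settings):

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss between two batches of
    embeddings, e.g. audio vs. vision; matched rows are positive pairs."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)   # L2-normalize
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                     # (N, N) cosine similarities
    idx = np.arange(len(a))
    # cross-entropy pulling each row toward its matched column, both directions
    log_p_ab = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_ba = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return -(log_p_ab[idx, idx].mean() + log_p_ba[idx, idx].mean()) / 2
```

Minimizing this loss pulls each audio embedding toward the embedding of its paired frame or caption and pushes it away from all other items in the batch.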
arXiv Detail & Related papers (2023-03-12T06:57:01Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- End-to-end Multi-modal Video Temporal Grounding [105.36814858748285]
We propose a multi-modal framework to extract complementary information from videos.
We adopt RGB images for appearance, optical flow for motion, and depth maps for image structure.
We conduct experiments on the Charades-STA and ActivityNet Captions datasets, and show that the proposed method performs favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2021-07-12T17:58:10Z)
- Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead [88.17413955380262]
We introduce a novel architecture for early exiting based on the vision transformer architecture.
We show that our method works for both classification and regression problems.
We also introduce a novel method for integrating audio and visual modalities within early exits in audiovisual data analysis.
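Early exiting generally means attaching lightweight classification heads after intermediate stages and stopping as soon as a head is confident enough. A generic sketch of that control flow (the confidence threshold and stage/head interfaces are assumptions for illustration, not this paper's design):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def early_exit_predict(features, stages, heads, threshold=0.9):
    """Run backbone stages in order; after each, a lightweight head
    classifies. Stop as soon as confidence clears the threshold,
    saving the cost of the remaining stages."""
    h = features
    for stage, head in zip(stages, heads):
        h = stage(h)                 # one backbone stage
        probs = softmax(head(h))     # cheap intermediate classifier
        if probs.max() >= threshold:
            return probs.argmax(), probs.max()
    return probs.argmax(), probs.max()  # fall through to the final head
```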
arXiv Detail & Related papers (2021-05-19T13:30:34Z)
- Discriminative Multi-modality Speech Recognition [17.296404414250553]
Vision is often used as a complementary modality for automatic speech recognition (ASR).
In this paper, we propose a two-stage speech recognition model.
In the first stage, the target voice is separated from background noises with help from the corresponding visual information of lip movements, letting the model 'listen' clearly.
In the second stage, the audio modality is combined with the visual modality again through an MSR sub-network to better understand the speech, further improving the recognition rate.
arXiv Detail & Related papers (2020-05-12T07:56:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.