Audio-Visual Glance Network for Efficient Video Recognition
- URL: http://arxiv.org/abs/2308.09322v1
- Date: Fri, 18 Aug 2023 05:46:20 GMT
- Title: Audio-Visual Glance Network for Efficient Video Recognition
- Authors: Muhammad Adi Nugroho, Sangmin Woo, Sumin Lee, Changick Kim
- Abstract summary: We propose the Audio-Visual Glance Network (AVGN) to efficiently process the spatio-temporally important parts of a video.
We use an Audio-Visual Temporal Saliency Transformer (AV-TeST) that estimates the saliency scores of each frame.
We incorporate various training techniques and multi-modal feature fusion to enhance the robustness and effectiveness of our AVGN.
- Score: 17.95844876568496
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning has made significant strides in video understanding tasks, but
the computation required to classify lengthy and massive videos using
clip-level video classifiers remains impractical and prohibitively expensive.
To address this issue, we propose Audio-Visual Glance Network (AVGN), which
leverages the commonly available audio and visual modalities to efficiently
process the spatio-temporally important parts of a video. AVGN first divides
the video into snippets of image-audio clip pairs and employs lightweight
unimodal encoders to extract global visual features and audio features. To
identify the important temporal segments, we use an Audio-Visual Temporal
Saliency Transformer (AV-TeST) that estimates the saliency scores of each
frame. To further increase efficiency in the spatial dimension, AVGN processes
only the important patches instead of the whole images. We use an
Audio-Enhanced Spatial Patch Attention (AESPA) module to produce a set of
enhanced coarse visual features, which are fed to a policy network that
produces the coordinates of the important patches. This approach enables us to
focus only on the most important spatio-temporal parts of the video, leading
to more efficient video recognition. Moreover, we incorporate various training
techniques and multi-modal feature fusion to enhance the robustness and
effectiveness of our AVGN. By combining these strategies, our AVGN sets new
state-of-the-art performance in multiple video recognition benchmarks while
achieving faster processing speed.
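The glance-then-focus pipeline above lends itself to a compact sketch. The PyTorch code below is a minimal illustration of the temporal half of the idea, in the spirit of AV-TeST: a small transformer scores audio-visual snippets and only the top-k salient ones are kept for a heavy classifier. The module names, feature dimensions, and choice of k are placeholder assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class AVTeST(nn.Module):
    """Toy stand-in for the Audio-Visual Temporal Saliency Transformer:
    fuses per-snippet visual and audio features and emits one saliency
    score per snippet (all sizes here are illustrative assumptions)."""
    def __init__(self, vis_dim=512, aud_dim=128, d_model=256, n_heads=4):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.aud_proj = nn.Linear(aud_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.score_head = nn.Linear(d_model, 1)

    def forward(self, vis_feats, aud_feats):
        # vis_feats: (B, T, vis_dim), aud_feats: (B, T, aud_dim)
        tokens = self.vis_proj(vis_feats) + self.aud_proj(aud_feats)
        tokens = self.encoder(tokens)                # temporal self-attention
        return self.score_head(tokens).squeeze(-1)   # (B, T) saliency scores

def glance(frames, vis_feats, aud_feats, saliency_net, k_frames=4):
    """Keep only the k most salient snippets for the heavy classifier."""
    scores = saliency_net(vis_feats, aud_feats)       # (B, T)
    topk = scores.topk(k_frames, dim=1).indices       # (B, k)
    idx = topk[..., None, None, None].expand(-1, -1, *frames.shape[2:])
    return frames.gather(1, idx), scores              # (B, k, C, H, W)

# Toy usage: 16 snippets per video, keep the 4 most salient.
B, T = 2, 16
frames = torch.randn(B, T, 3, 224, 224)
vis_feats, aud_feats = torch.randn(B, T, 512), torch.randn(B, T, 128)
selected, scores = glance(frames, vis_feats, aud_feats, AVTeST())
print(selected.shape)  # torch.Size([2, 4, 3, 224, 224])
```

A full AVGN-style model would add the spatial stage, where AESPA-enhanced coarse features drive a policy network that picks patch coordinates from the surviving frames; that part is omitted here.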
Related papers
- Relevance-guided Audio Visual Fusion for Video Saliency Prediction [23.873134951154704]
We propose a novel relevance-guided audio-visual saliency prediction network dubbed AVRSP.
The Fusion module dynamically adjusts the retention of audio features based on the semantic relevance between audio and visual elements.
The Multi-scale feature Synergy (MS) module integrates visual features from different encoding stages, enhancing the network's ability to represent objects at various scales.
arXiv Detail & Related papers (2024-11-18T10:42:27Z)
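A minimal sketch of the relevance-guided retention idea from the paper above: a learned gate, conditioned on the paired audio and visual features, decides how much of the audio feature to keep before fusion. The gating design and dimensions are assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class RelevanceGatedFusion(nn.Module):
    """Sketch of relevance-guided fusion: a scalar gate per time step
    suppresses audio features that are semantically irrelevant to the
    visual content (dimensions are illustrative assumptions)."""
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, vis, aud):
        # vis, aud: (B, T, dim) temporally aligned features
        g = self.gate(torch.cat([vis, aud], dim=-1))   # (B, T, 1) relevance
        aud = g * aud                                  # retain relevant audio only
        return self.fuse(torch.cat([vis, aud], dim=-1))

fusion = RelevanceGatedFusion()
out = fusion(torch.randn(2, 8, 256), torch.randn(2, 8, 256))
print(out.shape)  # torch.Size([2, 8, 256])
```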
- Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition [29.414663568089292]
Audio-visual speech recognition aims to transcribe human speech using both audio and video modalities.
In this study, we strengthen the video features by learning three temporal dynamics in video data.
We achieve state-of-the-art performance on the LRS2 and LRS3 AVSR benchmarks in noise-dominant settings.
arXiv Detail & Related papers (2024-07-04T01:25:20Z)
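The cross-modal attention the summary above refers to can be illustrated generically: audio tokens query video tokens so that visual temporal dynamics reinforce a noisy audio stream. The block below is an illustrative sketch, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Audio tokens attend to video tokens, letting visual temporal
    dynamics compensate for corrupted audio (an illustrative sketch)."""
    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, video):
        # audio: (B, Ta, dim) queries; video: (B, Tv, dim) keys/values
        attended, _ = self.attn(audio, video, video)
        return self.norm(audio + attended)  # residual + layer norm

block = CrossModalAttention()
fused = block(torch.randn(2, 50, 256), torch.randn(2, 25, 256))
print(fused.shape)  # torch.Size([2, 50, 256])
```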
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation is a core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders technique transfer from academia to industry.
In this work, we aim to fill this gap with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This bidirectional interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
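A hedged sketch of the two ideas named above, bidirectional bridges and frame-wise synchrony guidance: each modality attends to the other in the same block, and a per-frame cosine term pulls time-aligned features together. This layout is a plausible reading of the summary, not the published model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalBridge(nn.Module):
    """Sketch of a bidirectional audio-visual bridge: audio attends to
    vision and vision attends to audio in one block, so neither modality
    dominates (sizes and layout are assumptions)."""
    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, aud, vis):
        aud_new = aud + self.a2v(aud, vis, vis)[0]  # audio enriched by vision
        vis_new = vis + self.v2a(vis, aud, aud)[0]  # vision enriched by audio
        return aud_new, vis_new

def framewise_sync_loss(aud, vis):
    """Frame-wise synchrony guidance: pull time-aligned audio and visual
    features together (one plausible reading of the summary)."""
    return 1 - F.cosine_similarity(aud, vis, dim=-1).mean()

bridge = BidirectionalBridge()
a, v = bridge(torch.randn(2, 10, 256), torch.randn(2, 10, 256))
print(framewise_sync_loss(a, v))
```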
- Multi-Resolution Audio-Visual Feature Fusion for Temporal Action Localization [8.633822294082943]
This paper introduces Multi-Resolution Audio-Visual Feature Fusion (MRAV-FF), a method for merging audio-visual features across different temporal resolutions.
arXiv Detail & Related papers (2023-10-05T10:54:33Z)
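One plausible realization of the multi-resolution fusion described above: audio is resampled to each temporal resolution of a visual feature pyramid and injected through a learned gate. The gating scheme, pyramid depth, and sizes are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionFusion(nn.Module):
    """Sketch of multi-resolution audio-visual fusion: audio is matched
    to every level of a visual temporal pyramid and merged with a gate
    (hierarchy depth and gating are illustrative assumptions)."""
    def __init__(self, dim=256, levels=3):
        super().__init__()
        self.gates = nn.ModuleList(
            nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
            for _ in range(levels))

    def forward(self, vis_pyramid, aud):
        # vis_pyramid: list of (B, T_l, dim); aud: (B, T, dim)
        fused = []
        for vis, gate in zip(vis_pyramid, self.gates):
            # Resample audio to this level's temporal resolution.
            a = F.interpolate(aud.transpose(1, 2), size=vis.size(1),
                              mode='linear', align_corners=False).transpose(1, 2)
            g = gate(torch.cat([vis, a], dim=-1))
            fused.append(vis + g * a)   # gated audio injection per level
        return fused

pyr = [torch.randn(2, t, 256) for t in (32, 16, 8)]
outs = MultiResolutionFusion()(pyr, torch.randn(2, 64, 256))
print([o.shape[1] for o in outs])  # [32, 16, 8]
```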
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
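The co-prediction strategy mentioned above can be sketched as two predictors that map each modality's representation of the same sound source onto the other's, trained symmetrically. The MSE objective and stop-gradient targets below are assumptions about the training signal, not the paper's exact loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoPrediction(nn.Module):
    """Sketch of self-supervised co-prediction: each modality's code is
    predicted from the other, and both directions are penalized
    (a guess at the training signal described in the summary)."""
    def __init__(self, dim=256):
        super().__init__()
        self.a2v = nn.Linear(dim, dim)   # predict visual code from audio
        self.v2a = nn.Linear(dim, dim)   # predict audio code from visual

    def forward(self, aud_repr, vis_repr):
        loss_v = F.mse_loss(self.a2v(aud_repr), vis_repr.detach())
        loss_a = F.mse_loss(self.v2a(vis_repr), aud_repr.detach())
        return loss_v + loss_a

loss = CoPrediction()(torch.randn(8, 256), torch.randn(8, 256))
loss.backward()
```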
- Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method to retrieve key frames, combining a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
TSDPC is a generic and powerful framework with two advantages over previous works, one being that it can determine the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance.
arXiv Detail & Related papers (2022-11-12T20:45:35Z)
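Density peaks clustering ranks points by local density (rho) and distance to the nearest denser point (delta); key frames are the outliers of rho times delta, which is what makes the frame count automatic. Below is a toy version over per-frame features; the temporal-segment handling that TSDPC adds is omitted, and the kernel and threshold are illustrative choices.

```python
import torch

def density_peaks_key_frames(feats):
    """Toy density-peaks key-frame selection over per-frame features
    (only the core rho/delta ranking plus an automatic threshold)."""
    dist = torch.cdist(feats, feats)                   # (T, T) pairwise
    d_c = dist.flatten().quantile(0.05)                # cutoff distance
    rho = torch.exp(-(dist / d_c) ** 2).sum(1) - 1     # soft local density
    higher = rho[None, :] > rho[:, None]               # j denser than i
    delta = dist.masked_fill(~higher, float('inf')).min(1).values
    delta[rho.argmax()] = dist.max()                   # densest point convention
    gamma = rho * delta                                # peak score
    thresh = gamma.mean() + 2 * gamma.std()            # automatic frame count
    return (gamma > thresh).nonzero(as_tuple=True)[0]

keys = density_peaks_key_frames(torch.randn(100, 64))
print(keys)  # indices of the automatically selected key frames
```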
- AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)
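A two-stream recurrent summarizer in the spirit of AVRN might look as follows: one GRU per modality, with a fused state scoring every frame for inclusion in the summary. The encoders and scoring head are assumptions based only on the summary above.

```python
import torch
import torch.nn as nn

class AVRNSketch(nn.Module):
    """Sketch of an audio-visual recurrent summarizer: two GRUs encode
    the modalities and a joint head scores each frame's importance
    (sizes are illustrative assumptions)."""
    def __init__(self, vis_dim=512, aud_dim=128, hidden=256):
        super().__init__()
        self.vis_rnn = nn.GRU(vis_dim, hidden, batch_first=True)
        self.aud_rnn = nn.GRU(aud_dim, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, vis, aud):
        v, _ = self.vis_rnn(vis)                     # (B, T, hidden)
        a, _ = self.aud_rnn(aud)                     # (B, T, hidden)
        return self.head(torch.cat([v, a], -1)).squeeze(-1)  # (B, T)

scores = AVRNSketch()(torch.randn(2, 30, 512), torch.randn(2, 30, 128))
print(scores.shape)  # torch.Size([2, 30]) frame importance scores
```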
- Localizing Visual Sounds the Hard Way [149.84890978170174]
We train the network to explicitly discriminate challenging image fragments, even for images that do contain the object emitting the sound.
We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset.
We introduce the VGG-Sound Source (VGG-SS) benchmark, a new set of annotations for the recently-introduced VGG-Sound dataset.
arXiv Detail & Related papers (2021-04-06T17:38:18Z)
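The "hard way" training signal above can be sketched as mining hard negatives inside the sounding image itself: low audio-visual-similarity regions of an image that does contain the source are pushed down while high-similarity regions are pulled up. The thresholds and softplus objective below are illustrative, not the paper's exact (differentiable) formulation.

```python
import torch
import torch.nn.functional as F

def hard_way_localization_loss(sim, pos_thresh=0.65, neg_thresh=0.4):
    """Sketch of in-image hard-negative mining for sound localization:
    high-similarity regions act as positives, low-similarity regions of
    the SAME image as hard negatives (thresholds are assumptions)."""
    # sim: (B, H, W) audio-to-pixel similarity map per image-audio pair
    pos_mask = (sim > pos_thresh).float()
    neg_mask = (sim < neg_thresh).float()            # hard negatives in-image
    pos = (sim * pos_mask).sum((1, 2)) / pos_mask.sum((1, 2)).clamp(min=1)
    neg = (sim * neg_mask).sum((1, 2)) / neg_mask.sum((1, 2)).clamp(min=1)
    # Contrastive objective: positives scored high, in-image negatives low.
    return F.softplus(neg - pos).mean()

loss = hard_way_localization_loss(torch.rand(4, 14, 14, requires_grad=True))
loss.backward()
```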
This list is automatically generated from the titles and abstracts of the papers in this site.