AudioVisual Video Summarization
- URL: http://arxiv.org/abs/2105.07667v1
- Date: Mon, 17 May 2021 08:36:10 GMT
- Title: AudioVisual Video Summarization
- Authors: Bin Zhao, Maoguo Gong, Xuelong Li
- Abstract summary: In video summarization, existing approaches just exploit the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
- Score: 103.47766795086206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio and vision are the two main modalities in video data. Multimodal learning, especially audiovisual learning, has drawn considerable attention recently and can boost the performance of various computer vision tasks. However, in video summarization, existing approaches exploit only the visual information while neglecting the audio information. In this paper, we argue that the audio modality can assist the vision modality in better understanding the video content and structure, and can further benefit the summarization process. Motivated by this, we propose to jointly exploit the audio and visual information for the video summarization task and develop an AudioVisual Recurrent Network (AVRN) to achieve this. Specifically, the proposed AVRN can be separated into three parts: 1) a two-stream LSTM encodes the audio and visual features sequentially, capturing their temporal dependency; 2) an audiovisual fusion LSTM fuses the two modalities by exploring the latent consistency between them; 3) a self-attention video encoder captures the global dependency in the video. Finally, the fused audiovisual information and the integrated temporal and global dependencies are jointly used to predict the video summary. Experimental results on two benchmarks, i.e., SumMe and TVSum, demonstrate the effectiveness of each part and the superiority of AVRN over approaches that exploit only visual information for video summarization.
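The three-part design described in the abstract maps naturally onto a small recurrent model. Below is a minimal PyTorch sketch of that structure; the module names, feature dimensions, and the way the local and global representations are combined for scoring are illustrative assumptions, not the authors' released implementation.
```python
# Minimal PyTorch sketch of the AVRN design described above.
# Module names, feature dimensions, and the scoring head are illustrative
# assumptions -- this is not the authors' implementation.
import torch
import torch.nn as nn

class AVRNSketch(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=1024, hidden=256, heads=4):
        super().__init__()
        # 1) Two-stream LSTMs encode audio and visual features sequentially,
        #    capturing the temporal dependency within each modality.
        self.audio_lstm = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.visual_lstm = nn.LSTM(visual_dim, hidden, batch_first=True)
        # 2) A fusion LSTM models the latent consistency between modalities
        #    from the concatenated per-frame encodings.
        self.fusion_lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)
        # 3) A self-attention encoder captures global dependency across frames.
        self.self_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        # Frame-level importance scores are predicted from the fused local
        # (temporal) and global representations.
        self.scorer = nn.Linear(2 * hidden, 1)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, T, audio_dim), visual_feats: (B, T, visual_dim)
        a, _ = self.audio_lstm(audio_feats)
        v, _ = self.visual_lstm(visual_feats)
        fused, _ = self.fusion_lstm(torch.cat([a, v], dim=-1))  # temporal dependency
        glob, _ = self.self_attn(fused, fused, fused)            # global dependency
        scores = torch.sigmoid(self.scorer(torch.cat([fused, glob], dim=-1)))
        return scores.squeeze(-1)                                # (B, T) frame scores

# Example: score two 120-frame clips with random features.
model = AVRNSketch()
scores = model(torch.randn(2, 120, 128), torch.randn(2, 120, 1024))
```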
Related papers
- From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation [17.95017332858846]
We introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation.
VAB uses a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively.
Our experiments showcase the efficiency of VAB in producing high-quality audio from video, and its capability to acquire semantic audio-visual features.
arXiv Detail & Related papers (2024-09-27T20:26:34Z)
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Role of Audio in Audio-Visual Video Summarization [8.785359786012302]
We propose a new audio-visual video summarization framework integrating four ways of audio-visual information fusion with GRU-based and attention-based networks.
Experimental evaluations on the TVSum dataset show F1-score and Kendall-tau improvements for audio-visual video summarization.
arXiv Detail & Related papers (2022-12-02T09:11:49Z)
- Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos [10.478479158063982]
We propose a novel framework to effectively leverage the audio modality in vision-specific annotated videos for action recognition.
We build a semantic audio-video label dictionary (SAVLD) that maps each video label to its K most relevant audio labels.
We also present a new two-stream video Transformer for efficiently modeling the visual modalities.
arXiv Detail & Related papers (2022-03-06T17:31:06Z)
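As a rough illustration of the SAVLD idea from the entry above, the sketch below builds a dictionary mapping each video label to its K most relevant audio labels; the use of cosine similarity over pre-computed label embeddings is an assumption for illustration, not the paper's actual relevance measure.
```python
# Hypothetical sketch of a semantic audio-video label dictionary (SAVLD):
# each video label is mapped to its K most relevant audio labels.  The
# relevance measure used here (cosine similarity of pre-computed label
# embeddings) is an illustrative assumption, not the paper's method.
import numpy as np

def build_savld(video_label_emb, audio_label_emb, k=3):
    """video_label_emb / audio_label_emb: dicts mapping label -> embedding vector."""
    savld = {}
    for v_label, v_vec in video_label_emb.items():
        sims = {
            a_label: float(np.dot(v_vec, a_vec) /
                           (np.linalg.norm(v_vec) * np.linalg.norm(a_vec)))
            for a_label, a_vec in audio_label_emb.items()
        }
        # Keep the K audio labels most similar to this video label.
        savld[v_label] = sorted(sims, key=sims.get, reverse=True)[:k]
    return savld

# Toy usage with random embeddings for three video labels and five audio labels.
rng = np.random.default_rng(0)
video_labels = {l: rng.normal(size=32) for l in ["cooking", "surfing", "concert"]}
audio_labels = {l: rng.normal(size=32) for l in ["sizzling", "waves", "music", "speech", "crowd"]}
print(build_savld(video_labels, audio_labels, k=2))
```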
- Bio-Inspired Audio-Visual Cues Integration for Visual Attention Prediction [15.679379904130908]
Visual Attention Prediction (VAP) methods simulate the human selective attention mechanism to perceive the scene.
A bio-inspired audio-visual cues integration method is proposed for the VAP task, which explores the audio modality to better predict the visual attention map.
Experiments are conducted on six challenging audiovisual eye-tracking datasets, including DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD.
arXiv Detail & Related papers (2021-09-17T06:49:43Z)
- Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study where we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry most of the information, and that including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z)
- Audiovisual SlowFast Networks for Video Recognition [140.08143162600354]
We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception.
We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts.
We report results on six video action classification and detection datasets, perform detailed ablation studies, and show the generalization of AVSlowFast to learn self-supervised audiovisual features.
arXiv Detail & Related papers (2020-01-23T18:59:46Z)
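As a loose illustration of the multi-layer fusion idea in the AVSlowFast entry above, the sketch below projects audio features into the visual stream at every stage; the layer sizes and the additive lateral connections are assumptions for illustration, not the published architecture.
```python
# Minimal sketch of multi-layer audio-visual fusion: audio features are
# projected and added to the visual stream at several stages ("lateral
# connections").  Dimensions and additive fusion are illustrative assumptions.
import torch
import torch.nn as nn

class MultiLayerAVFusion(nn.Module):
    """Visual stream with audio fused in at every stage via lateral projections."""
    def __init__(self, visual_in=1024, audio_in=40, stage_dims=(512, 256, 128)):
        super().__init__()
        dims = [visual_in] + list(stage_dims)
        self.visual_stages = nn.ModuleList(
            nn.Sequential(nn.Linear(dims[i], dims[i + 1]), nn.ReLU())
            for i in range(len(stage_dims))
        )
        # One lateral projection per stage maps audio features into that
        # stage's visual feature space before they are added.
        self.audio_laterals = nn.ModuleList(nn.Linear(audio_in, d) for d in stage_dims)

    def forward(self, visual, audio):
        # visual: (B, visual_in), audio: (B, audio_in)
        x = visual
        for stage, lateral in zip(self.visual_stages, self.audio_laterals):
            x = stage(x) + lateral(audio)  # audio contributes at every level
        return x

# Toy usage: a batch of 4 clips with random per-clip features.
out = MultiLayerAVFusion()(torch.randn(4, 1024), torch.randn(4, 40))
```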