Role of Audio in Audio-Visual Video Summarization
- URL: http://arxiv.org/abs/2212.01040v1
- Date: Fri, 2 Dec 2022 09:11:49 GMT
- Title: Role of Audio in Audio-Visual Video Summarization
- Authors: Ibrahim Shoer, Berkay Kopru, Engin Erzin
- Abstract summary: We propose a new audio-visual video summarization framework integrating four ways of audio-visual information fusion with GRU-based and attention-based networks.
Experimental evaluations on the TVSum dataset yield F1-score and Kendall-tau improvements for audio-visual video summarization.
- Score: 8.785359786012302
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video summarization attracts attention for efficient video representation,
retrieval, and browsing, easing video volume and traffic surge problems. Although
video summarization mostly uses the visual channel for compaction, the benefits
of audio-visual modeling have appeared in the recent literature. The information
carried by the audio channel can result from audio-visual correlation in the
video content. In this study, we propose a new audio-visual video summarization
framework integrating four ways of audio-visual information fusion with
GRU-based and attention-based networks. Furthermore, we investigate a new
explainability methodology using audio-visual canonical correlation analysis
(CCA) to better understand and explain the role of audio in the video
summarization task. Experimental evaluations on the TVSum dataset yield F1-score
and Kendall-tau improvements for audio-visual video summarization. Furthermore,
splitting the TVSum and COGNIMUSE videos into positively and negatively
correlated sets based on audio-visual CCA yields strong performance improvements
on the positively correlated videos for both audio-only and audio-visual video
summarization.
Related papers
- Relevance-guided Audio Visual Fusion for Video Saliency Prediction [23.873134951154704]
We propose a novel relevance-guided audio-visual saliency prediction network.
The Fusion module dynamically adjusts the retention of audio features based on the semantic relevance between audio and visual elements (a generic gating sketch appears after this list).
The Multi-scale feature Synergy (MS) module integrates visual features from different encoding stages, enhancing the network's ability to represent objects at various scales.
arXiv Detail & Related papers (2024-11-18T10:42:27Z)
- Audio-visual training for improved grounding in video-text LLMs [1.9320359360360702]
We propose a model architecture that handles audio-visual inputs explicitly.
We train our model with both audio and visual data from a video instruction-tuning dataset.
For better evaluation of audio-visual models, we also release a human-annotated benchmark dataset.
arXiv Detail & Related papers (2024-07-21T03:59:14Z)
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Video-Guided Curriculum Learning for Spoken Video Grounding [65.49979202728167]
We introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions.
To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) strategy.
In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet.
arXiv Detail & Related papers (2022-09-01T07:47:01Z)
- AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information and neglect the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)
- Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study where we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry most of the information, and that including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z)
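For the relevance-guided fusion idea in the first related paper above, a generic sketch of gated audio retention follows. It is not that paper's implementation; the gate form, class name, and dimensions are assumptions.

```python
# Generic sketch of relevance-gated audio retention before fusion.
# Not the cited paper's module: gate design and sizes are assumptions.
import torch
import torch.nn as nn

class RelevanceGatedFusion(nn.Module):
    """Scales audio features by a learned audio-visual relevance gate."""

    def __init__(self, visual_dim=512, audio_dim=128):
        super().__init__()
        # Predict a per-frame audio retention weight in [0, 1].
        self.gate = nn.Sequential(
            nn.Linear(visual_dim + audio_dim, 1), nn.Sigmoid())

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (batch, frames, visual_dim)
        # audio_feats:  (batch, frames, audio_dim)
        relevance = self.gate(torch.cat([visual_feats, audio_feats], dim=-1))
        # Retain audio in proportion to its estimated relevance, then fuse.
        return torch.cat([visual_feats, relevance * audio_feats], dim=-1)
```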
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.