A Comprehensive Survey on Video Saliency Detection with Auditory
Information: the Audio-visual Consistency Perceptual is the Key!
- URL: http://arxiv.org/abs/2206.13390v1
- Date: Mon, 20 Jun 2022 07:25:13 GMT
- Title: A Comprehensive Survey on Video Saliency Detection with Auditory
Information: the Audio-visual Consistency Perceptual is the Key!
- Authors: Chenglizhao Chen and Mengke Song and Wenfeng Song and Li Guo and Muwei
Jian
- Abstract summary: Video saliency detection (VSD) aims to quickly locate the most attractive objects/things/patterns in a given video clip.
This paper provides an extensive review to bridge the gap between audio-visual fusion and saliency detection.
- Score: 25.436683033432086
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video saliency detection (VSD) aims to quickly locate the most
attractive objects/things/patterns in a given video clip. Existing
VSD-related works have mainly relied on the visual system and paid less
attention to the audio aspect, even though, in practice, the auditory
system is the most vital complement to the visual system. Moreover,
audio-visual saliency detection (AVSD), one of the most representative
research topics for mimicking human perceptual mechanisms, is currently in
its infancy, and none of the existing survey papers have touched on it,
especially from the perspective of saliency detection. Thus, the ultimate
goal of this paper is to provide an extensive review that bridges the gap
between audio-visual fusion and saliency detection. As another highlight
of this review, we provide a deep insight into the key factors that
directly determine the performance of AVSD deep models, and we claim that
the audio-visual consistency degree (AVC), a long-overlooked issue,
directly influences how effectively audio can benefit its visual
counterpart when performing saliency detection. Moreover, to make the AVC
issue more practical and valuable for future researchers, we have newly
equipped almost all existing publicly available AVSD datasets with
additional frame-wise AVC labels. Based on these upgraded datasets, we
have conducted extensive quantitative evaluations to ground our claim on
the importance of AVC in the AVSD task. In short, both our ideas and the
new datasets serve as a convenient platform of preliminaries and
guidelines, with great potential to help future work push
state-of-the-art (SOTA) performance further.
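To make the AVC idea concrete, the following is a minimal, hypothetical
PyTorch sketch (not from the paper) of how a frame-wise AVC score in
[0, 1] could gate the audio contribution during audio-visual fusion:
frames with low consistency fall back to the visual stream alone. All
module names, tensor shapes, and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AVCGatedFusion(nn.Module):
    """Hypothetical fusion module gated by a frame-wise audio-visual
    consistency (AVC) score in [0, 1]; all sizes are illustrative."""

    def __init__(self, vis_dim: int = 256, aud_dim: int = 128):
        super().__init__()
        # Project audio features into the visual feature space.
        self.aud_proj = nn.Linear(aud_dim, vis_dim)

    def forward(self, vis_feat, aud_feat, avc):
        # vis_feat: (B, T, vis_dim)  per-frame visual features
        # aud_feat: (B, T, aud_dim)  per-frame audio features
        # avc:      (B, T)           frame-wise AVC scores in [0, 1]
        gate = avc.unsqueeze(-1)          # (B, T, 1), broadcastable
        aud = self.aud_proj(aud_feat)     # (B, T, vis_dim)
        fused = 0.5 * (vis_feat + aud)    # naive audio-visual blend
        # Convex combination: low AVC suppresses the audio contribution,
        # so inconsistent frames rely on visual features alone.
        return (1.0 - gate) * vis_feat + gate * fused


if __name__ == "__main__":
    fuse = AVCGatedFusion()
    v = torch.randn(2, 8, 256)    # 2 clips, 8 frames each
    a = torch.randn(2, 8, 128)
    c = torch.rand(2, 8)          # mock frame-wise AVC labels
    print(fuse(v, a, c).shape)    # torch.Size([2, 8, 256])
```

Under such a gating scheme, the frame-wise AVC labels released with the
upgraded datasets could supervise `avc` directly or train a predictor for
it; either way, the consistency degree, rather than the mere presence of
audio, decides how much the audio stream contributes.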
Related papers
- Unveiling Visual Biases in Audio-Visual Localization Benchmarks [52.76903182540441]
We identify a significant issue in existing benchmarks.
The sounding objects are often easily recognized based solely on visual cues, which we refer to as visual bias.
Our findings suggest that existing AVSL benchmarks need further refinement to facilitate audio-visual learning.
arXiv Detail & Related papers (2024-08-25T04:56:08Z)
- How Does Audio Influence Visual Attention in Omnidirectional Videos? Database and Model [50.15552768350462]
This paper comprehensively investigates audio-visual attention in omnidirectional videos (ODVs) from both subjective and objective perspectives.
To advance the research on audio-visual saliency prediction for ODVs, we establish a new benchmark based on the AVS-ODV database.
arXiv Detail & Related papers (2024-08-10T02:45:46Z)
- AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models [92.92233932921741]
We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations.
We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks.
We show that representations may be improved with intermediate-task fine-tuning, and that audio event classification on AudioSet serves as a strong intermediate task.
arXiv Detail & Related papers (2023-09-19T17:35:16Z)
- Pay Self-Attention to Audio-Visual Navigation [24.18976027602831]
We propose an end-to-end framework to learn chasing after a moving audio target using a context-aware audio-visual fusion strategy.
Our thorough experiments validate the superior performance of FSAAVN in comparison with state-of-the-art methods.
arXiv Detail & Related papers (2022-10-04T03:42:36Z)
- AVA-AVD: Audio-visual Speaker Diarization in the Wild [26.97787596025907]
Existing audio-visual diarization datasets are mainly focused on indoor environments like meeting rooms or news studios.
We propose a novel Audio-Visual Relation Network (AVR-Net) which introduces an effective modality mask to capture discriminative information based on visibility.
arXiv Detail & Related papers (2021-11-29T11:02:41Z)
- APES: Audiovisual Person Search in Untrimmed Video [87.4124877066541]
We present the Audiovisual Person Search dataset (APES).
APES contains over 1.9K identities labeled along 36 hours of video.
A key property of APES is that it includes dense temporal annotations that link faces to speech segments of the same identity.
arXiv Detail & Related papers (2021-06-03T08:16:42Z)
- AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)
- Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study in which we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry most information, and including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z)