CASP-Net: Rethinking Video Saliency Prediction from an
Audio-VisualConsistency Perceptual Perspective
- URL: http://arxiv.org/abs/2303.06357v1
- Date: Sat, 11 Mar 2023 09:29:57 GMT
- Title: CASP-Net: Rethinking Video Saliency Prediction from an
Audio-VisualConsistency Perceptual Perspective
- Authors: Junwen Xiong, Ganglai Wang, Peng Zhang, Wei Huang, Yufei Zha, Guangtao
Zhai
- Abstract summary: Video Saliency Prediction (VSP) imitates the selective attention mechanism of human brain.
Most VSP methods exploit semantic correlation between vision and audio modalities but ignore the negative effects due to the temporal inconsistency of audio-visual intrinsics.
Inspired by the biological inconsistency-correction within multi-sensory information, a consistency-aware audio-visual saliency prediction network (CASP-Net) is proposed.
- Score: 30.995357472421404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Incorporating the audio stream enables Video Saliency Prediction (VSP) to
imitate the selective attention mechanism of human brain. By focusing on the
benefits of joint auditory and visual information, most VSP methods are capable
of exploiting semantic correlation between vision and audio modalities but
ignoring the negative effects due to the temporal inconsistency of audio-visual
intrinsics. Inspired by the biological inconsistency-correction within
multi-sensory information, in this study, a consistency-aware audio-visual
saliency prediction network (CASP-Net) is proposed, which takes a comprehensive
consideration of the audio-visual semantic interaction and consistent
perception. In addition a two-stream encoder for elegant association between
video frames and corresponding sound source, a novel consistency-aware
predictive coding is also designed to improve the consistency within audio and
visual representations iteratively. To further aggregate the multi-scale
audio-visual information, a saliency decoder is introduced for the final
saliency map generation. Substantial experiments demonstrate that the proposed
CASP-Net outperforms the other state-of-the-art methods on six challenging
audio-visual eye-tracking datasets. For a demo of our system please see our
project webpage.
Related papers
- Relevance-guided Audio Visual Fusion for Video Saliency Prediction [23.873134951154704]
We propose a novel relevance-guided audio-visual saliency prediction network dubbedSP.
The Fusion module dynamically adjusts the retention of audio features based on the semantic relevance between audio and visual elements.
The Multi-scale feature Synergy (MS) module integrates visual features from different encoding stages, enhancing the network's ability to represent objects at various scales.
arXiv Detail & Related papers (2024-11-18T10:42:27Z) - Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Visually-Guided Sound Source Separation with Audio-Visual Predictive
Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in parameter harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z) - Audio-Visual Contrastive Learning with Temporal Self-Supervision [84.11385346896412]
We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision.
To leverage the temporal and aural dimension inherent to videos, our method extends temporal self-supervision to the audio-visual setting.
arXiv Detail & Related papers (2023-02-15T15:00:55Z) - An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits [22.558134249701794]
We propose a novel cortico-thalamo-cortical neural network (CTCNet) for audio-visual speech separation (AVSS)
CTCNet learns hierarchical auditory and visual representations in a bottom-up manner in separate auditory and visualworks.
Experiments on three speech separation benchmark datasets show that CTCNet remarkably outperforms existing AVSS methods with considerably fewer parameters.
arXiv Detail & Related papers (2022-12-21T03:28:30Z) - Joint Learning of Visual-Audio Saliency Prediction and Sound Source
Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z) - AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches just exploit the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z) - Learning Audio-Visual Correlations from Variational Cross-Modal
Generation [35.07257471319274]
We learn the audio-visual correlations from the perspective of cross-modal generation in a self-supervised manner.
The learned correlations can be readily applied in multiple downstream tasks such as the audio-visual cross-modal localization and retrieval.
arXiv Detail & Related papers (2021-02-05T21:27:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.