Weakly Supervised Visual-Auditory Saliency Detection with
Multigranularity Perception
- URL: http://arxiv.org/abs/2112.13697v1
- Date: Mon, 27 Dec 2021 14:13:30 GMT
- Title: Weakly Supervised Visual-Auditory Saliency Detection with
Multigranularity Perception
- Authors: Guotao Wang, Chenglizhao Chen, Dengping Fan, Aimin Hao, and Hong Qin
- Abstract summary: Deep learning-based visual-audio fixation prediction is still in its infancy.
It would be neither efficient nor necessary to recollect real fixations under the same visual-audio circumstances.
This paper promotes a novel weakly supervised approach to alleviate the demand for large-scale training sets in visual-audio model training.
- Score: 46.84865384147999
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Thanks to the rapid advances in deep learning techniques and the wide
availability of large-scale training sets, the performance of video saliency
detection models has been improving steadily and significantly. However, deep
learning-based visual-audio fixation prediction is still in its infancy. At
present, only a few visual-audio sequences have been furnished, with real
fixations being recorded in real visual-audio environments. Hence, it would be
neither efficient nor necessary to recollect real fixations under the same
visual-audio circumstances. To address this problem, this paper promotes a
novel approach in a weakly supervised manner to alleviate the demand for
large-scale training sets for visual-audio model training. By using only the
video category tags, we propose the selective class activation mapping (SCAM)
and its upgrade (SCAM+). In the spatial-temporal-audio circumstance, the former
follows a coarse-to-fine strategy to select the most discriminative regions,
and these regions are usually capable of exhibiting high consistency with the
real human-eye fixations. The latter equips the SCAM with an additional
multi-granularity perception mechanism, making the whole process more
consistent with that of the real human visual system. Moreover, we distill
knowledge from these regions to obtain complete new spatial-temporal-audio
(STA) fixation prediction (FP) networks, enabling broad applications in cases
where video tags are not available. Without resorting to any real human-eye
fixation, the performances of these STA FP networks are comparable to those of
fully supervised networks. The code and results are publicly available at
https://github.com/guotaowang/STANet.
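To make the tag-only supervision above concrete, the following is a minimal, generic class activation mapping (CAM) sketch in PyTorch: a classifier trained only on video category tags is probed for the image regions that drive its prediction. This illustrates the standard CAM mechanism that selective class activation mapping builds on, not the paper's actual SCAM/SCAM+ pipeline; the `TagClassifier` name, the ResNet-18 backbone, and all shapes are illustrative assumptions.

```python
# Minimal, generic CAM sketch (PyTorch). Illustrative only; this is NOT the
# paper's SCAM/SCAM+ method.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18


class TagClassifier(nn.Module):
    """Frame-level classifier trained only with video category tags (assumed setup)."""

    def __init__(self, num_classes: int):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep everything up to (and including) the last conv block.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.fc = nn.Linear(512, num_classes)          # 512 = ResNet-18 feature dim

    def forward(self, frames: torch.Tensor):
        fmap = self.features(frames)                   # (B, 512, H', W')
        logits = self.fc(fmap.mean(dim=(2, 3)))        # global average pooling
        return logits, fmap


def class_activation_map(model: TagClassifier, frames: torch.Tensor) -> torch.Tensor:
    """CAM for the predicted class, upsampled to the input resolution."""
    logits, fmap = model(frames)
    cls = logits.argmax(dim=1)                         # (B,) predicted category tag
    weights = model.fc.weight[cls]                     # (B, 512) classifier weights
    cam = torch.einsum("bc,bchw->bhw", weights, fmap)  # weighted sum of feature maps
    cam = F.relu(cam)
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)   # normalise to [0, 1]
    return F.interpolate(cam.unsqueeze(1), size=frames.shape[-2:],
                         mode="bilinear", align_corners=False).squeeze(1)
```

Roughly speaking, maps of this kind would then be filtered so that only the most discriminative spatial-temporal-audio regions remain, and those regions serve as pseudo targets when distilling the STA fixation prediction networks; the exact coarse-to-fine selection is specific to the paper and not reproduced here.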
Related papers
- CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization [11.525177542345215]
We introduce CACE-Net, which differs from most existing methods that solely use audio signals to guide visual information.
We propose an audio-visual co-guidance attention mechanism that allows for adaptive bi-directional cross-modal attentional guidance.
Experiments on the AVE dataset demonstrate that CACE-Net sets a new benchmark in the audio-visual event localization task.
arXiv Detail & Related papers (2024-08-04T07:48:12Z)
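As an illustration of the bi-directional co-guidance idea described above, here is a generic cross-modal attention sketch in PyTorch in which each modality attends to the other. The module name, feature dimensions, and the residual/LayerNorm layout are assumptions and do not reproduce CACE-Net's actual attention design.

```python
# Generic bi-directional audio-visual cross-attention sketch (PyTorch).
import torch
import torch.nn as nn


class BiDirectionalCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor):
        # visual: (B, Tv, dim) video segment features; audio: (B, Ta, dim).
        # Audio-guided visual features: visual tokens query the audio stream.
        v_ctx, _ = self.a2v(query=visual, key=audio, value=audio)
        # Visual-guided audio features: audio tokens query the visual stream.
        a_ctx, _ = self.v2a(query=audio, key=visual, value=visual)
        return self.norm_v(visual + v_ctx), self.norm_a(audio + a_ctx)


if __name__ == "__main__":
    fuse = BiDirectionalCrossAttention()
    v = torch.randn(2, 10, 256)   # 10 one-second video segments (assumed shapes)
    a = torch.randn(2, 10, 256)   # matching audio segments
    v_out, a_out = fuse(v, a)
    print(v_out.shape, a_out.shape)   # torch.Size([2, 10, 256]) twice
```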
- Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z)
- Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning [7.908887001497406]
We propose a novel model with cross-modal perception for unsupervised highlight detection.
The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task.
The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-14T13:52:03Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Leveraging Visual Supervision for Array-based Active Speaker Detection and Localization [3.836171323110284]
We show that a simple audio convolutional recurrent neural network can perform simultaneous horizontal active speaker detection and localization.
We propose a new self-supervised training pipeline that embraces a "student-teacher" learning approach.
arXiv Detail & Related papers (2023-12-21T16:53:04Z)
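The "student-teacher" pipeline above can be pictured with a minimal sketch: a frozen vision-based teacher produces per-frame speaker-activity pseudo-labels that supervise an audio-only student, so no human annotations are needed. The model interfaces, shapes, and the binary cross-entropy objective below are illustrative assumptions rather than the paper's exact setup.

```python
# Generic student-teacher training step (PyTorch); illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F


def train_step(student: nn.Module, teacher: nn.Module,
               optimizer: torch.optim.Optimizer,
               audio: torch.Tensor, video: torch.Tensor) -> float:
    """One weakly/self-supervised step: no human labels are used."""
    teacher.eval()
    with torch.no_grad():
        # Assumed interface: the teacher sees video and emits per-frame
        # speaker-activity logits of shape (B, T).
        pseudo = torch.sigmoid(teacher(video))
    logits = student(audio)                             # (B, T) from audio alone
    loss = F.binary_cross_entropy_with_logits(logits, pseudo)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```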
- Multi-Resolution Audio-Visual Feature Fusion for Temporal Action Localization [8.633822294082943]
This paper introduces Multi-Resolution Audio-Visual Feature Fusion (MRAV-FF), an innovative method for merging audio-visual data across different temporal resolutions.
arXiv Detail & Related papers (2023-10-05T10:54:33Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models [67.31684040281465]
We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification.
In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram.
arXiv Detail & Related papers (2022-07-15T17:59:11Z)
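To illustrate the MOV recipe above, the sketch below embeds RGB frames, optical flow, and audio spectrograms (all rendered as image-like tensors) with one frozen vision encoder and scores them against text embeddings of arbitrary class names. Here `vision_encoder` and `text_encoder` are assumed stand-ins for a pre-trained CLIP-style VLM, not a specific library API, and the simple feature averaging across modalities is an assumption.

```python
# Minimal open-vocabulary classification sketch in the spirit of MOV.
# `vision_encoder` / `text_encoder` are assumed placeholders for a frozen VLM.
import torch
import torch.nn.functional as F


@torch.no_grad()
def open_vocab_scores(vision_encoder, text_encoder,
                      rgb, flow, spec, class_prompts):
    # rgb / flow / spec: (B, 3, H, W) image-like tensors, one per modality.
    feats = [F.normalize(vision_encoder(x), dim=-1) for x in (rgb, flow, spec)]
    video_feat = F.normalize(torch.stack(feats).mean(dim=0), dim=-1)  # fuse modalities
    text_feat = F.normalize(text_encoder(class_prompts), dim=-1)      # (C, D) class embeddings
    return video_feat @ text_feat.t()                                  # (B, C) similarity scores
```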
- Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead [88.17413955380262]
We introduce a novel early-exiting architecture based on the vision transformer.
We show that our method works for both classification and regression problems.
We also introduce a novel method for integrating audio and visual modalities within early exits in audiovisual data analysis.
arXiv Detail & Related papers (2021-05-19T13:30:34Z)
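For the early-exit idea above, here is a generic sketch: a lightweight head is attached to an intermediate stage and inference stops early when that head is confident enough. The two-stage split, the linear heads, and the batch-level confidence test are illustrative assumptions; the paper's single-layer vision-transformer exit branches are not reproduced.

```python
# Generic early-exit sketch (PyTorch); illustrative assumptions only.
import torch
import torch.nn as nn


class EarlyExitNet(nn.Module):
    def __init__(self, stage1: nn.Module, stage2: nn.Module,
                 dim: int, num_classes: int, threshold: float = 0.9):
        super().__init__()
        self.stage1, self.stage2 = stage1, stage2       # assumed: each maps to (B, dim)
        self.exit_head = nn.Linear(dim, num_classes)    # cheap intermediate head
        self.final_head = nn.Linear(dim, num_classes)
        self.threshold = threshold

    def forward(self, x: torch.Tensor):
        h = self.stage1(x)                              # (B, dim) intermediate features
        early = self.exit_head(h)
        if not self.training:
            conf, _ = early.softmax(dim=-1).max(dim=-1)
            if bool((conf > self.threshold).all()):     # exit only if the whole batch is confident
                return early
        return self.final_head(self.stage2(h))
```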