STAViS: Spatio-Temporal AudioVisual Saliency Network
- URL: http://arxiv.org/abs/2001.03063v2
- Date: Sun, 14 Jun 2020 18:45:08 GMT
- Title: STAViS: Spatio-Temporal AudioVisual Saliency Network
- Authors: Antigoni Tsiami, Petros Koutras and Petros Maragos
- Abstract summary: STAViS is a network that combines visual saliency and auditory features.
It learns to appropriately localize sound sources and to fuse the two saliencies in order to obtain a final saliency map.
We compare our method against 8 different state-of-the-art visual saliency models.
- Score: 45.04894808904767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce STAViS, a spatio-temporal audiovisual saliency network that
combines spatio-temporal visual and auditory information in order to
efficiently address the problem of saliency estimation in videos. Our approach
employs a single network that combines visual saliency and auditory features
and learns to appropriately localize sound sources and to fuse the two
saliencies in order to obtain a final saliency map. The network has been
designed, trained end-to-end, and evaluated on six different databases that
contain audiovisual eye-tracking data of a large variety of videos. We compare
our method against 8 different state-of-the-art visual saliency models.
Evaluation results across databases indicate that our STAViS model outperforms
our visual only variant as well as the other state-of-the-art models in the
majority of cases. Also, the consistently good performance it achieves for all
databases indicates that it is appropriate for estimating saliency
"in-the-wild". The code is available at https://github.com/atsiami/STAViS.
Related papers
- Towards Open-Vocabulary Audio-Visual Event Localization [59.23161248808759]
We introduce the Open-Vocabulary Audio-Visual Event localization problem.
This problem requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference.
We propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes.
arXiv Detail & Related papers (2024-11-18T04:35:20Z) - AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models [92.92233932921741]
We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations.
We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks.
We show that representations can be improved with intermediate-task fine-tuning, and that audio event classification on AudioSet serves as a strong intermediate task.
arXiv Detail & Related papers (2023-09-19T17:35:16Z) - Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z) - Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
arXiv Detail & Related papers (2023-05-23T07:54:37Z) - Audio-visual Generalised Zero-shot Learning with Cross-modal Attention
and Language [38.02396786726476]
We propose to learn multi-modal representations from audio-visual data using cross-modal attention.
In our generalised audio-visual zero-shot learning setting, we include all the training classes in the test-time search space.
Due to the lack of a unified benchmark in this domain, we introduce a (generalised) zero-shot learning benchmark on three audio-visual datasets.
arXiv Detail & Related papers (2022-03-07T18:52:13Z) - Joint Learning of Visual-Audio Saliency Prediction and Sound Source
Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z) - Squeeze-Excitation Convolutional Recurrent Neural Networks for
Audio-Visual Scene Classification [4.191965713559235]
This paper presents a multi-modal model for automatic scene classification.
It exploits simultaneously auditory and visual information.
It has been shown to provide an excellent trade-off between prediction performance and system complexity.
arXiv Detail & Related papers (2021-07-28T06:10:10Z) - Audiovisual Saliency Prediction in Uncategorized Video Sequences based
on Audio-Video Correlation [0.0]
This work aims to provide a generic audio/video saliency model augmenting a visual saliency map with an audio saliency map computed by synchronizing low-level audio and visual features.
The proposed model was evaluated using different criteria against eye fixations data for a publicly available DIEM video dataset.
arXiv Detail & Related papers (2021-01-07T14:22:29Z)