STAViS: Spatio-Temporal AudioVisual Saliency Network
- URL: http://arxiv.org/abs/2001.03063v2
- Date: Sun, 14 Jun 2020 18:45:08 GMT
- Title: STAViS: Spatio-Temporal AudioVisual Saliency Network
- Authors: Antigoni Tsiami, Petros Koutras and Petros Maragos
- Abstract summary: STAViS is a network that combines visual saliency and auditory features.
It learns to appropriately localize sound sources and to fuse the two saliencies in order to obtain a final saliency map.
We compare our method against 8 different state-of-the-art visual saliency models.
- Score: 45.04894808904767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce STAViS, a spatio-temporal audiovisual saliency network that
combines spatio-temporal visual and auditory information in order to
efficiently address the problem of saliency estimation in videos. Our approach
employs a single network that combines visual saliency and auditory features
and learns to appropriately localize sound sources and to fuse the two
saliencies in order to obtain a final saliency map. The network has been
designed, trained end-to-end, and evaluated on six different databases that
contain audiovisual eye-tracking data of a large variety of videos. We compare
our method against 8 different state-of-the-art visual saliency models.
Evaluation results across databases indicate that our STAViS model outperforms
our visual only variant as well as the other state-of-the-art models in the
majority of cases. Also, the consistently good performance it achieves for all
databases indicates that it is appropriate for estimating saliency
"in-the-wild". The code is available at https://github.com/atsiami/STAViS.
Related papers
- Towards Open-Vocabulary Audio-Visual Event Localization [59.23161248808759]
We introduce the Open-Vocabulary Audio-Visual Event localization problem.
This problem requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference.
We propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes.
arXiv Detail & Related papers (2024-11-18T04:35:20Z) - AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models [92.92233932921741]
We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations.
We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks.
We show that representations can be improved with intermediate-task fine-tuning, and that audio event classification on AudioSet serves as a strong intermediate task.
arXiv Detail & Related papers (2023-09-19T17:35:16Z) - Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z) - Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
arXiv Detail & Related papers (2023-05-23T07:54:37Z) - Audio-visual Generalised Zero-shot Learning with Cross-modal Attention
and Language [38.02396786726476]
We propose to learn multi-modal representations from audio-visual data using cross-modal attention.
In our generalised audio-visual zero-shot learning setting, we include all the training classes in the test-time search space.
Due to the lack of a unified benchmark in this domain, we introduce a (generalised) zero-shot learning benchmark on three audio-visual datasets.
arXiv Detail & Related papers (2022-03-07T18:52:13Z) - Joint Learning of Visual-Audio Saliency Prediction and Sound Source
Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z) - Squeeze-Excitation Convolutional Recurrent Neural Networks for
Audio-Visual Scene Classification [4.191965713559235]
This paper presents a multi-modal model for automatic scene classification.
It exploits simultaneously auditory and visual information.
It has been shown to provide an excellent trade-off between prediction performance and system complexity.
arXiv Detail & Related papers (2021-07-28T06:10:10Z) - Audiovisual Saliency Prediction in Uncategorized Video Sequences based
on Audio-Video Correlation [0.0]
This work aims to provide a generic audio/video saliency model augmenting a visual saliency map with an audio saliency map computed by synchronizing low-level audio and visual features.
The proposed model was evaluated using different criteria against eye fixations data for a publicly available DIEM video dataset.
arXiv Detail & Related papers (2021-01-07T14:22:29Z)