Audiovisual Saliency Prediction in Uncategorized Video Sequences based
on Audio-Video Correlation
- URL: http://arxiv.org/abs/2101.03966v1
- Date: Thu, 7 Jan 2021 14:22:29 GMT
- Title: Audiovisual Saliency Prediction in Uncategorized Video Sequences based
on Audio-Video Correlation
- Authors: Maryam Qamar Butt and Anis Ur Rahman
- Abstract summary: This work aims to provide a generic audio/video saliency model augmenting a visual saliency map with an audio saliency map computed by synchronizing low-level audio and visual features.
The proposed model was evaluated using different criteria against eye fixation data from the publicly available DIEM video dataset.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Substantial research has been done in saliency modeling to develop
intelligent machines that can perceive and interpret their surroundings. However,
existing models treat videos as mere image sequences, excluding any audio
information, and are therefore unable to cope with inherently varying content. Based on the
hypothesis that an audiovisual saliency model will be an improvement over
traditional saliency models for natural uncategorized videos, this work aims to
provide a generic audio/video saliency model augmenting a visual saliency map
with an audio saliency map computed by synchronizing low-level audio and visual
features. The proposed model was evaluated using different criteria against eye
fixation data from the publicly available DIEM video dataset. The results show
that the model outperformed two state-of-the-art visual saliency models.
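As a rough, hypothetical sketch of the fusion idea described in the abstract, the Python snippet below blends a visual saliency map with an audio saliency map weighted by an audio-visual correlation score. The function name `fuse_saliency`, the normalization, and the additive weighting are illustrative assumptions, not the authors' actual model.

```python
import numpy as np

def fuse_saliency(visual_map, audio_map, audio_visual_corr):
    """Hypothetical fusion of visual and audio saliency maps.

    visual_map, audio_map : 2-D arrays of the same shape.
    audio_visual_corr     : scalar in [0, 1], e.g. how strongly low-level
                            audio energy correlates with frame-level motion.
    """
    # Normalise both maps so neither modality dominates by scale alone.
    v = visual_map / (visual_map.max() + 1e-8)
    a = audio_map / (audio_map.max() + 1e-8)
    # Weight the audio contribution by how well audio and video agree;
    # with zero correlation the result falls back to visual saliency only.
    fused = v + audio_visual_corr * a
    return fused / (fused.max() + 1e-8)

# Example with random maps standing in for real saliency predictions.
rng = np.random.default_rng(0)
fused = fuse_saliency(rng.random((36, 64)), rng.random((36, 64)), 0.6)
print(fused.shape, float(fused.min()), float(fused.max()))
```

In this toy formulation the audio map only adds saliency where the two modalities agree, which matches the intuition of augmenting, rather than replacing, the visual prediction.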
Related papers
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion
Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z) - Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model
Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z) - AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models [92.92233932921741]
We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations.
We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks.
We show that representations may be improved with intermediate-task fine-tuning, and that audio event classification on AudioSet serves as a strong intermediate task.
arXiv Detail & Related papers (2023-09-19T17:35:16Z) - Fine-grained Audible Video Description [61.81122862375985]
We construct the first fine-grained audible video description benchmark (FAVDBench).
For each video clip, we first provide a one-sentence summary of the video, followed by 4-6 sentences describing the visual details and 1-2 audio-related descriptions at the end.
We demonstrate that employing fine-grained video descriptions can produce more intricate videos than using conventional captions.
arXiv Detail & Related papers (2023-03-27T22:03:48Z) - Speech Driven Video Editing via an Audio-Conditioned Diffusion Model [1.6763474728913939]
We propose a method for end-to-end speech-driven video editing using a denoising diffusion model.
We show this is possible by conditioning a denoising diffusion model on audio mel spectral features to generate synchronised facial motion.
To the best of our knowledge, this is the first work to demonstrate and validate the feasibility of applying end-to-end denoising diffusion models to the task of audio-driven video editing.
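As a concrete illustration of what "audio mel spectral features" typically refers to (not this paper's actual preprocessing pipeline), the snippet below computes log-mel features with librosa; the sample rate, FFT size, hop length, and mel-band count are arbitrary assumptions.

```python
import numpy as np
import librosa

# Synthetic one-second 440 Hz tone standing in for a real soundtrack clip.
sr = 16000
t = np.linspace(0.0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440.0 * t).astype(np.float32)

# 80-band log-mel spectrogram, a common conditioning feature for
# audio-driven generative models (all settings here are illustrative).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)

print(log_mel.shape)  # (80, n_frames); frames can then be aligned with video frames
```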
arXiv Detail & Related papers (2023-01-10T12:01:20Z) - Audio-visual speech enhancement with a deep Kalman filter generative
model [0.0]
We present an audiovisual deep Kalman filter (AV-DKF) generative model which assumes a first-order Markov chain model for the latent variables.
We develop an efficient inference methodology to estimate speech signals at test time.
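For readers unfamiliar with the model class, the toy snippet below runs a classical linear-Gaussian Kalman filter over a simulated first-order Markov latent sequence. It only illustrates the predict/correct structure such models build on; it is not the AV-DKF model or its inference method, and all constants are made up.

```python
import numpy as np

# Toy linear-Gaussian state-space model with a first-order Markov latent:
#   z_t = a * z_{t-1} + process noise,   x_t = c * z_t + observation noise.
a, c, q, r = 0.95, 1.0, 0.1, 0.5          # transition, emission, noise variances
rng = np.random.default_rng(0)

# Simulate a short latent trajectory and its noisy observations.
T = 50
z = np.zeros(T)
x = np.zeros(T)
for t in range(1, T):
    z[t] = a * z[t - 1] + rng.normal(scale=np.sqrt(q))
    x[t] = c * z[t] + rng.normal(scale=np.sqrt(r))

# Classical Kalman filter recursion: predict, then correct with each x_t.
m, p = 0.0, 1.0                            # current posterior mean and variance
estimates = []
for t in range(T):
    m_pred, p_pred = a * m, a * a * p + q              # predict step
    k = p_pred * c / (c * c * p_pred + r)              # Kalman gain
    m = m_pred + k * (x[t] - c * m_pred)               # correct step
    p = (1.0 - k * c) * p_pred
    estimates.append(m)

print(float(np.mean((np.array(estimates) - z) ** 2)))  # filtering error (MSE)
```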
arXiv Detail & Related papers (2022-11-02T09:50:08Z) - Repetitive Activity Counting by Sight and Sound [110.36526333035907]
This paper strives for repetitive activity counting in videos.
Different from existing works, which all analyze only the visual video content, we incorporate, for the first time, the corresponding sound into the repetition counting process.
arXiv Detail & Related papers (2021-03-24T11:15:33Z) - Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: efficacy study of single features on the task, and an ablation study where we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry the most information, and that including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z) - Speech Prediction in Silent Videos using Variational Autoencoders [29.423462898526605]
We present a model for generating speech in a silent video.
The proposed model combines recurrent neural networks and variational deep generative models to learn the conditional distribution of the auditory signal.
We demonstrate the performance of our model on the GRID dataset based on standard benchmarks.
arXiv Detail & Related papers (2020-11-14T17:09:03Z) - Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data [9.072124914105325]
We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings.
Experiments on the large scale sound events dataset, AudioSet, demonstrate the efficacy of the proposed model.
arXiv Detail & Related papers (2020-05-29T01:30:14Z) - NAViDAd: A No-Reference Audio-Visual Quality Metric Based on a Deep
Autoencoder [0.0]
We propose a No-Reference Audio-Visual Quality Metric Based on a Deep Autoencoder (NAViDAd).
The model is formed by a 2-layer framework that includes a deep autoencoder layer and a classification layer.
The model performed well when tested against the UnB-AV and the LiveNetflix-II databases.
arXiv Detail & Related papers (2020-01-30T15:40:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.