A Gated Fusion Network for Dynamic Saliency Prediction
- URL: http://arxiv.org/abs/2102.07682v1
- Date: Mon, 15 Feb 2021 17:18:37 GMT
- Title: A Gated Fusion Network for Dynamic Saliency Prediction
- Authors: Aysun Kocak, Erkut Erdem and Aykut Erdem
- Abstract summary: Gated Fusion Network for dynamic saliency (GFSalNet)
GFSalNet is the first deep saliency model capable of making predictions in a dynamic way via a gated fusion mechanism.
We show that it has a good generalization ability, and moreover, exploits temporal information more effectively via its adaptive fusion scheme.
- Score: 16.701214795454536
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Predicting saliency in videos is a challenging problem due to complex
modeling of interactions between spatial and temporal information, especially
when the ever-changing, dynamic nature of videos is considered. Recently,
researchers have proposed large-scale datasets and models that take advantage
of deep learning as a way to understand what's important for video saliency.
These approaches, however, learn to combine spatial and temporal features in a
static manner and do not adapt themselves much to the changes in the video
content. In this paper, we introduce Gated Fusion Network for dynamic saliency
(GFSalNet), the first deep saliency model capable of making predictions in a
dynamic way via a gated fusion mechanism. Moreover, our model also exploits
spatial and channel-wise attention within a multi-scale architecture that
further allows for highly accurate predictions. We evaluate the proposed
approach on a number of datasets, and our experimental analysis demonstrates
that it outperforms or is highly competitive with the state of the art.
Importantly, we show that it has a good generalization ability, and moreover,
exploits temporal information more effectively via its adaptive fusion scheme.
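The adaptive fusion scheme in the abstract, a learned gate that decides per location how much the spatial versus the temporal stream contributes, can be sketched roughly as follows. This is a minimal NumPy illustration under assumed shapes; the function names, the single-layer gate, and the weight-free channel attention are placeholders, not the paper's actual architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(spatial, temporal, w, b):
    """Fuse spatial and temporal feature maps with a learned gate.

    spatial, temporal: (C, H, W) feature maps.
    w: (C, 2*C) gate weights, b: (C,) bias, giving a per-channel,
    per-pixel gate in [0, 1].
    """
    C, H, W = spatial.shape
    # Concatenate the two streams along the channel axis.
    stacked = np.concatenate([spatial, temporal], axis=0)      # (2C, H, W)
    flat = stacked.reshape(2 * C, H * W)                       # (2C, HW)
    gate = sigmoid(w @ flat + b[:, None]).reshape(C, H, W)     # (C, H, W)
    # Convex combination: the gate adapts the mix to the content.
    return gate * spatial + (1.0 - gate) * temporal

def channel_attention(x):
    """Squeeze-and-excitation-style channel gate (weight-free sketch:
    a real module would pass the pooled vector through a small MLP)."""
    pooled = x.mean(axis=(1, 2))               # global average pool, (C,)
    excite = sigmoid(pooled)                   # placeholder excitation
    return x * excite[:, None, None]           # reweight each channel
```

With zero gate weights the gate is uniformly 0.5 and the fusion reduces to a plain average of the two streams; training the weights is what makes the mixing content-dependent rather than static.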
Related papers
- On the Benefits of Instance Decomposition in Video Prediction Models [5.653106385738823]
State-of-the-art video prediction methods typically model the dynamics of a scene jointly and implicitly, without any explicit decomposition into separate objects.
This is challenging and potentially sub-optimal, as every object in a dynamic scene has their own pattern of movement, typically somewhat independent of others.
In this paper, we investigate the benefit of explicitly modeling the objects in a dynamic scene separately within the context of latent-transformer video prediction models.
arXiv Detail & Related papers (2025-01-17T21:36:06Z)
- Conservation-informed Graph Learning for Spatiotemporal Dynamics Prediction [84.26340606752763]
In this paper, we introduce the conservation-informed GNN (CiGNN), an end-to-end explainable learning framework.
The network is designed to conform to the general conservation law via symmetry, where conservative and non-conservative information passes over a multiscale space by a latent temporal marching strategy.
Results demonstrate that CiGNN exhibits remarkable baseline accuracy and generalizability, and is readily applicable to learning for prediction of various spatiotemporal dynamics.
arXiv Detail & Related papers (2024-12-16T13:57:02Z)
- InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor.
Our key insight is that large video foundation models can act as both neural renderers and implicit physics simulators by learning interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-04T06:33:27Z)
- Lightweight Stochastic Video Prediction via Hybrid Warping [10.448675566568086]
Accurate video prediction by deep neural networks, especially for dynamic regions, is a challenging task in computer vision for critical applications such as autonomous driving, remote working, and telemedicine.
We propose a novel long-term video prediction model that focuses on dynamic regions by employing a hybrid warping strategy.
To enable real-time prediction, we introduce a MobileNet-based lightweight architecture into our model.
arXiv Detail & Related papers (2023-03-17T12:55:22Z)
- Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures on Kinetics-400.
arXiv Detail & Related papers (2022-04-25T19:06:48Z)
- Temporal Relevance Analysis for Video Action Models [70.39411261685963]
We first propose a new approach to quantify the temporal relationships between frames captured by CNN-based action models.
We then conduct comprehensive experiments and in-depth analysis to provide a better understanding of how temporal modeling is affected.
arXiv Detail & Related papers (2022-04-25T11:12:37Z)
- Goal-driven Self-Attentive Recurrent Networks for Trajectory Prediction [31.02081143697431]
Human trajectory forecasting is a key component of autonomous vehicles, social-aware robots and video-surveillance applications.
We propose a lightweight attention-based recurrent backbone that acts solely on past observed positions.
We employ a common goal module, based on a U-Net architecture, which additionally extracts semantic information to predict scene-compliant destinations.
arXiv Detail & Related papers (2022-04-25T11:12:37Z)
- Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation.
In addition, we develop a multi-scale frame scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z)
- TCL: Transformer-based Dynamic Graph Modelling via Contrastive Learning [87.38675639186405]
We propose a novel graph neural network approach, called TCL, which deals with the dynamically-evolving graph in a continuous-time fashion.
To the best of our knowledge, this is the first attempt to apply contrastive learning to representation learning on dynamic graphs.
arXiv Detail & Related papers (2021-05-17T15:33:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed above and is not responsible for any consequences of its use.