A Gated Fusion Network for Dynamic Saliency Prediction
- URL: http://arxiv.org/abs/2102.07682v1
- Date: Mon, 15 Feb 2021 17:18:37 GMT
- Title: A Gated Fusion Network for Dynamic Saliency Prediction
- Authors: Aysun Kocak, Erkut Erdem and Aykut Erdem
- Abstract summary: Gated Fusion Network for dynamic saliency (GFSalNet)
GFSalNet is the first deep saliency model capable of making predictions in a dynamic way via a gated fusion mechanism.
We show that it has a good generalization ability, and moreover, exploits temporal information more effectively via its adaptive fusion scheme.
- Score: 16.701214795454536
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Predicting saliency in videos is a challenging problem due to complex
modeling of interactions between spatial and temporal information, especially
when the ever-changing, dynamic nature of videos is considered. Recently,
researchers have proposed large-scale datasets and models that take advantage
of deep learning as a way to understand what's important for video saliency.
These approaches, however, learn to combine spatial and temporal features in a
static manner and do not adapt themselves much to the changes in the video
content. In this paper, we introduce Gated Fusion Network for dynamic saliency
(GFSalNet), the first deep saliency model capable of making predictions in a
dynamic way via a gated fusion mechanism. Moreover, our model also exploits
spatial and channel-wise attention within a multi-scale architecture that
further allows for highly accurate predictions. We evaluate the proposed
approach on a number of datasets, and our experimental analysis demonstrates
that it outperforms or is highly competitive with the state of the art.
Importantly, we show that it has a good generalization ability, and moreover,
exploits temporal information more effectively via its adaptive fusion scheme.
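The adaptive fusion scheme in the abstract, a learned gate that decides per location how much the spatial versus the temporal stream contributes, can be sketched roughly as follows. This is a minimal NumPy illustration under assumed shapes; the function names, the single-layer gate, and the weight-free channel attention are placeholders, not the paper's actual architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(spatial, temporal, w, b):
    """Fuse spatial and temporal feature maps with a learned gate.

    spatial, temporal: (C, H, W) feature maps.
    w: (C, 2*C) gate weights, b: (C,) bias, giving a per-channel,
    per-pixel gate in [0, 1].
    """
    C, H, W = spatial.shape
    # Concatenate the two streams along the channel axis.
    stacked = np.concatenate([spatial, temporal], axis=0)      # (2C, H, W)
    flat = stacked.reshape(2 * C, H * W)                       # (2C, HW)
    gate = sigmoid(w @ flat + b[:, None]).reshape(C, H, W)     # (C, H, W)
    # Convex combination: the gate adapts the mix to the content.
    return gate * spatial + (1.0 - gate) * temporal

def channel_attention(x):
    """Squeeze-and-excitation-style channel gate (weight-free sketch:
    a real module would pass the pooled vector through a small MLP)."""
    pooled = x.mean(axis=(1, 2))               # global average pool, (C,)
    excite = sigmoid(pooled)                   # placeholder excitation
    return x * excite[:, None, None]           # reweight each channel
```

With zero gate weights the gate is uniformly 0.5 and the fusion reduces to a plain average of the two streams; training the weights is what makes the mixing content-dependent rather than static.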
Related papers
- On the Benefits of Instance Decomposition in Video Prediction Models [5.653106385738823]
State-of-the-art video prediction methods typically model the dynamics of a scene jointly and implicitly, without any explicit decomposition into separate objects.
This is challenging and potentially sub-optimal, as every object in a dynamic scene has their own pattern of movement, typically somewhat independent of others.
In this paper, we investigate the benefit of explicitly modeling the objects in a dynamic scene separately within the context of latent-transformer video prediction models.
arXiv Detail & Related papers (2025-01-17T21:36:06Z)
- Conservation-informed Graph Learning for Spatiotemporal Dynamics Prediction [84.26340606752763]
In this paper, we introduce the conservation-informed GNN (CiGNN), an end-to-end explainable learning framework.
The network is designed to conform to the general conservation law via symmetry, where conservative and non-conservative information passes over a multiscale space by a latent temporal marching strategy.
Results demonstrate that CiGNN exhibits remarkable baseline accuracy and generalizability, and is readily applicable to learning for prediction of various spatiotemporal dynamics.
arXiv Detail & Related papers (2024-12-16T13:57:02Z)
- InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor.
Our key insight is that large video foundation models can act as both neural renderers and implicit physics simulators by learning interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-04T06:33:27Z)
- Lightweight Stochastic Video Prediction via Hybrid Warping [10.448675566568086]
Accurate video prediction by deep neural networks, especially for dynamic regions, is a challenging task in computer vision for critical applications such as autonomous driving, remote working, and telemedicine.
We propose a novel long-term video prediction model that focuses on dynamic regions by employing a hybrid warping strategy.
To enable real-time prediction, we introduce a MobileNet-based lightweight architecture into our model.
arXiv Detail & Related papers (2023-03-17T12:55:22Z)
- Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures on Kinetics-400.
arXiv Detail & Related papers (2022-04-25T19:06:48Z)
- Temporal Relevance Analysis for Video Action Models [70.39411261685963]
We first propose a new approach to quantify the temporal relationships between frames captured by CNN-based action models.
We then conduct comprehensive experiments and in-depth analysis to provide a better understanding of how temporal modeling is affected.
arXiv Detail & Related papers (2022-04-25T11:12:37Z)
- Goal-driven Self-Attentive Recurrent Networks for Trajectory Prediction [31.02081143697431]
Human trajectory forecasting is a key component of autonomous vehicles, social-aware robots and video-surveillance applications.
We propose a lightweight attention-based recurrent backbone that acts solely on past observed positions.
We employ a common goal module, based on a U-Net architecture, which additionally extracts semantic information to predict scene-compliant destinations.
arXiv Detail & Related papers (2022-04-25T11:12:37Z)
- Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation.
In addition, we develop a multi-scale frame scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z)
- TCL: Transformer-based Dynamic Graph Modelling via Contrastive Learning [87.38675639186405]
We propose a novel graph neural network approach, called TCL, which deals with the dynamically-evolving graph in a continuous-time fashion.
To the best of our knowledge, this is the first attempt to apply contrastive learning to representation learning on dynamic graphs.
arXiv Detail & Related papers (2021-05-17T15:33:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed above and is not responsible for any consequences of its use.