Video Saliency Prediction Using Enhanced Spatiotemporal Alignment Network
- URL: http://arxiv.org/abs/2001.00292v1
- Date: Thu, 2 Jan 2020 02:05:35 GMT
- Title: Video Saliency Prediction Using Enhanced Spatiotemporal Alignment Network
- Authors: Jin Chen, Huihui Song, Kaihua Zhang, Bo Liu, Qingshan Liu
- Abstract summary: We develop an effective feature alignment network tailored to video saliency prediction (VSP).
The network learns to align the features of the neighboring frames to the reference one in a coarse-to-fine manner.
The proposed model is trained end-to-end without any post-processing.
- Score: 35.932447204088845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to a variety of motions across different frames, it is highly challenging
to learn an effective spatiotemporal representation for accurate video saliency
prediction (VSP). To address this issue, we develop an effective spatiotemporal
feature alignment network tailored to VSP, mainly including two key
sub-networks: a multi-scale deformable convolutional alignment network (MDAN)
and a bidirectional convolutional Long Short-Term Memory (Bi-ConvLSTM) network.
The MDAN learns to align the features of the neighboring frames to the
reference one in a coarse-to-fine manner, which handles various motions well. Specifically, the MDAN adopts a pyramidal feature hierarchy structure
that first leverages deformable convolution (Dconv) to align the
lower-resolution features across frames, and then aggregates the aligned
features to align the higher-resolution features, progressively enhancing the
features from top to bottom. The output of MDAN is then fed into the
Bi-ConvLSTM for further enhancement, which captures useful long-term temporal information along both the forward and backward directions in time to effectively guide the prediction of attention shifts under complex scene
transformation. Finally, the enhanced features are decoded to generate the
predicted saliency map. The proposed model is trained end-to-end without any
intricate post-processing. Extensive evaluations on four VSP benchmark datasets
demonstrate that the proposed method achieves favorable performance against
state-of-the-art methods. The source code and all results will be released.
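Pending that release, the two-stage design can be made concrete with a minimal PyTorch sketch of the coarse-to-fine deformable alignment idea, assuming torchvision's DeformConv2d. The module names, the offset-doubling convention across pyramid levels, and the pyramid construction are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of MDAN-style coarse-to-fine deformable alignment.
# Assumes torchvision's DeformConv2d; all names are illustrative, not the
# authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d


class AlignLevel(nn.Module):
    """Aligns one pyramid level of a neighboring frame to the reference frame."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Offsets are predicted from the concatenated [reference, neighbor] pair.
        self.offset_conv = nn.Conv2d(channels * 2, 2 * kernel_size ** 2,
                                     kernel_size, padding=pad)
        self.dconv = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, ref_feat, nbr_feat, coarser_offset=None):
        offset = self.offset_conv(torch.cat([ref_feat, nbr_feat], dim=1))
        if coarser_offset is not None:
            # Coarse-to-fine: inherit motion estimated at the lower resolution,
            # doubled because the spatial resolution doubles between levels.
            offset = offset + 2.0 * F.interpolate(
                coarser_offset, scale_factor=2, mode="bilinear",
                align_corners=False)
        return self.dconv(nbr_feat, offset), offset


class PyramidAlign(nn.Module):
    """Coarse-to-fine alignment over a feature pyramid (coarsest level first)."""

    def __init__(self, channels, num_levels=3):
        super().__init__()
        self.levels = nn.ModuleList(
            [AlignLevel(channels) for _ in range(num_levels)])

    def forward(self, ref_pyramid, nbr_pyramid):
        aligned, offset = None, None
        for level, ref_feat, nbr_feat in zip(self.levels, ref_pyramid, nbr_pyramid):
            aligned, offset = level(ref_feat, nbr_feat, offset)
        return aligned  # finest-level neighbor features aligned to the reference
```

The aligned features of all neighboring frames would then be stacked along time and fed to the bidirectional ConvLSTM. A correspondingly hedged sketch of that stage, with a single-layer cell and concatenation-based fusion chosen purely for illustration:

```python
class ConvLSTMCell(nn.Module):
    """A basic ConvLSTM cell: all LSTM gates computed by one convolution."""

    def __init__(self, channels):
        super().__init__()
        self.gates = nn.Conv2d(channels * 2, channels * 4, 3, padding=1)

    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


def bi_convlstm(frames, cell_fw, cell_bw):
    """Run a ConvLSTM over `frames` in both time directions and fuse."""
    zeros = torch.zeros_like(frames[0])
    outs_fw, h, c = [], zeros, zeros
    for x in frames:                        # forward pass in time
        h, c = cell_fw(x, h, c)
        outs_fw.append(h)
    outs_bw, h, c = [], zeros, zeros
    for x in reversed(frames):              # backward pass in time
        h, c = cell_bw(x, h, c)
        outs_bw.append(h)
    outs_bw.reverse()
    # Fuse the two directions by channel concatenation (one of several options).
    return [torch.cat([f, b], dim=1) for f, b in zip(outs_fw, outs_bw)]
```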
Related papers
- Double-Shot 3D Shape Measurement with a Dual-Branch Network [14.749887303860717]
We propose a dual-branch Convolutional Neural Network (CNN)-Transformer network (PDCNet) to process different structured light (SL) modalities.
Within PDCNet, a Transformer branch is used to capture global perception in the fringe images, while a CNN branch is designed to collect local details in the speckle images.
We show that our method can reduce fringe order ambiguity while producing high-accuracy results on a self-made dataset.
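As a rough illustration of the dual-branch idea in this entry, here is a hedged PyTorch sketch: a Transformer branch for global context in the fringe image and a CNN branch for local detail in the speckle image, fused by a 1x1 convolution. The patch size, depths, and fusion head are assumptions for illustration, not PDCNet's actual architecture.

```python
# Hedged sketch of a generic dual-branch CNN + Transformer design; names and
# sizes are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualBranchNet(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        # CNN branch: small receptive fields for fine local (speckle) detail.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True))
        # Transformer branch: global self-attention over fringe-image patches.
        self.embed = nn.Conv2d(1, dim, 8, stride=8)  # 8x8 patch embedding
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       batch_first=True), num_layers=2)
        self.fuse = nn.Conv2d(dim * 2, 1, 1)  # 1x1 fusion head

    def forward(self, fringe, speckle):
        local = self.cnn(speckle)                    # (N, dim, H, W)
        tokens = self.embed(fringe)                  # (N, dim, H/8, W/8)
        n, c, h, w = tokens.shape
        seq = tokens.flatten(2).transpose(1, 2)      # (N, h*w, dim)
        glob = self.encoder(seq).transpose(1, 2).reshape(n, c, h, w)
        glob = F.interpolate(glob, size=local.shape[-2:],
                             mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([local, glob], dim=1))
```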
arXiv Detail & Related papers (2024-07-19T10:49:26Z)
- Collaborative Feedback Discriminative Propagation for Video Super-Resolution [66.61201445650323]
The success of video super-resolution (VSR) methods stems mainly from exploiting spatial and temporal information.
Inaccurate alignment usually leads to aligned features with significant artifacts.
Existing propagation modules only propagate features of the same timestep forward or backward.
arXiv Detail & Related papers (2024-04-06T22:08:20Z)
- Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel semantic scene completion (SSC) framework: the Adversarial Modality Modulation Network (AMMNet).
AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition.
Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
arXiv Detail & Related papers (2024-03-12T11:48:49Z)
- 360 Layout Estimation via Orthogonal Planes Disentanglement and Multi-view Geometric Consistency Perception [56.84921040837699]
Existing panoramic layout estimation solutions tend to recover room boundaries from a vertically compressed sequence, yielding imprecise results.
We propose an orthogonal plane disentanglement network (termed DOPNet) to distinguish ambiguous semantics.
We also present an unsupervised adaptation technique tailored for horizon-depth and ratio representations.
Our solution outperforms other SoTA models on both monocular layout estimation and multi-view layout estimation tasks.
arXiv Detail & Related papers (2023-12-26T12:16:03Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- Deep Recurrent Neural Network with Multi-scale Bi-directional Propagation for Video Deblurring [36.94523101375519]
We propose a deep Recurrent Neural Network with Multi-scale Bi-directional Propagation (RNN-MBP) to propagate and gather information from unaligned neighboring frames for better video deblurring.
To better evaluate the proposed algorithm and existing state-of-the-art methods on real-world blurry scenes, we also create a Real-World Blurry Video dataset.
The proposed algorithm performs favorably against the state-of-the-art methods on three typical benchmarks.
arXiv Detail & Related papers (2021-12-09T11:02:56Z)
- Dual-view Snapshot Compressive Imaging via Optical Flow Aided Recurrent Neural Network [14.796204921975733]
Dual-view snapshot compressive imaging (SCI) aims to capture videos from two field-of-views (FoVs) in a single snapshot.
It is challenging for existing model-based decoding algorithms to reconstruct each individual scene.
We propose an optical flow-aided recurrent neural network for dual video SCI systems, which provides high-quality decoding in seconds.
arXiv Detail & Related papers (2021-09-11T14:24:44Z)
- Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The Full-duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously before the fusion decoding stage.
We show that our FSNet outperforms other state-of-the-arts for both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z)
- End-to-end Neural Video Coding Using a Compound Spatiotemporal Representation [33.54844063875569]
We propose a hybrid motion compensation (HMC) method that adaptively combines the predictions generated by two approaches.
Specifically, we generate a compound spatiotemporal representation (CSTR) through a recurrent information aggregation (RIA) module.
We further design a one-to-many decoder pipeline to generate multiple predictions from the CSTR, including vector-based resampling, adaptive kernel-based resampling, compensation mode selection maps and texture enhancements.
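The one-to-many decoding idea above can be illustrated with a hedged sketch: several candidate predictions fused per pixel by softmax mode-selection maps. The head shapes and the number of modes are assumptions for illustration, not the paper's design.

```python
# Hedged sketch of a "one-to-many decoder": candidate predictions fused per
# pixel by learned softmax selection maps. Names are illustrative assumptions.
import torch
import torch.nn as nn


class OneToManyFusion(nn.Module):
    def __init__(self, feat_ch, num_modes=3, out_ch=3):
        super().__init__()
        # Each head turns the shared representation into one candidate frame.
        self.heads = nn.ModuleList(
            [nn.Conv2d(feat_ch, out_ch, 3, padding=1) for _ in range(num_modes)])
        # Per-pixel selection logits, one channel per candidate mode.
        self.select = nn.Conv2d(feat_ch, num_modes, 3, padding=1)

    def forward(self, repr_feat):
        candidates = torch.stack([h(repr_feat) for h in self.heads], dim=1)
        weights = torch.softmax(self.select(repr_feat), dim=1).unsqueeze(2)
        return (weights * candidates).sum(dim=1)  # (N, out_ch, H, W)
```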
arXiv Detail & Related papers (2021-08-05T19:43:32Z)
- A Deep-Unfolded Reference-Based RPCA Network For Video Foreground-Background Separation [86.35434065681925]
This paper proposes a new deep-unfolding-based network design for the problem of Robust Principal Component Analysis (RPCA).
Unlike existing designs, our approach focuses on modeling the temporal correlation between the sparse representations of consecutive video frames.
Experimentation using the moving MNIST dataset shows that the proposed network outperforms a recently proposed state-of-the-art RPCA network in the task of video foreground-background separation.
arXiv Detail & Related papers (2020-10-02T11:40:09Z)
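To illustrate the deep-unfolding idea in this last entry: each network "layer" mimics one iteration of an RPCA solver, alternating a background (low-rank) update with soft-thresholding of the residual into a sparse foreground. The learned convolution standing in for the low-rank proximal step is a generic assumption, not the paper's reference-based design.

```python
# Hedged sketch of deep unfolding for RPCA-style foreground-background
# separation; a generic illustration, not the paper's architecture.
import torch
import torch.nn as nn


def soft_threshold(x, lam):
    # Proximal operator of the L1 norm: the source of sparsity in RPCA.
    return torch.sign(x) * torch.relu(x.abs() - lam)


class UnfoldedRPCALayer(nn.Module):
    def __init__(self, ch=1):
        super().__init__()
        self.bg_update = nn.Conv2d(ch * 2, ch, 3, padding=1)  # learned prox for L
        self.lam = nn.Parameter(torch.tensor(0.1))            # learned threshold

    def forward(self, x, bg, fg):
        bg = self.bg_update(torch.cat([x - fg, bg], dim=1))   # refine background
        fg = soft_threshold(x - bg, self.lam)                 # sparse foreground
        return bg, fg


class UnfoldedRPCA(nn.Module):
    def __init__(self, num_layers=5, ch=1):
        super().__init__()
        self.layers = nn.ModuleList(
            [UnfoldedRPCALayer(ch) for _ in range(num_layers)])

    def forward(self, x):
        bg, fg = x.clone(), torch.zeros_like(x)
        for layer in self.layers:  # each layer = one unrolled RPCA iteration
            bg, fg = layer(x, bg, fg)
        return bg, fg
```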