Temporal-Spatial Feature Pyramid for Video Saliency Detection
- URL: http://arxiv.org/abs/2105.04213v1
- Date: Mon, 10 May 2021 09:14:14 GMT
- Title: Temporal-Spatial Feature Pyramid for Video Saliency Detection
- Authors: Qinyao Chang, Shiping Zhu, Lanyun Zhu
- Abstract summary: We propose a 3D fully convolutional encoder-decoder architecture for video saliency detection.
Our model is simple yet effective, and can run in real time.
- Score: 2.578242050187029
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a 3D fully convolutional encoder-decoder
architecture for video saliency detection, which combines scale, space and time
information for video saliency modeling. The encoder extracts multi-scale
temporal-spatial features from the input continuous video frames, and then
constructs temporal-spatial feature pyramid through temporal-spatial
convolution and top-down feature integration. The decoder performs hierarchical
decoding of temporal-spatial features from different scales, and finally
produces a saliency map from the integration of multiple video frames. Our
model is simple yet effective, and can run in real time. We perform extensive
experiments, and the results indicate that the well-designed structure can
improve the precision of video saliency detection significantly. Experimental
results on three purely visual video saliency benchmarks and six audio-video
saliency benchmarks demonstrate that our method achieves state-of-the-art
performance.
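To make the pipeline described in the abstract concrete, the following is a minimal, hedged sketch of a 3D fully convolutional encoder-decoder with a temporal-spatial feature pyramid: a 3D convolutional encoder extracts multi-scale temporal-spatial features, lateral 1x1x1 convolutions plus top-down upsampling build the pyramid, and a small head integrates the frames into a single saliency map. The module names, channel widths, strides, and the plain Conv3d encoder below are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a 3D fully convolutional encoder-decoder with a
# temporal-spatial feature pyramid. Module names, channel widths, strides,
# and the plain Conv3d encoder are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

def ts_block(in_ch, out_ch, stride):
    """Assumed temporal-spatial conv block: Conv3d + BN + ReLU."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )

class TSFPSaliency(nn.Module):
    def __init__(self, base=32):
        super().__init__()
        # Encoder: multi-scale temporal-spatial features from a frame clip.
        self.enc1 = ts_block(3, base, stride=(1, 2, 2))
        self.enc2 = ts_block(base, base * 2, stride=(2, 2, 2))
        self.enc3 = ts_block(base * 2, base * 4, stride=(2, 2, 2))
        # Lateral 1x1x1 convs project each scale into the pyramid width.
        self.lat1 = nn.Conv3d(base, base, kernel_size=1)
        self.lat2 = nn.Conv3d(base * 2, base, kernel_size=1)
        self.lat3 = nn.Conv3d(base * 4, base, kernel_size=1)
        # Decoder head: one saliency channel per temporal-spatial position.
        self.head = nn.Conv3d(base, 1, kernel_size=1)

    def forward(self, clip):                       # clip: (B, 3, T, H, W)
        f1 = self.enc1(clip)
        f2 = self.enc2(f1)
        f3 = self.enc3(f2)
        # Top-down integration: upsample coarse features, add lateral ones.
        p3 = self.lat3(f3)
        p2 = self.lat2(f2) + F.interpolate(p3, size=f2.shape[2:],
                                           mode="trilinear", align_corners=False)
        p1 = self.lat1(f1) + F.interpolate(p2, size=f1.shape[2:],
                                           mode="trilinear", align_corners=False)
        # Integrate frames (mean over time) and restore input resolution.
        sal = self.head(p1).mean(dim=2)            # (B, 1, H/2, W/2)
        sal = F.interpolate(sal, size=clip.shape[-2:],
                            mode="bilinear", align_corners=False)
        return torch.sigmoid(sal)                  # saliency map in [0, 1]

# Usage: a 16-frame RGB clip produces one saliency map.
model = TSFPSaliency()
saliency = model(torch.randn(1, 3, 16, 64, 64))    # -> (1, 1, 64, 64)
```

The trilinear upsampling in the top-down path is one simple way to align coarse and fine temporal-spatial scales; the paper's actual hierarchical decoding and frame-integration scheme may differ.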
Related papers
- GaussianVideo: Efficient Video Representation via Hierarchical Gaussian Splatting [28.981174430968643]
We introduce a novel neural representation that combines 3D Gaussian splatting with continuous camera motion modeling.
Experimental results show that our hierarchical learning, combined with robust camera motion modeling, captures complex dynamic scenes with strong temporal consistency.
This memory-efficient approach achieves high-quality rendering at impressive speeds.
arXiv Detail & Related papers (2025-01-08T19:01:12Z) - VideoLifter: Lifting Videos to 3D with Fast Hierarchical Stereo Alignment [62.6737516863285]
VideoLifter is a novel framework that incrementally optimizes a globally sparse-to-dense 3D representation directly from video sequences.
By tracking and propagating sparse point correspondences across frames and fragments, VideoLifter incrementally refines camera poses and 3D structure.
This approach significantly accelerates the reconstruction process, reducing training time by over 82% while surpassing current state-of-the-art methods in visual fidelity and computational efficiency.
arXiv Detail & Related papers (2025-01-03T18:52:36Z) - RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z) - Three-Stage Cascade Framework for Blurry Video Frame Interpolation [23.38547327916875]
Blurry video frame interpolation (BVFI) aims to generate high-frame-rate clear videos from low-frame-rate blurry videos.
BVFI methods usually fail to fully leverage all valuable information, which ultimately hinders their performance.
We propose a simple end-to-end three-stage framework to fully explore useful information from blurry videos.
arXiv Detail & Related papers (2023-10-09T03:37:30Z) - A Spatial-Temporal Deformable Attention based Framework for Breast Lesion Detection in Videos [107.96514633713034]
We propose a spatial-temporal deformable attention based framework, named STNet.
Our STNet introduces a spatial-temporal deformable attention module to perform local spatial-temporal feature fusion.
Experiments on the public breast lesion ultrasound video dataset show that our STNet achieves state-of-the-art detection performance.
arXiv Detail & Related papers (2023-09-09T07:00:10Z) - You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z) - Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z) - Is Space-Time Attention All You Need for Video Understanding? [50.78676438502343]
We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
"TimeSformer" adapts the standard Transformer architecture to video by enabling feature learning from a sequence of frame-level patches.
TimeSformer achieves state-of-the-art results on several major action recognition benchmarks.
arXiv Detail & Related papers (2021-02-09T19:49:33Z) - Learnable Sampling 3D Convolution for Video Enhancement and Action Recognition [24.220358793070965]
We introduce a new module, LS3D-Conv, to improve the capability of 3D convolution.
We add learnable 2D offsets to the 3D convolution, which sample locations on the spatial feature maps across frames; a rough sketch of this idea appears after the list below.
Experiments on video interpolation, video super-resolution, video denoising, and action recognition demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2020-11-22T09:20:49Z) - A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while maintaining a high processing speed.
On the UCF101 action recognition benchmark, our method improves accuracy over state-of-the-art real-time methods by 5.4%, runs about 2 times faster at inference, and uses a model of less than 5 MB.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)
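As a companion to the LS3D-Conv entry above, here is a rough sketch, under assumed shapes and module names, of how learnable 2D offsets can steer where a 3D convolution samples on each frame's feature map: a small 2D conv predicts per-pixel offsets, grid_sample warps every frame accordingly, and an ordinary Conv3d then runs over the aligned stack. This only approximates the idea; the actual LS3D-Conv layer differs in its details.

```python
# Hedged sketch of "learnable 2D offsets for 3D convolution": warp each
# frame's feature map with predicted offsets, then apply a standard Conv3d.
# All names, shapes, and the 0.1 offset scale are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableSampling3dConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Predicts a (dx, dy) offset per spatial location, shared over frames.
        self.offset = nn.Conv2d(in_ch, 2, kernel_size=3, padding=1)
        self.conv3d = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):                           # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        frames = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        # Offsets bounded to [-1, 1], then scaled to a small displacement.
        offsets = torch.tanh(self.offset(frames)).permute(0, 2, 3, 1)
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device),
            indexing="ij",
        )
        base_grid = torch.stack((xs, ys), dim=-1).expand(b * t, h, w, 2)
        warped = F.grid_sample(frames, base_grid + 0.1 * offsets,
                               mode="bilinear", padding_mode="border",
                               align_corners=True)
        warped = warped.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        return self.conv3d(warped)                  # (B, out_ch, T, H, W)

# Usage on a toy clip of 8 feature frames.
layer = LearnableSampling3dConv(16, 32)
out = layer(torch.randn(2, 16, 8, 32, 32))          # -> (2, 32, 8, 32, 32)
```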
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.