Temporal-Spatial Feature Pyramid for Video Saliency Detection
- URL: http://arxiv.org/abs/2105.04213v1
- Date: Mon, 10 May 2021 09:14:14 GMT
- Title: Temporal-Spatial Feature Pyramid for Video Saliency Detection
- Authors: Qinyao Chang, Shiping Zhu, Lanyun Zhu
- Abstract summary: We propose a 3D fully convolutional encoder-decoder architecture for video saliency detection.
Our model is simple yet effective, and can run in real time.
- Score: 2.578242050187029
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a 3D fully convolutional encoder-decoder
architecture for video saliency detection, which combines scale, space and time
information for video saliency modeling. The encoder extracts multi-scale
temporal-spatial features from the input continuous video frames, and then
constructs temporal-spatial feature pyramid through temporal-spatial
convolution and top-down feature integration. The decoder performs hierarchical
decoding of temporal-spatial features from different scales, and finally
produces a saliency map from the integration of multiple video frames. Our
model is simple yet effective, and can run in real time. We perform extensive
experiments, and the results indicate that the well-designed structure can
significantly improve the precision of video saliency detection. Experimental
results on three purely visual video saliency benchmarks and six audio-video
saliency benchmarks demonstrate that our method achieves state-of-the-art
performance.
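To make the described pipeline concrete, here is a minimal PyTorch sketch of a 3D fully convolutional encoder with top-down feature-pyramid integration and a simple decoding head. It is an illustrative reconstruction, not the authors' code: class names, channel widths, strides, and the way time is collapsed are all assumptions.

```python
# Hypothetical sketch of a temporal-spatial feature pyramid for saliency;
# layer sizes are illustrative, not the paper's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TSBlock(nn.Module):
    """One temporal-spatial (3D) convolution block."""
    def __init__(self, cin, cout, stride=(1, 2, 2)):
        super().__init__()
        self.conv = nn.Conv3d(cin, cout, kernel_size=3, stride=stride, padding=1)
        self.bn = nn.BatchNorm3d(cout)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))

class TSFPSaliency(nn.Module):
    """Encoder extracts multi-scale temporal-spatial features, a top-down
    pathway builds the pyramid, and the head fuses them into one saliency map."""
    def __init__(self, ch=(16, 32, 64, 128)):
        super().__init__()
        self.enc = nn.ModuleList([
            TSBlock(3, ch[0]),
            TSBlock(ch[0], ch[1], stride=(2, 2, 2)),
            TSBlock(ch[1], ch[2], stride=(2, 2, 2)),
            TSBlock(ch[2], ch[3], stride=(2, 2, 2)),
        ])
        # 1x1x1 lateral convs project every scale to a common width
        self.lateral = nn.ModuleList([nn.Conv3d(c, ch[0], 1) for c in ch])
        self.head = nn.Conv3d(ch[0], 1, kernel_size=1)

    def forward(self, clip):                          # clip: (B, 3, T, H, W)
        feats, x = [], clip
        for block in self.enc:
            x = block(x)
            feats.append(x)
        # top-down integration: upsample coarse levels and add lateral features
        p = self.lateral[-1](feats[-1])
        for i in range(len(feats) - 2, -1, -1):
            p = F.interpolate(p, size=feats[i].shape[2:], mode="trilinear",
                              align_corners=False) + self.lateral[i](feats[i])
        # decode: collapse the temporal axis and upsample to input resolution
        sal = self.head(p).mean(dim=2)                # (B, 1, h, w)
        sal = F.interpolate(sal, size=clip.shape[-2:], mode="bilinear",
                            align_corners=False)
        return torch.sigmoid(sal)

saliency = TSFPSaliency()(torch.randn(1, 3, 16, 112, 112))   # (1, 1, 112, 112)
```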
Related papers
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
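As a hedged illustration of the explicit half of such a hybrid representation (an assumed reading of the summary above, not RAVEN's actual design), the sketch below stores three axis-aligned 2D feature planes over (x, y), (x, t) and (y, t) and sums bilinear samples from them for each space-time query point.

```python
# Hypothetical tri-plane lookup for video; channel count and resolution
# are illustrative.
import torch
import torch.nn.functional as F

C, R = 32, 64                                    # feature channels, plane resolution
planes = torch.randn(3, C, R, R)                 # (x, y), (x, t) and (y, t) planes

def triplane_features(points):
    # points: (N, 3) normalized (x, y, t) coordinates in [-1, 1]
    x, y, t = points[:, 0], points[:, 1], points[:, 2]
    coords = torch.stack([
        torch.stack([x, y], dim=-1),             # query the (x, y) plane
        torch.stack([x, t], dim=-1),             # query the (x, t) plane
        torch.stack([y, t], dim=-1),             # query the (y, t) plane
    ])                                           # (3, N, 2)
    grid = coords.unsqueeze(1)                   # (3, 1, N, 2), as grid_sample expects
    sampled = F.grid_sample(planes, grid, mode="bilinear", align_corners=True)
    return sampled.squeeze(2).sum(dim=0).t()     # (N, C) summed plane features

feats = triplane_features(torch.rand(1024, 3) * 2 - 1)   # features for 1024 points
```

A small implicit decoder (e.g. an MLP) would then map these features to pixel values; that part is omitted here.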
arXiv Detail & Related papers (2024-01-11T16:48:44Z) - On the Relevance of Temporal Features for Medical Ultrasound Video Recognition [0.0]
We propose a novel multi-head attention architecture to achieve better sample efficiency on common ultrasound tasks.
We compare the performance of our architecture to an efficient 3D CNN video recognition model in two settings.
These results suggest that expressive time-independent models may be more effective than state-of-the-art video recognition models for some common ultrasound tasks in the low-data regime.
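As a hedged sketch of what a time-independent model over per-frame features could look like (the backbone, the attention pooling, and all sizes here are assumptions, not the paper's architecture): each frame is embedded by a 2D encoder and a multi-head attention layer pools the set of frame features without modeling their order.

```python
# Hypothetical order-invariant video classifier: 2D per-frame features
# pooled with multi-head attention.
import torch
import torch.nn as nn

class FrameAttentionClassifier(nn.Module):
    def __init__(self, dim=256, heads=4, num_classes=2):
        super().__init__()
        self.backbone = nn.Sequential(            # stand-in per-frame 2D encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, clip):                      # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        frames = self.backbone(clip.flatten(0, 1)).view(b, t, -1)
        pooled, _ = self.attn(self.query.expand(b, -1, -1), frames, frames)
        return self.fc(pooled.squeeze(1))         # prediction ignores frame order

logits = FrameAttentionClassifier()(torch.randn(2, 8, 3, 96, 96))
```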
arXiv Detail & Related papers (2023-10-16T14:35:29Z) - Three-Stage Cascade Framework for Blurry Video Frame Interpolation [23.38547327916875]
Blurry video frame interpolation (BVFI) aims to generate high-frame-rate clear videos from low-frame-rate blurry videos.
BVFI methods usually fail to fully leverage all valuable information, which ultimately hinders their performance.
We propose a simple end-to-end three-stage framework to fully explore useful information from blurry videos.
arXiv Detail & Related papers (2023-10-09T03:37:30Z) - A Spatial-Temporal Deformable Attention based Framework for Breast Lesion Detection in Videos [107.96514633713034]
We propose a spatial-temporal deformable attention based framework, named STNet.
Our STNet introduces a spatial-temporal deformable attention module to perform local spatial-temporal feature fusion.
Experiments on the public breast lesion ultrasound video dataset show that our STNet obtains a state-of-the-art detection performance.
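Below is a heavily simplified, assumed sketch of spatial-temporal deformable attention: each query location predicts a few 3D sampling offsets and weights, gathers features from the video volume at those offset locations, and fuses the weighted samples. STNet's actual module (multi-scale, multi-head, with its own normalization) will differ.

```python
# Hypothetical single-head spatial-temporal deformable attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class STDeformAttn(nn.Module):
    def __init__(self, dim=64, points=4):
        super().__init__()
        self.points = points
        self.offset = nn.Linear(dim, points * 3)       # (dx, dy, dt) per sampling point
        self.weight = nn.Linear(dim, points)
        self.proj = nn.Linear(dim, dim)

    def forward(self, feat, ref):
        # feat: (B, C, T, H, W) video features; ref: (B, N, 3) query locations in [-1, 1]
        b, c, t, h, w = feat.shape
        n = ref.shape[1]
        q = F.grid_sample(feat, ref.reshape(b, n, 1, 1, 3), align_corners=True)
        q = q.reshape(b, c, n).transpose(1, 2)          # (B, N, C) query features
        off = torch.tanh(self.offset(q)).reshape(b, n, self.points, 3)
        wts = self.weight(q).softmax(dim=-1)            # (B, N, K) attention weights
        loc = (ref.unsqueeze(2) + 0.1 * off).clamp(-1, 1)
        sampled = F.grid_sample(feat, loc.reshape(b, n, self.points, 1, 3),
                                align_corners=True)     # (B, C, N, K, 1)
        sampled = sampled.squeeze(-1).permute(0, 2, 3, 1)    # (B, N, K, C)
        return self.proj((wts.unsqueeze(-1) * sampled).sum(dim=2))  # (B, N, C)

out = STDeformAttn()(torch.randn(2, 64, 8, 32, 32), torch.rand(2, 100, 3) * 2 - 1)
```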
arXiv Detail & Related papers (2023-09-09T07:00:10Z) - You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z) - Condensing a Sequence to One Informative Frame for Video Recognition [113.3056598548736]
This paper studies a two-step alternative that first condenses the video sequence to an informative "frame" and then applies image-based recognition to that synthetic frame.
A valid question is how to define "useful information" and how to distill it from a sequence down to one synthetic frame.
IFS consistently demonstrates evident improvements on image-based 2D networks and clip-based 3D networks.
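A toy sketch of the general condensation idea (not the paper's IFS procedure; the scorer and weighting scheme are assumptions): a small network scores every frame, and the weighted sum of frames forms one synthetic frame that any image-based 2D model can consume.

```python
# Hypothetical frame condensation via learned per-frame weights.
import torch
import torch.nn as nn

class FrameCondenser(nn.Module):
    def __init__(self):
        super().__init__()
        self.scorer = nn.Sequential(               # per-frame importance score
            nn.Conv2d(3, 8, 3, stride=4, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))

    def forward(self, clip):                       # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        scores = self.scorer(clip.flatten(0, 1)).view(b, t)
        weights = scores.softmax(dim=1).view(b, t, 1, 1, 1)
        return (weights * clip).sum(dim=1)         # (B, 3, H, W) synthetic frame

frame = FrameCondenser()(torch.randn(2, 16, 3, 112, 112))   # feed to any 2D network
```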
arXiv Detail & Related papers (2022-01-11T16:13:43Z) - Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z) - Is Space-Time Attention All You Need for Video Understanding? [50.78676438502343]
We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
"TimeSformer" adapts the standard Transformer architecture to video by enabling feature learning from a sequence of frame-level patches.
TimeSformer achieves state-of-the-art results on several major action recognition benchmarks.
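A compact sketch of one divided space-time attention block over frame-level patch tokens (dimensions and depth are illustrative, and the MLP sub-layer is omitted; this is not the released TimeSformer code): tokens first attend across time at the same spatial position, then across space within the same frame.

```python
# Hypothetical divided space-time self-attention block.
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                          # x: (B, T, N, D) patch tokens
        b, t, n, d = x.shape
        xt = self.norm1(x).permute(0, 2, 1, 3).reshape(b * n, t, d)
        xt, _ = self.time_attn(xt, xt, xt)         # attention across time
        x = x + xt.reshape(b, n, t, d).permute(0, 2, 1, 3)
        xs = self.norm2(x).reshape(b * t, n, d)
        xs, _ = self.space_attn(xs, xs, xs)        # attention across space
        return x + xs.reshape(b, t, n, d)

tokens = torch.randn(2, 8, 196, 192)               # 8 frames of 14x14 patch embeddings
out = DividedSpaceTimeBlock()(tokens)
```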
arXiv Detail & Related papers (2021-02-09T19:49:33Z) - Learnable Sampling 3D Convolution for Video Enhancement and Action Recognition [24.220358793070965]
We introduce a new module, LS3D-Conv, to improve the capability of 3D convolution.
It adds learnable 2D offsets to 3D convolution, which sample locations on the spatial feature maps across frames.
The experiments on video interpolation, video super-resolution, video denoising, and action recognition demonstrate the effectiveness of our approach.
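An assumed simplification of the idea (not the paper's LS3D-Conv implementation): a small head predicts a 2D offset field per frame, each frame's feature map is resampled at the shifted locations, and an ordinary 3D convolution is applied to the warped volume.

```python
# Hypothetical offset-sampled 3D convolution; offset range and layer sizes
# are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetSampled3DConv(nn.Module):
    def __init__(self, cin=16, cout=16):
        super().__init__()
        self.offset = nn.Conv3d(cin, 2, kernel_size=3, padding=1)  # (dx, dy) per voxel
        self.conv = nn.Conv3d(cin, cout, kernel_size=3, padding=1)

    def forward(self, x):                            # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        off = torch.tanh(self.offset(x)) * 0.1       # small normalized offsets
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        base = torch.stack([xs, ys], dim=-1).to(x)   # (H, W, 2) identity sampling grid
        grid = base + off.permute(0, 2, 3, 4, 1).reshape(b * t, h, w, 2)
        frames = x.transpose(1, 2).reshape(b * t, c, h, w)
        warped = F.grid_sample(frames, grid, align_corners=True)
        warped = warped.reshape(b, t, c, h, w).transpose(1, 2)
        return self.conv(warped)                     # standard 3D conv on resampled frames

y = OffsetSampled3DConv()(torch.randn(1, 16, 8, 32, 32))   # (1, 16, 8, 32, 32)
```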
arXiv Detail & Related papers (2020-11-22T09:20:49Z) - A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method improves on the UCF101 action recognition benchmark over state-of-the-art real-time methods by 5.4% in accuracy, runs 2 times faster at inference, and uses a model of less than 5 MB.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)