Temporal-Spatial Feature Pyramid for Video Saliency Detection
- URL: http://arxiv.org/abs/2105.04213v1
- Date: Mon, 10 May 2021 09:14:14 GMT
- Title: Temporal-Spatial Feature Pyramid for Video Saliency Detection
- Authors: Qinyao Chang, Shiping Zhu, Lanyun Zhu
- Abstract summary: We propose a 3D fully convolutional encoder-decoder architecture for video saliency detection.
Our model is simple yet effective, and can run in real time.
- Score: 2.578242050187029
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a 3D fully convolutional encoder-decoder
architecture for video saliency detection, which combines scale, space and time
information for video saliency modeling. The encoder extracts multi-scale
temporal-spatial features from the input continuous video frames, and then
constructs temporal-spatial feature pyramid through temporal-spatial
convolution and top-down feature integration. The decoder performs hierarchical
decoding of temporal-spatial features from different scales, and finally
produces a saliency map from the integration of multiple video frames. Our
model is simple yet effective, and can run in real time. We perform extensive
experiments, and the results indicate that the well-designed structure can
significantly improve the precision of video saliency detection. Experimental
results on three purely visual video saliency benchmarks and six audio-video
saliency benchmarks demonstrate that our method achieves state-of-the-art
performance.
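To make the described pipeline concrete, here is a minimal PyTorch sketch of a 3D fully convolutional encoder with top-down feature-pyramid integration and a simple decoding head. It is an illustrative reconstruction, not the authors' code: class names, channel widths, strides, and the way time is collapsed are all assumptions.

```python
# Hypothetical sketch of a temporal-spatial feature pyramid for saliency;
# layer sizes are illustrative, not the paper's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TSBlock(nn.Module):
    """One temporal-spatial (3D) convolution block."""
    def __init__(self, cin, cout, stride=(1, 2, 2)):
        super().__init__()
        self.conv = nn.Conv3d(cin, cout, kernel_size=3, stride=stride, padding=1)
        self.bn = nn.BatchNorm3d(cout)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))

class TSFPSaliency(nn.Module):
    """Encoder extracts multi-scale temporal-spatial features, a top-down
    pathway builds the pyramid, and the head fuses them into one saliency map."""
    def __init__(self, ch=(16, 32, 64, 128)):
        super().__init__()
        self.enc = nn.ModuleList([
            TSBlock(3, ch[0]),
            TSBlock(ch[0], ch[1], stride=(2, 2, 2)),
            TSBlock(ch[1], ch[2], stride=(2, 2, 2)),
            TSBlock(ch[2], ch[3], stride=(2, 2, 2)),
        ])
        # 1x1x1 lateral convs project every scale to a common width
        self.lateral = nn.ModuleList([nn.Conv3d(c, ch[0], 1) for c in ch])
        self.head = nn.Conv3d(ch[0], 1, kernel_size=1)

    def forward(self, clip):                          # clip: (B, 3, T, H, W)
        feats, x = [], clip
        for block in self.enc:
            x = block(x)
            feats.append(x)
        # top-down integration: upsample coarse levels and add lateral features
        p = self.lateral[-1](feats[-1])
        for i in range(len(feats) - 2, -1, -1):
            p = F.interpolate(p, size=feats[i].shape[2:], mode="trilinear",
                              align_corners=False) + self.lateral[i](feats[i])
        # decode: collapse the temporal axis and upsample to input resolution
        sal = self.head(p).mean(dim=2)                # (B, 1, h, w)
        sal = F.interpolate(sal, size=clip.shape[-2:], mode="bilinear",
                            align_corners=False)
        return torch.sigmoid(sal)

saliency = TSFPSaliency()(torch.randn(1, 3, 16, 112, 112))   # (1, 1, 112, 112)
```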
Related papers
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
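As a hedged illustration of the explicit half of such a hybrid representation (an assumed reading of the summary above, not RAVEN's actual design), the sketch below stores three axis-aligned 2D feature planes over (x, y), (x, t) and (y, t) and sums bilinear samples from them for each space-time query point.

```python
# Hypothetical tri-plane lookup for video; channel count and resolution
# are illustrative.
import torch
import torch.nn.functional as F

C, R = 32, 64                                    # feature channels, plane resolution
planes = torch.randn(3, C, R, R)                 # (x, y), (x, t) and (y, t) planes

def triplane_features(points):
    # points: (N, 3) normalized (x, y, t) coordinates in [-1, 1]
    x, y, t = points[:, 0], points[:, 1], points[:, 2]
    coords = torch.stack([
        torch.stack([x, y], dim=-1),             # query the (x, y) plane
        torch.stack([x, t], dim=-1),             # query the (x, t) plane
        torch.stack([y, t], dim=-1),             # query the (y, t) plane
    ])                                           # (3, N, 2)
    grid = coords.unsqueeze(1)                   # (3, 1, N, 2), as grid_sample expects
    sampled = F.grid_sample(planes, grid, mode="bilinear", align_corners=True)
    return sampled.squeeze(2).sum(dim=0).t()     # (N, C) summed plane features

feats = triplane_features(torch.rand(1024, 3) * 2 - 1)   # features for 1024 points
```

A small implicit decoder (e.g. an MLP) would then map these features to pixel values; that part is omitted here.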
arXiv Detail & Related papers (2024-01-11T16:48:44Z) - On the Relevance of Temporal Features for Medical Ultrasound Video Recognition [0.0]
We propose a novel multi-head attention architecture to achieve better sample efficiency on common ultrasound tasks.
We compare the performance of our architecture to an efficient 3D CNN video recognition model in two settings.
These results suggest that expressive time-independent models may be more effective than state-of-the-art video recognition models for some common ultrasound tasks in the low-data regime.
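As a hedged sketch of what a time-independent model over per-frame features could look like (the backbone, the attention pooling, and all sizes here are assumptions, not the paper's architecture): each frame is embedded by a 2D encoder and a multi-head attention layer pools the set of frame features without modeling their order.

```python
# Hypothetical order-invariant video classifier: 2D per-frame features
# pooled with multi-head attention.
import torch
import torch.nn as nn

class FrameAttentionClassifier(nn.Module):
    def __init__(self, dim=256, heads=4, num_classes=2):
        super().__init__()
        self.backbone = nn.Sequential(            # stand-in per-frame 2D encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, clip):                      # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        frames = self.backbone(clip.flatten(0, 1)).view(b, t, -1)
        pooled, _ = self.attn(self.query.expand(b, -1, -1), frames, frames)
        return self.fc(pooled.squeeze(1))         # prediction ignores frame order

logits = FrameAttentionClassifier()(torch.randn(2, 8, 3, 96, 96))
```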
arXiv Detail & Related papers (2023-10-16T14:35:29Z) - Three-Stage Cascade Framework for Blurry Video Frame Interpolation [23.38547327916875]
Blurry video frame interpolation (BVFI) aims to generate high-frame-rate clear videos from low-frame-rate blurry videos.
BVFI methods usually fail to fully leverage all valuable information, which ultimately hinders their performance.
We propose a simple end-to-end three-stage framework to fully explore useful information from blurry videos.
arXiv Detail & Related papers (2023-10-09T03:37:30Z) - A Spatial-Temporal Deformable Attention based Framework for Breast Lesion Detection in Videos [107.96514633713034]
We propose a spatial-temporal deformable attention based framework, named STNet.
Our STNet introduces a spatial-temporal deformable attention module to perform local spatial-temporal feature fusion.
Experiments on the public breast lesion ultrasound video dataset show that our STNet obtains a state-of-the-art detection performance.
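Below is a heavily simplified, assumed sketch of spatial-temporal deformable attention: each query location predicts a few 3D sampling offsets and weights, gathers features from the video volume at those offset locations, and fuses the weighted samples. STNet's actual module (multi-scale, multi-head, with its own normalization) will differ.

```python
# Hypothetical single-head spatial-temporal deformable attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class STDeformAttn(nn.Module):
    def __init__(self, dim=64, points=4):
        super().__init__()
        self.points = points
        self.offset = nn.Linear(dim, points * 3)       # (dx, dy, dt) per sampling point
        self.weight = nn.Linear(dim, points)
        self.proj = nn.Linear(dim, dim)

    def forward(self, feat, ref):
        # feat: (B, C, T, H, W) video features; ref: (B, N, 3) query locations in [-1, 1]
        b, c, t, h, w = feat.shape
        n = ref.shape[1]
        q = F.grid_sample(feat, ref.reshape(b, n, 1, 1, 3), align_corners=True)
        q = q.reshape(b, c, n).transpose(1, 2)          # (B, N, C) query features
        off = torch.tanh(self.offset(q)).reshape(b, n, self.points, 3)
        wts = self.weight(q).softmax(dim=-1)            # (B, N, K) attention weights
        loc = (ref.unsqueeze(2) + 0.1 * off).clamp(-1, 1)
        sampled = F.grid_sample(feat, loc.reshape(b, n, self.points, 1, 3),
                                align_corners=True)     # (B, C, N, K, 1)
        sampled = sampled.squeeze(-1).permute(0, 2, 3, 1)    # (B, N, K, C)
        return self.proj((wts.unsqueeze(-1) * sampled).sum(dim=2))  # (B, N, C)

out = STDeformAttn()(torch.randn(2, 64, 8, 32, 32), torch.rand(2, 100, 3) * 2 - 1)
```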
arXiv Detail & Related papers (2023-09-09T07:00:10Z) - You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z) - Condensing a Sequence to One Informative Frame for Video Recognition [113.3056598548736]
This paper studies a two-step alternative that first condenses the video sequence to an informative "frame" and then applies image-based recognition to that synthetic frame.
A valid question is how to define "useful information" and how to distill it from a sequence down to one synthetic frame.
IFS consistently demonstrates evident improvements on image-based 2D networks and clip-based 3D networks.
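A toy sketch of the general condensation idea (not the paper's IFS procedure; the scorer and weighting scheme are assumptions): a small network scores every frame, and the weighted sum of frames forms one synthetic frame that any image-based 2D model can consume.

```python
# Hypothetical frame condensation via learned per-frame weights.
import torch
import torch.nn as nn

class FrameCondenser(nn.Module):
    def __init__(self):
        super().__init__()
        self.scorer = nn.Sequential(               # per-frame importance score
            nn.Conv2d(3, 8, 3, stride=4, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))

    def forward(self, clip):                       # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        scores = self.scorer(clip.flatten(0, 1)).view(b, t)
        weights = scores.softmax(dim=1).view(b, t, 1, 1, 1)
        return (weights * clip).sum(dim=1)         # (B, 3, H, W) synthetic frame

frame = FrameCondenser()(torch.randn(2, 16, 3, 112, 112))   # feed to any 2D network
```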
arXiv Detail & Related papers (2022-01-11T16:13:43Z) - Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z) - Is Space-Time Attention All You Need for Video Understanding? [50.78676438502343]
We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
"TimeSformer" adapts the standard Transformer architecture to video by enabling feature learning from a sequence of frame-level patches.
TimeSformer achieves state-of-the-art results on several major action recognition benchmarks.
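A compact sketch of one divided space-time attention block over frame-level patch tokens (dimensions and depth are illustrative, and the MLP sub-layer is omitted; this is not the released TimeSformer code): tokens first attend across time at the same spatial position, then across space within the same frame.

```python
# Hypothetical divided space-time self-attention block.
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                          # x: (B, T, N, D) patch tokens
        b, t, n, d = x.shape
        xt = self.norm1(x).permute(0, 2, 1, 3).reshape(b * n, t, d)
        xt, _ = self.time_attn(xt, xt, xt)         # attention across time
        x = x + xt.reshape(b, n, t, d).permute(0, 2, 1, 3)
        xs = self.norm2(x).reshape(b * t, n, d)
        xs, _ = self.space_attn(xs, xs, xs)        # attention across space
        return x + xs.reshape(b, t, n, d)

tokens = torch.randn(2, 8, 196, 192)               # 8 frames of 14x14 patch embeddings
out = DividedSpaceTimeBlock()(tokens)
```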
arXiv Detail & Related papers (2021-02-09T19:49:33Z) - Learnable Sampling 3D Convolution for Video Enhancement and Action Recognition [24.220358793070965]
We introduce a new module, LS3D-Conv, to improve the capability of 3D convolution.
It adds learnable 2D offsets to 3D convolution, which sample locations on the spatial feature maps across frames.
The experiments on video interpolation, video super-resolution, video denoising, and action recognition demonstrate the effectiveness of our approach.
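An assumed simplification of the idea (not the paper's LS3D-Conv implementation): a small head predicts a 2D offset field per frame, each frame's feature map is resampled at the shifted locations, and an ordinary 3D convolution is applied to the warped volume.

```python
# Hypothetical offset-sampled 3D convolution; offset range and layer sizes
# are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetSampled3DConv(nn.Module):
    def __init__(self, cin=16, cout=16):
        super().__init__()
        self.offset = nn.Conv3d(cin, 2, kernel_size=3, padding=1)  # (dx, dy) per voxel
        self.conv = nn.Conv3d(cin, cout, kernel_size=3, padding=1)

    def forward(self, x):                            # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        off = torch.tanh(self.offset(x)) * 0.1       # small normalized offsets
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        base = torch.stack([xs, ys], dim=-1).to(x)   # (H, W, 2) identity sampling grid
        grid = base + off.permute(0, 2, 3, 4, 1).reshape(b * t, h, w, 2)
        frames = x.transpose(1, 2).reshape(b * t, c, h, w)
        warped = F.grid_sample(frames, grid, align_corners=True)
        warped = warped.reshape(b, t, c, h, w).transpose(1, 2)
        return self.conv(warped)                     # standard 3D conv on resampled frames

y = OffsetSampled3DConv()(torch.randn(1, 16, 8, 32, 32))   # (1, 16, 8, 32, 32)
```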
arXiv Detail & Related papers (2020-11-22T09:20:49Z) - A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method improves on the UCF101 action recognition benchmark over state-of-the-art real-time methods by 5.4% in accuracy, runs 2 times faster at inference, and uses a model of less than 5 MB.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)