VS-Net: Multiscale Spatiotemporal Features for Lightweight Video Salient
Document Detection
- URL: http://arxiv.org/abs/2301.04447v1
- Date: Wed, 11 Jan 2023 13:07:31 GMT
- Title: VS-Net: Multiscale Spatiotemporal Features for Lightweight Video Salient
Document Detection
- Authors: Hemraj Singh, Mridula Verma, Ramalingaswamy Cheruku
- Abstract summary: We propose VS-Net, which captures multi-scale spatiotemporal information with the help of dilated depth-wise separable convolution and Approximation Rank Pooling.
Our model generates saliency maps considering both the background and foreground, making it perform better in challenging scenarios.
Extensive experiments conducted on the benchmark MIDV-500 dataset show that the VS-Net model outperforms state-of-the-art approaches in both time and robustness measures.
- Score: 0.2578242050187029
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Salient Document Detection (VSDD) is an essential task of practical
computer vision, which aims to highlight visually salient document regions in
video frames. Previous techniques for VSDD focus on learning features without
considering the cooperation between the appearance and motion cues, and thus
fail to perform well in practical scenarios. Moreover, most of the previous
techniques demand high computational resources, which limits the usage of such
systems in resource-constrained settings. To handle these issues, we propose
VS-Net, which captures multi-scale spatiotemporal information with the help of
dilated depth-wise separable convolution and Approximation Rank Pooling. VS-Net
extracts the key features locally from each frame across embedding sub-spaces
and forwards the features between adjacent and parallel nodes, enhancing model
performance globally. Our model generates saliency maps considering both the
background and foreground simultaneously, making it perform better in
challenging scenarios. Extensive experiments conducted on the benchmark
MIDV-500 dataset show that the VS-Net model outperforms state-of-the-art
approaches in both time and robustness measures.
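For readers who want to see the first named building block concretely, below is a minimal PyTorch sketch of a dilated depth-wise separable convolution stacked at several dilation rates. The module names, channel sizes, and the rates (1, 2, 4) are illustrative assumptions, not the paper's published configuration.

```python
import torch
import torch.nn as nn

class DilatedDWSepConv(nn.Module):
    """Depth-wise separable convolution with a dilation rate.

    The depth-wise conv filters each channel independently; the 1x1
    point-wise conv then mixes channels. Dilation enlarges the
    receptive field without adding parameters, which is why stacking
    several rates yields multi-scale context at low cost.
    """
    def __init__(self, in_ch, out_ch, dilation=1):
        super().__init__()
        self.depthwise = nn.Conv2d(
            in_ch, in_ch, kernel_size=3, padding=dilation,
            dilation=dilation, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class MultiScaleBlock(nn.Module):
    """Parallel dilated branches concatenated into multi-scale features.

    The dilation rates (1, 2, 4) are an assumption for illustration.
    """
    def __init__(self, in_ch, branch_ch, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            [DilatedDWSepConv(in_ch, branch_ch, d) for d in rates])
        self.fuse = nn.Conv2d(branch_ch * len(rates), in_ch, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

# Usage: a batch of 8 frames with 32 feature channels at 64x64.
feats = torch.randn(8, 32, 64, 64)
out = MultiScaleBlock(32, 16)(feats)   # -> torch.Size([8, 32, 64, 64])
```

Because the depth-wise stage uses groups equal to its input channels, each extra dilation rate widens the receptive field at almost no parameter cost, which is what makes this block attractive for a lightweight model.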
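Approximation Rank Pooling is only named in the abstract. A common closed-form approximation from the dynamic-image literature, which VS-Net may or may not follow exactly, collapses a clip of T frames into one motion-aware descriptor using fixed weights; the formula below is that standard approximation, stated here as an assumption.

```python
import torch

def approx_rank_pool(frames: torch.Tensor) -> torch.Tensor:
    """Collapse a clip into one image using rank-pooling weights.

    frames: tensor of shape (T, C, H, W), time-ordered.
    Returns a (C, H, W) temporal descriptor ("dynamic image").

    Weights follow the closed-form approximation
        alpha_t = 2*(T - t + 1) - (T + 1) * (H_T - H_{t-1}),
    with H_t the t-th harmonic number. Later frames receive larger
    weights, so the result encodes the direction of change over time.
    """
    T = frames.shape[0]
    # harmonics = [H_0, H_1, ..., H_T], with H_0 = 0
    harmonics = torch.cat(
        [torch.zeros(1), torch.cumsum(1.0 / torch.arange(1, T + 1), dim=0)])
    t = torch.arange(1, T + 1, dtype=torch.float32)
    alpha = 2 * (T - t + 1) - (T + 1) * (harmonics[T] - harmonics[:-1])
    return (alpha.view(T, 1, 1, 1) * frames).sum(dim=0)

# Usage: an 8-frame clip -> one motion-aware map per channel.
clip = torch.randn(8, 3, 64, 64)
dyn = approx_rank_pool(clip)           # -> torch.Size([3, 64, 64])
```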
Related papers
- Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts [57.01985221057047]
This paper introduces a novel method that learns spatio-temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs).
Our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.
arXiv Detail & Related papers (2024-08-12T03:31:29Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations (a minimal sketch of the matching stage appears after this list).
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
arXiv Detail & Related papers (2023-06-14T17:07:51Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z)
- Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
Full-Duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously before the fusion and decoding stage (a minimal sketch of this exchange appears after this list).
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z)
- MVFuseNet: Improving End-to-End Object Detection and Motion Forecasting through Multi-View Fusion of LiDAR Data [4.8061970432391785]
We propose MVFuseNet, a novel end-to-end method for joint object detection and motion forecasting from a temporal sequence of LiDAR data.
We show the benefits of our multi-view approach for the tasks of detection and motion forecasting on two large-scale self-driving data sets.
arXiv Detail & Related papers (2021-04-21T21:29:08Z)
- Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)
- Representation Learning with Video Deep InfoMax [26.692717942430185]
We extend DeepInfoMax to the video domain by leveraging similar structure in spatio-temporal networks.
We find that drawing views from both natural-rate sequences and temporally-downsampled sequences yields competitive results on Kinetics-pretrained action recognition tasks.
arXiv Detail & Related papers (2020-07-27T02:28:47Z)
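As noted in the TAPIR entry above, its stage (1) locates a candidate match for the query point independently in every frame. Below is a minimal, hypothetical sketch of such a matching step using plain dot-product similarity; the tensor shapes and scoring are assumptions, and TAPIR's actual matching heads are more elaborate.

```python
import torch

def per_frame_matches(query_feat: torch.Tensor,
                      frame_feats: torch.Tensor) -> torch.Tensor:
    """Stage-1 matching: best candidate location in every frame.

    query_feat:  (D,) feature vector of the queried point.
    frame_feats: (T, D, H, W) feature maps for T frames.
    Returns (T, 2) integer (row, col) coordinates, one per frame.
    """
    T, D, H, W = frame_feats.shape
    # Dot-product similarity between the query and every spatial position.
    scores = torch.einsum('d,tdhw->thw', query_feat, frame_feats)
    flat = scores.view(T, -1).argmax(dim=1)                # best index per frame
    rows = torch.div(flat, W, rounding_mode='floor')
    cols = flat % W
    return torch.stack((rows, cols), dim=1)

# Usage: 16 frames of 128-dim features on a 32x32 grid.
coords = per_frame_matches(torch.randn(128), torch.randn(16, 128, 32, 32))
print(coords.shape)  # torch.Size([16, 2])
```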
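Similarly, the simultaneous "transmission and receiving" described in the FSNet entry can be pictured as a symmetric exchange between an appearance branch and a motion branch before fusion. The sigmoid-gated scheme below is a hypothetical simplification for illustration, not FSNet's published design.

```python
import torch
import torch.nn as nn

class BidirectionalExchange(nn.Module):
    """Symmetric cross-modal feature passing before fusion.

    Each modality simultaneously sends a 1x1-conv "message" to the
    other and receives one back, so appearance and motion cues are
    refined jointly rather than fused only at the end.
    """
    def __init__(self, ch):
        super().__init__()
        self.to_motion = nn.Conv2d(ch, ch, kernel_size=1)      # appearance -> motion
        self.to_appearance = nn.Conv2d(ch, ch, kernel_size=1)  # motion -> appearance

    def forward(self, appearance, motion):
        # Both updates read the *original* inputs: a full-duplex
        # exchange, not a sequential hand-off.
        new_app = appearance + torch.sigmoid(self.to_appearance(motion)) * appearance
        new_mot = motion + torch.sigmoid(self.to_motion(appearance)) * motion
        return new_app, new_mot

# Usage: exchange 64-channel appearance and optical-flow features.
app, mot = torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56)
app2, mot2 = BidirectionalExchange(64)(app, mot)
```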