SpVOS: Efficient Video Object Segmentation with Triple Sparse
Convolution
- URL: http://arxiv.org/abs/2310.15115v1
- Date: Mon, 23 Oct 2023 17:21:33 GMT
- Title: SpVOS: Efficient Video Object Segmentation with Triple Sparse
Convolution
- Authors: Weihao Lin, Tao Chen, Chong Yu
- Abstract summary: This work develops a novel triple sparse convolution to reduce the computation costs of the overall video object segmentation framework.
Experiments are conducted on two mainstream VOS datasets, including DAVIS and Youtube-VOS.
Results show that the proposed SpVOS achieves superior performance over other state-of-the-art sparse methods and maintains performance comparable to the non-sparse baseline.
- Score: 18.332130780309797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semi-supervised video object segmentation (Semi-VOS), which requires only
annotating the first frame of a video to segment future frames, has received
increased attention recently. Among existing pipelines, the
memory-matching-based one is becoming the main research stream, as it can fully
utilize the temporal sequence information to obtain high-quality segmentation
results. Even though this type of method has achieved promising performance,
the overall framework still suffers from heavy computation overhead, mainly
caused by the per-frame dense convolution operations between high-resolution
feature maps and each kernel filter. Therefore, we propose a sparse baseline of
VOS named SpVOS in this work, which develops a novel triple sparse convolution
to reduce the computation costs of the overall VOS framework. The designed
triple gate, taking full consideration of both spatial and temporal redundancy
between adjacent video frames, adaptively makes a three-way decision about how
to apply the sparse convolution at each pixel, controlling the computation
overhead of each layer, while maintaining sufficient discrimination capability
to distinguish similar objects and avoid error accumulation. A mixed sparse
training strategy, coupled with a designed objective considering the sparsity
constraint, is also developed to balance the VOS segmentation performance and
computation costs. Experiments are conducted on two mainstream VOS datasets,
including DAVIS and Youtube-VOS. Results show that the proposed SpVOS achieves
superior performance over other state-of-the-art sparse methods and maintains
performance comparable to the typical non-sparse VOS baseline, e.g., an 83.04%
(79.29%) overall score on the DAVIS-2017 (Youtube-VOS) validation set versus
82.88% (80.36%) for the baseline, while saving up to 42% of FLOPs, showing its
application potential for resource-constrained scenarios.
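The abstract describes the triple gate only at a conceptual level. The following PyTorch sketch illustrates one way a per-pixel three-way gating decision over a convolution could look; the specific branches (reuse the previous frame's feature, a cheap depthwise convolution, the full convolution), the Gumbel-softmax gate, and the sparsity penalty are illustrative assumptions, not the architecture actually used in SpVOS.

```python
# Hypothetical sketch of a per-pixel triple-gated convolution layer.
# The gate predicts, for every spatial location, one of three actions:
#   0 - reuse the previous frame's feature (exploit temporal redundancy),
#   1 - apply a cheap depthwise convolution (exploit spatial redundancy),
#   2 - apply the full dense convolution (keep discrimination capability).
# Branch choices, the Gumbel-softmax gate, and the layer shapes are
# illustrative assumptions, not the design described in the SpVOS paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TripleGatedConv(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.full_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.light_conv = nn.Conv2d(channels, channels, 3, padding=1,
                                    groups=channels)  # cheap depthwise path
        # Tiny gate head producing 3 logits per pixel from the current
        # feature and the cached feature of the previous frame.
        self.gate = nn.Conv2d(2 * channels, 3, 1)

    def forward(self, x: torch.Tensor, prev_feat: torch.Tensor):
        logits = self.gate(torch.cat([x, prev_feat], dim=1))
        # Differentiable hard decision (straight-through) during training.
        decision = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=1)
        reuse, light, full = decision[:, 0:1], decision[:, 1:2], decision[:, 2:3]
        # For simplicity all branches are computed densely here; a real sparse
        # implementation would only run each conv on the pixels routed to it.
        out = (reuse * prev_feat
               + light * self.light_conv(x)
               + full * self.full_conv(x))
        # Sparsity regularizer: penalize pixels routed to the full conv,
        # standing in for the sparsity-constrained objective.
        sparsity_loss = full.mean()
        return out, sparsity_loss


# Usage: one frame's feature map plus the feature cached from the last frame.
layer = TripleGatedConv(channels=64)
cur = torch.randn(1, 64, 120, 216)
prev = torch.randn(1, 64, 120, 216)
out, sp_loss = layer(cur, prev)
print(out.shape, float(sp_loss))
```

In a mixed sparse training setup of this kind, the gate is trained jointly with the segmentation loss plus the sparsity penalty, so the network learns how many pixels per layer can safely skip the full convolution.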
Related papers
- fVDB: A Deep-Learning Framework for Sparse, Large-Scale, and High-Performance Spatial Intelligence [50.417261057533786]
fVDB is a novel framework for deep learning on large-scale 3D data.
Our framework is fully integrated with PyTorch enabling interoperability with existing pipelines.
arXiv Detail & Related papers (2024-07-01T20:20:33Z)
- Efficient Video Object Segmentation via Modulated Cross-Attention Memory [123.12273176475863]
We propose a transformer-based approach, named MAVOS, to model temporal smoothness without requiring frequent memory expansion.
Our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU.
arXiv Detail & Related papers (2024-03-26T17:59:58Z)
- OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework [24.947436083365925]
OneVOS is a novel framework that unifies the core components of VOS with All-in-One Transformer.
OneVOS achieves state-of-the-art performance across 7 datasets, particularly excelling in the complex LVOS and MOSE datasets with 70.1% and 66.4% J&F, surpassing previous state-of-the-art methods by 4.2% and 7.0%, respectively.
arXiv Detail & Related papers (2024-03-13T16:38:26Z)
- Spectrum-guided Multi-granularity Referring Video Object Segmentation [56.95836951559529]
Current referring video object segmentation (R-VOS) techniques extract conditional kernels from encoded (low-resolution) vision-language features to segment the decoded high-resolution features.
This causes significant feature drift, which the segmentation kernels struggle to perceive during the forward computation.
We propose a Spectrum-guided Multi-granularity approach, which performs direct segmentation on the encoded features and employs visual details to further optimize the masks.
arXiv Detail & Related papers (2023-07-25T14:35:25Z)
- Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos [42.944135041061166]
We propose an altering resolution framework called AR-Seg for compressed videos to achieve efficient video segmentation.
AR-Seg aims to reduce the computational cost by using low resolution for non-keyframes.
Experiments on CamVid and Cityscapes show that AR-Seg achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-03-13T15:58:15Z)
- Look Before You Match: Instance Understanding Matters in Video Object Segmentation [114.57723592870097]
In this paper, we argue that instance understanding matters in video object segmentation (VOS).
We present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal matching with the memory bank.
We employ well-learned object queries from the IS branch to inject instance-specific information into the query key, with which instance-augmented matching is further performed.
arXiv Detail & Related papers (2022-12-13T18:59:59Z)
- HALSIE: Hybrid Approach to Learning Segmentation by Simultaneously Exploiting Image and Event Modalities [6.543272301133159]
Event cameras detect changes in per-pixel intensity to generate asynchronous event streams.
They offer great potential for accurate semantic map retrieval in real-time autonomous systems.
Existing implementations for event segmentation suffer from sub-par performance.
We propose HALSIE, a hybrid end-to-end learning framework, which reduces inference cost by up to $20\times$ compared to prior art.
arXiv Detail & Related papers (2022-11-19T17:09:50Z)
- Lightweight and Progressively-Scalable Networks for Semantic Segmentation [100.63114424262234]
Multi-scale learning frameworks have been regarded as a capable class of models to boost semantic segmentation.
In this paper, we thoroughly analyze the design of convolutional blocks and the ways of interactions across multiple scales.
We devise Lightweight and Progressively-Scalable Networks (LPS-Net), which expand the network complexity in a greedy manner.
arXiv Detail & Related papers (2022-07-27T16:00:28Z)
- Region Aware Video Object Segmentation with Deep Motion Modeling [56.95836951559529]
Region Aware Video Object Segmentation (RAVOS) is a method that predicts regions of interest for efficient object segmentation and memory storage.
For efficient segmentation, object features are extracted according to the ROIs, and an object decoder is designed for object-level segmentation.
For efficient memory storage, we propose motion path memory to filter out redundant context by memorizing the features within the motion path of objects between two frames.
arXiv Detail & Related papers (2022-07-21T01:44:40Z)
- Learning Dynamic Network Using a Reuse Gate Function in Semi-supervised Video Object Segmentation [27.559093073097483]
Current approaches for Semi-supervised Video Object Segmentation (Semi-VOS) propagate information from previous frames to generate a segmentation mask for the current frame.
We exploit the observation that consecutive frames often change little, using temporal information to quickly identify frames with minimal change.
We propose a novel dynamic network that estimates change across frames and decides which path -- computing a full network or reusing previous frame's feature -- to choose.
arXiv Detail & Related papers (2020-12-21T19:40:17Z)