SWEM: Towards Real-Time Video Object Segmentation with Sequential
Weighted Expectation-Maximization
- URL: http://arxiv.org/abs/2208.10128v1
- Date: Mon, 22 Aug 2022 08:03:59 GMT
- Title: SWEM: Towards Real-Time Video Object Segmentation with Sequential
Weighted Expectation-Maximization
- Authors: Zhihui Lin, Tianyu Yang, Maomao Li, Ziyu Wang, Chun Yuan, Wenhao
Jiang, and Wei Liu
- Abstract summary: We propose a novel Sequential Weighted Expectation-Maximization (SWEM) network to greatly reduce the redundancy of memory features.
SWEM combines intra-frame and inter-frame similar features by leveraging the sequential weighted EM algorithm.
Experiments on commonly used DAVIS and YouTube-VOS datasets verify the high efficiency (36 FPS) and high performance (84.3% $\mathcal{J}\&\mathcal{F}$ on the DAVIS 2017 validation set).
- Score: 36.43412404616356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Matching-based methods, especially those based on space-time memory, are
significantly ahead of other solutions in semi-supervised video object
segmentation (VOS). However, continuously growing and redundant template
features lead to an inefficient inference. To alleviate this, we propose a
novel Sequential Weighted Expectation-Maximization (SWEM) network to greatly
reduce the redundancy of memory features. Different from the previous methods
which only detect feature redundancy between frames, SWEM merges both
intra-frame and inter-frame similar features by leveraging the sequential
weighted EM algorithm. Further, adaptive weights for frame features endow SWEM
with the flexibility to represent hard samples, improving the discrimination of
templates. Besides, the proposed method maintains a fixed number of template
features in memory, which ensures the stable inference complexity of the VOS
system. Extensive experiments on commonly used DAVIS and YouTube-VOS datasets
verify the high efficiency (36 FPS) and high performance (84.3\%
$\mathcal{J}\&\mathcal{F}$ on DAVIS 2017 validation dataset) of SWEM. Code is
available at: https://github.com/lmm077/SWEM.
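The compression step the abstract describes, merging intra-frame and inter-frame similar features into a fixed set of weighted templates, can be sketched as a weighted EM (soft clustering) iteration. The snippet below is a generic NumPy illustration under assumed shapes; the function name, parameters, and update details are hypothetical, not the paper's implementation.

```python
import numpy as np

def weighted_em_templates(features, weights, num_bases=8, num_iters=4, tau=0.05):
    """Compress N weighted feature vectors into a fixed number of template
    bases with a weighted EM (soft k-means) iteration.

    features: (N, C) frame features; weights: (N,) adaptive importances.
    Returns the (num_bases, C) templates and each basis's accumulated weight.
    """
    rng = np.random.default_rng(0)
    bases = features[rng.choice(len(features), num_bases, replace=False)]
    for _ in range(num_iters):
        # E-step: soft-assign every feature to every basis by similarity
        sim = features @ bases.T / tau                      # (N, K)
        sim -= sim.max(axis=1, keepdims=True)               # numerical stability
        resp = np.exp(sim)
        resp /= resp.sum(axis=1, keepdims=True)
        # Weight responsibilities by per-feature importance, then M-step:
        # each basis becomes the weighted mean of its assigned features.
        wr = resp * weights[:, None]                        # (N, K)
        bases = (wr.T @ features) / (wr.sum(axis=0)[:, None] + 1e-8)
    return bases, wr.sum(axis=0)
```

Because the template count is fixed regardless of how many frames have been seen, per-frame matching cost stays constant, which is the property the abstract highlights for stable inference complexity.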
Related papers
- LiVOS: Light Video Object Segmentation with Gated Linear Matching [116.58237547253935]
LiVOS is a lightweight memory network that employs linear matching via linear attention.
For longer and higher-resolution videos, it matches STM-based methods with 53% less GPU memory and supports 4096p inference on a 32GB consumer-grade GPU.
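For context, "linear matching via linear attention" refers to reordering the attention computation so cost grows linearly with sequence length instead of quadratically. A minimal, generic sketch using the common `elu+1` positive feature map (LiVOS's learned gating is omitted; names and shapes here are illustrative):

```python
import numpy as np

def linear_attention(q, k, v, eps=1e-6):
    """Linear attention: replaces softmax(QK^T)V (quadratic in length N)
    with phi(Q) @ (phi(K)^T @ V), which is linear in N.
    q, k: (N, d) queries/keys; v: (N, dv) values."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, positive
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                      # (d, dv): summarize keys/values once
    z = qf @ kf.sum(axis=0) + eps      # (N,): per-query normalizer
    return (qf @ kv) / z[:, None]
```

The key/value summary `kv` is a fixed-size matrix, so memory does not grow with the number of stored frames, which is what enables the reported savings over quadratic STM-style matching.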
arXiv Detail & Related papers (2024-11-05T05:36:17Z)
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
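Distributing features evenly across a limited number of clusters is what a Sinkhorn-Knopp balancing step enforces. A generic sketch of balancing a feature-to-cluster score matrix (not SIGMA's implementation; parameters are illustrative):

```python
import numpy as np

def sinkhorn_assign(scores, num_iters=3, tau=0.5):
    """Balance an (N, K) feature-to-cluster score matrix with Sinkhorn-Knopp
    iterations so each cluster receives roughly equal total mass.
    Returns soft assignments whose rows each sum to 1."""
    q = np.exp((scores - scores.max()) / tau)       # positive, stable
    n, k = q.shape
    q /= q.sum()
    for _ in range(num_iters):
        q /= q.sum(axis=0, keepdims=True); q /= k   # clusters -> mass 1/k each
        q /= q.sum(axis=1, keepdims=True); q /= n   # samples  -> mass 1/n each
    return q * n  # rescale so each row is a distribution over clusters
```

Plain softmax assignment can collapse most features onto a few clusters; the alternating row/column normalization pushes column sums toward N/K, which is the "evenly distributed" property the summary describes.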
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- Video Semantic Segmentation with Inter-Frame Feature Fusion and Inner-Frame Feature Refinement [39.06589186472675]
We propose a spatial-temporal fusion (STF) module to model dense pairwise relationships among multi-frame features.
Besides, we propose a novel memory-augmented refinement (MAR) module to tackle difficult predictions among semantic boundaries.
arXiv Detail & Related papers (2023-01-10T07:57:05Z)
- Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation [68.45737688496654]
We establish correspondences directly between frames without re-encoding the mask features for every object.
With the correspondences, every node in the current query frame is inferred by aggregating features from the past in an associative fashion.
We validated that every memory node now has a chance to contribute, and experimentally showed that such diversified voting is beneficial to both memory efficiency and inference accuracy.
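The "associative" readout, in which every query-frame location aggregates past features by affinity so that every memory node has a chance to contribute, can be sketched generically as a softmax-weighted memory lookup. Names and shapes below are illustrative, not the paper's code:

```python
import numpy as np

def memory_readout(query, mem_keys, mem_values, tau=1.0):
    """Associative space-time-memory readout: each query location mixes
    past-frame values weighted by key affinity (softmax over all memory
    nodes). query: (Nq, Ck); mem_keys: (Nm, Ck); mem_values: (Nm, Cv)."""
    aff = query @ mem_keys.T / tau            # (Nq, Nm) pairwise affinities
    aff -= aff.max(axis=1, keepdims=True)     # numerical stability
    w = np.exp(aff)
    w /= w.sum(axis=1, keepdims=True)         # every memory node can contribute
    return w @ mem_values                     # (Nq, Cv) aggregated readout
```

Because the softmax spreads weight across all memory nodes rather than picking a single best match, the "diversified voting" the summary mentions falls out of the aggregation itself.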
arXiv Detail & Related papers (2021-06-09T16:50:57Z)
- Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM [0.0]
We propose an efficient two-stream deep learning architecture leveraging Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet.
SepConvLSTM is constructed by replacing convolution operation at each gate of ConvLSTM with a depthwise separable convolution.
Our model outperforms prior methods on the larger and more challenging RWF-2000 dataset by more than a 2% accuracy margin.
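A depthwise separable convolution factors a standard k×k convolution into a per-channel spatial filter followed by a 1×1 channel mixer, cutting parameters from k·k·C_in·C_out to k·k·C_in + C_in·C_out; swapping this pair in at each ConvLSTM gate is the substitution the summary describes. A plain NumPy sketch (stride 1, no padding; illustrative, not the paper's code):

```python
import numpy as np

def depthwise_separable_conv2d(x, dw_kernels, pw_weights):
    """Depthwise separable convolution: one k x k filter per input channel
    (depthwise), then a 1x1 convolution mixing channels (pointwise).

    x: (C, H, W); dw_kernels: (C, k, k); pw_weights: (C_out, C).
    """
    c, h, w = x.shape
    k = dw_kernels.shape[1]
    oh, ow = h - k + 1, w - k + 1
    dw = np.zeros((c, oh, ow))
    for ch in range(c):  # depthwise: each channel filtered independently
        for i in range(oh):
            for j in range(ow):
                dw[ch, i, j] = np.sum(x[ch, i:i+k, j:j+k] * dw_kernels[ch])
    # Pointwise: a 1x1 conv mixes channels at every spatial location.
    return np.einsum('oc,chw->ohw', pw_weights, dw)
```

For a 3×3 gate with 64 input and 64 output channels, this drops the weight count from 3·3·64·64 = 36,864 to 3·3·64 + 64·64 = 4,672, which is where the efficiency gain comes from.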
arXiv Detail & Related papers (2021-02-21T12:01:48Z)
- GhostSR: Learning Ghost Features for Efficient Image Super-Resolution [49.393251361038025]
Single image super-resolution (SISR) systems based on convolutional neural networks (CNNs) achieve impressive performance but require huge computational costs.
We propose to use shift operation to generate the redundant features (i.e., Ghost features) of SISR models.
We show that both the non-compact and lightweight SISR models embedded in our proposed module can achieve comparable performance to that of their baselines.
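The shift operation generates "ghost" features as translated copies of intrinsic feature maps rather than computing them with extra convolutions, so they cost no multiply-accumulates. A generic, parameter-free sketch of the idea (function names and the shift-selection scheme here are hypothetical):

```python
import numpy as np

def shift_map(x, dy, dx):
    """Translate a 2-D feature map by (dy, dx), zero-filling the border."""
    out = np.zeros_like(x)
    h, w = x.shape
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        x[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def ghost_features(intrinsic, shifts):
    """Concatenate intrinsic feature maps with shifted 'ghost' copies.
    intrinsic: (C, H, W); shifts: list of (dy, dx), one ghost per entry."""
    ghosts = [shift_map(intrinsic[i % len(intrinsic)], dy, dx)
              for i, (dy, dx) in enumerate(shifts)]
    return np.concatenate([intrinsic, np.stack(ghosts)], axis=0)
```

Since a shift is just an index remapping, replacing a fraction of convolutional output channels with shifted copies reduces both parameters and FLOPs while keeping the feature count unchanged.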
arXiv Detail & Related papers (2021-01-21T10:09:47Z) - Hierarchical Dynamic Filtering Network for RGB-D Salient Object
Detection [91.43066633305662]
The central question in RGB-D salient object detection (SOD) is how to better integrate and utilize cross-modal fusion information.
In this paper, we explore these issues from a new perspective.
We implement a kind of more flexible and efficient multi-scale cross-modal feature processing.
arXiv Detail & Related papers (2020-07-13T07:59:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.