Rethinking Space-Time Networks with Improved Memory Coverage for
Efficient Video Object Segmentation
- URL: http://arxiv.org/abs/2106.05210v1
- Date: Wed, 9 Jun 2021 16:50:57 GMT
- Title: Rethinking Space-Time Networks with Improved Memory Coverage for
Efficient Video Object Segmentation
- Authors: Ho Kei Cheng, Yu-Wing Tai, Chi-Keung Tang
- Abstract summary: We establish correspondences directly between frames without re-encoding the mask features for every object.
With the correspondences, every node in the current query frame is inferred by aggregating features from the past in an associative fashion.
We validated that every memory node now has a chance to contribute, and experimentally showed that such diversified voting is beneficial to both memory efficiency and inference accuracy.
- Score: 68.45737688496654
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper presents a simple yet effective approach to modeling space-time
correspondences in the context of video object segmentation. Unlike most
existing approaches, we establish correspondences directly between frames
without re-encoding the mask features for every object, leading to a highly
efficient and robust framework. With the correspondences, every node in the
current query frame is inferred by aggregating features from the past in an
associative fashion. We cast the aggregation process as a voting problem and
find that the existing inner-product affinity leads to poor use of memory with
a small (fixed) subset of memory nodes dominating the votes, regardless of the
query. In light of this phenomenon, we propose using the negative squared
Euclidean distance instead to compute the affinities. We validated that every
memory node now has a chance to contribute, and experimentally showed that such
diversified voting is beneficial to both memory efficiency and inference
accuracy. The synergy of correspondence networks and diversified voting works
exceedingly well, achieving new state-of-the-art results on both DAVIS and
YouTubeVOS datasets while running significantly faster at 20+ FPS for multiple
objects without bells and whistles.
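The core of the proposed change is the affinity used when reading from memory. Below is a minimal, hypothetical PyTorch sketch of that idea (the `aggregate` helper, tensor names, and shapes are illustrative and not taken from the authors' code): it contrasts the inner-product affinity with the proposed negative squared Euclidean distance, then performs the softmax "voting" aggregation of memory value features.

```python
import torch

def aggregate(memory_key, memory_value, query_key, use_l2=True):
    """Read from memory by softmax voting over affinities (illustrative sketch).

    memory_key:   (N, Ck)  keys of all past memory nodes
    memory_value: (N, Cv)  value features stored with each memory node
    query_key:    (M, Ck)  keys of the current frame's query nodes
    """
    if use_l2:
        # Proposed: negative squared Euclidean distance. Every memory node
        # can win votes for some query, which diversifies memory usage.
        affinity = -torch.cdist(query_key, memory_key, p=2) ** 2   # (M, N)
    else:
        # Baseline: inner product. Memory nodes with large norms tend to
        # dominate the votes regardless of the query.
        affinity = query_key @ memory_key.T                        # (M, N)

    weights = torch.softmax(affinity, dim=1)   # each query votes over memory nodes
    return weights @ memory_value              # (M, Cv) aggregated readout

# Toy usage with random features
mk, mv = torch.randn(1024, 64), torch.randn(1024, 512)
qk = torch.randn(256, 64)
readout = aggregate(mk, mv, qk)                # -> shape (256, 512)
```

Since $\|q-k\|^2 = \|q\|^2 - 2\,q\cdot k + \|k\|^2$ and the $\|q\|^2$ term is constant for each query, the switch effectively adds a $-\|k\|^2$ penalty on large-norm keys inside the softmax, which is why it costs essentially nothing extra at inference time.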
Related papers
- Space-time Reinforcement Network for Video Object Segmentation [16.67780344875854]
Video object segmentation (VOS) networks typically use memory-based methods.
These methods suffer from two issues: 1) Challenging data can destroy the space-time coherence between adjacent video frames, and 2) Pixel-level matching will lead to undesired mismatching.
In this paper, we propose to generate an auxiliary frame between adjacent frames, serving as an implicit short-temporal reference for the query frame.
arXiv Detail & Related papers (2024-05-07T06:26:30Z)
- Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
arXiv Detail & Related papers (2024-03-28T13:32:49Z)
- Video Object Segmentation with Dynamic Query Modulation [23.811776213359625]
We propose a query modulation method, termed QMVOS, for single-object and multi-object segmentation.
Our method can bring significant improvements to the memory-based SVOS method and achieve competitive performance on standard SVOS benchmarks.
arXiv Detail & Related papers (2024-03-18T07:31:39Z)
- SWEM: Towards Real-Time Video Object Segmentation with Sequential Weighted Expectation-Maximization [36.43412404616356]
We propose a novel Sequential Weighted Expectation-Maximization (SWEM) network to greatly reduce the redundancy of memory features.
SWEM combines intra-frame and inter-frame similar features by leveraging the sequential weighted EM algorithm (a rough sketch of this idea appears after the related-papers list).
Experiments on the commonly used DAVIS and YouTube-VOS datasets verify the high efficiency (36 FPS) and high performance (84.3% $\mathcal{J}\&\mathcal{F}$ on the DAVIS 2017 validation set).
arXiv Detail & Related papers (2022-08-22T08:03:59Z)
- Per-Clip Video Object Segmentation [110.08925274049409]
Recently, memory-based approaches have shown promising results on semi-supervised video object segmentation.
We treat video object segmentation as clip-wise mask propagation.
We propose a new method tailored for per-clip inference.
arXiv Detail & Related papers (2022-08-03T09:02:29Z)
- Efficient Global-Local Memory for Real-time Instrument Segmentation of Robotic Surgical Video [53.14186293442669]
We identify two important clues for surgical instrument perception, including local temporal dependency from adjacent frames and global semantic correlation in long-range duration.
We propose a novel dual-memory network (DMNet) to relate both global and local-temporal knowledge.
Our method largely outperforms the state-of-the-art works on segmentation accuracy while maintaining a real-time speed.
arXiv Detail & Related papers (2021-09-28T10:10:14Z)
- Efficient Regional Memory Network for Video Object Segmentation [56.587541750729045]
We propose a novel local-to-local matching solution for semi-supervised VOS, namely Regional Memory Network (RMNet)
The proposed RMNet effectively alleviates the ambiguity of similar objects in both memory and query frames.
Experimental results indicate that the proposed RMNet performs favorably against state-of-the-art methods on the DAVIS and YouTube-VOS datasets.
arXiv Detail & Related papers (2021-03-24T02:08:46Z)
- Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in a semi-supervised setting.
We propose a novel graph neural network (TG-Net) which captures local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z)
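As referenced in the SWEM entry above, here is a rough, hypothetical sketch of a sequential weighted-EM style memory update (the function name, variable names, and hyper-parameters are illustrative; the exact formulation in the SWEM paper differs): a fixed set of K memory bases is softly re-estimated from each new frame's features, so the memory never grows with video length.

```python
import torch

def weighted_em_update(bases, base_mass, feats, iters=3, inv_temp=50.0):
    """One sequential weighted-EM step for a fixed-size feature memory (sketch).

    bases:     (K, C) current memory bases
    base_mass: (K,)   accumulated soft-assignment mass per basis
    feats:     (N, C) features of the newly arrived frame
    """
    for _ in range(iters):
        # E-step: soft-assign the new frame's features to the existing bases.
        resp = torch.softmax(inv_temp * feats @ bases.T, dim=1)    # (N, K)
        new_mass = resp.sum(dim=0)                                 # (K,)
        # M-step: bases become weighted means of the old bases and the
        # newly assigned features, keeping exactly K memory nodes.
        numer = base_mass.unsqueeze(1) * bases + resp.T @ feats
        bases = numer / (base_mass + new_mass + 1e-6).unsqueeze(1)
        bases = torch.nn.functional.normalize(bases, dim=1)
    return bases, base_mass + new_mass

# Toy usage: 128 bases of dimension 64, updated with one frame of 4096 features
bases = torch.nn.functional.normalize(torch.randn(128, 64), dim=1)
mass = torch.ones(128)
bases, mass = weighted_em_update(bases, mass, torch.randn(4096, 64))
```

Reading from such a memory then matches queries against the K bases instead of against every past-frame feature, which is the kind of redundancy reduction described in the SWEM entry above.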
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.