Fast Video Object Segmentation With Temporal Aggregation Network and
Dynamic Template Matching
- URL: http://arxiv.org/abs/2007.05687v1
- Date: Sat, 11 Jul 2020 05:44:16 GMT
- Title: Fast Video Object Segmentation With Temporal Aggregation Network and
Dynamic Template Matching
- Authors: Xuhua Huang, Jiarui Xu, Yu-Wing Tai, Chi-Keung Tang
- Abstract summary: We introduce "tracking-by-detection" into Video Object Segmentation (VOS).
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark without complicated bells and whistles in both speed and accuracy, with a speed of 0.14 second per frame and J&F measure of 75.9% respectively.
- Score: 67.02962970820505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Significant progress has been made in Video Object Segmentation (VOS), the
video object tracking task at its finest level. While the VOS task can be
naturally decoupled into image semantic segmentation and video object tracking,
significantly more research effort has been made in segmentation than
tracking. In this paper, we introduce "tracking-by-detection" into VOS which
can coherently integrate segmentation into tracking, by proposing a new
temporal aggregation network and a novel dynamic time-evolving template
matching mechanism to achieve significantly improved performance. Notably, our
method is entirely online and thus suitable for one-shot learning, and our
end-to-end trainable model allows multiple object segmentation in one forward
pass. We achieve new state-of-the-art performance on the DAVIS benchmark
without complicated bells and whistles in both speed and accuracy, with a speed
of 0.14 second per frame and J&F measure of 75.9% respectively.
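The two figures quoted in the abstract can be put in context with a small sketch. It assumes the usual DAVIS convention that J is the intersection-over-union of the predicted and ground-truth masks and that J&F is the average of mean J and mean boundary accuracy F. The function names and the toy masks are illustrative only and are not taken from the paper; the boundary measure F is not computed here and is passed in as a number.

```python
def region_similarity(pred, gt):
    """Jaccard index (J): intersection over union of two binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_mean, f_mean):
    """J&F: average of mean region similarity and mean boundary accuracy."""
    return (j_mean + f_mean) / 2.0


def frames_per_second(seconds_per_frame):
    """Convert a per-frame runtime into throughput."""
    return 1.0 / seconds_per_frame


# Toy 2x3 masks (hypothetical data, not from the benchmark).
pred = [[0, 1, 1],
        [0, 1, 0]]
gt = [[0, 1, 0],
      [0, 1, 1]]

j = region_similarity(pred, gt)  # 2 overlapping pixels / 4 pixels in the union
fps = frames_per_second(0.14)    # runtime reported in the abstract
```

Under these assumptions, the reported 0.14 second per frame corresponds to roughly 7 frames per second.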
Related papers
- Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT)
We jointly adopt sparse points and boxes tracking, filtering out unstable points and capturing object-wise information.
Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
arXiv Detail & Related papers (2024-06-08T14:25:57Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- SiamMask: A Framework for Fast Online Object Tracking and Segmentation [96.61632757952292]
SiamMask is a framework to perform both visual object tracking and video object segmentation, in real-time, with the same simple method.
We show that it is possible to extend the framework to handle multiple object tracking and segmentation by simply re-using the multi-task model.
It yields real-time state-of-the-art results on visual-object tracking benchmarks, while at the same time demonstrating competitive performance at a high speed for video object segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T14:47:17Z)
- STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation [47.28515170195206]
Video Instance Segmentation (VIS) is a task that simultaneously requires classification, segmentation, and instance association in a video.
Recent VIS approaches rely on sophisticated pipelines to achieve this goal, including RoI-related operations or 3D convolutions.
We present a simple and efficient single-stage VIS framework based on the instance segmentation method CondInst.
arXiv Detail & Related papers (2022-02-08T09:34:26Z)
- CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation [67.17625278621134]
Video instance segmentation is a complex task in which we need to detect, segment, and track each object for any given video.
Previous approaches only utilize single-frame features for the detection, segmentation, and tracking of objects.
We propose a novel comprehensive feature aggregation approach (CompFeat) to refine features at both frame-level and object-level with temporal and spatial context information.
arXiv Detail & Related papers (2020-12-07T00:31:42Z)
- End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z)