Fast Video Object Segmentation With Temporal Aggregation Network and
Dynamic Template Matching
- URL: http://arxiv.org/abs/2007.05687v1
- Date: Sat, 11 Jul 2020 05:44:16 GMT
- Title: Fast Video Object Segmentation With Temporal Aggregation Network and
Dynamic Template Matching
- Authors: Xuhua Huang, Jiarui Xu, Yu-Wing Tai, Chi-Keung Tang
- Abstract summary: We introduce "tracking-by-detection" into Video Object Segmentation (VOS).
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy, without complicated bells and whistles, running at 0.14 seconds per frame with a J&F measure of 75.9%.
- Score: 67.02962970820505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Significant progress has been made in Video Object Segmentation (VOS),
the video object tracking task at its finest level. While the VOS task can be
naturally decoupled into image semantic segmentation and video object tracking,
significantly more research effort has gone into segmentation than tracking. In
this paper, we introduce "tracking-by-detection" into VOS, which coherently
integrates segmentation into tracking by proposing a new temporal aggregation
network and a novel dynamic time-evolving template matching mechanism to
achieve significantly improved performance. Notably, our method is entirely
online and thus suitable for one-shot learning, and our end-to-end trainable
model allows multiple object segmentation in one forward pass. We achieve new
state-of-the-art performance on the DAVIS benchmark in both speed and accuracy,
without complicated bells and whistles, running at 0.14 seconds per frame with
a J&F measure of 75.9%.
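The core idea of a time-evolving template can be sketched as follows: keep a normalized appearance feature for the target, match it against candidate regions in each new frame, and let it drift toward the matched appearance. The exponential-moving-average update, the momentum value, and cosine-similarity matching below are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class DynamicTemplate:
    """Illustrative time-evolving template (assumed mechanism, not the
    paper's actual one): the template blends the initial target feature
    with features of recently matched regions."""

    def __init__(self, init_feat, momentum=0.9):
        # Keep the template unit-normalized so similarities stay comparable.
        self.template = init_feat / (np.linalg.norm(init_feat) + 1e-8)
        self.momentum = momentum  # assumed value; higher = slower drift

    def match(self, candidate_feats):
        # Pick the candidate region most similar to the current template.
        scores = [cosine_similarity(self.template, f) for f in candidate_feats]
        best = int(np.argmax(scores))
        return best, scores[best]

    def update(self, matched_feat):
        # Slowly drift the template toward the newly matched appearance.
        blended = self.momentum * self.template + (1 - self.momentum) * matched_feat
        self.template = blended / (np.linalg.norm(blended) + 1e-8)
```

In use, `match` would be called once per frame on region features produced by the segmentation backbone, followed by `update` with the winning feature, so the template tracks gradual appearance change without being re-initialized.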
Related papers
- Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation [49.113131249753714]
We propose an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues.
MTNet is devised by effectively merging appearance and motion features during the feature extraction process within encoders.
We employ a cascade of decoders across all feature levels to optimally exploit the derived features.
arXiv Detail & Related papers (2025-01-14T03:15:46Z) - Temporally Consistent Dynamic Scene Graphs: An End-to-End Approach for Action Tracklet Generation [1.6584112749108326]
TCDSG, Temporally Consistent Dynamic Scene Graphs, is an end-to-end framework that detects, tracks, and links subject-object relationships across time.
Our work sets a new standard in multi-frame video analysis, opening new avenues for high-impact applications in surveillance, autonomous navigation, and beyond.
arXiv Detail & Related papers (2024-12-03T20:19:20Z) - Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z) - SiamMask: A Framework for Fast Online Object Tracking and Segmentation [96.61632757952292]
SiamMask is a framework to perform both visual object tracking and video object segmentation, in real-time, with the same simple method.
We show that it is possible to extend the framework to handle multiple object tracking and segmentation by simply re-using the multi-task model.
It yields real-time state-of-the-art results on visual-object tracking benchmarks, while at the same time demonstrating competitive performance at a high speed for video object segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T14:47:17Z) - CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation [67.17625278621134]
Video instance segmentation is a complex task in which we need to detect, segment, and track each object for any given video.
Previous approaches only utilize single-frame features for the detection, segmentation, and tracking of objects.
We propose a novel comprehensive feature aggregation approach (CompFeat) to refine features at both frame-level and object-level with temporal and spatial context information.
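The frame-level side of such temporal aggregation can be illustrated generically: weight neighboring-frame features by their similarity to the current frame, then blend the weighted sum into the current feature. This is a minimal sketch of the general idea, not CompFeat's actual architecture; the softmax weighting and the 50/50 blend ratio are assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def temporal_aggregate(current, neighbors):
    """Generic frame-level temporal aggregation sketch (assumed scheme):
    neighbors more similar to the current frame contribute more."""
    sims = np.array([
        float(current @ n) / (np.linalg.norm(current) * np.linalg.norm(n) + 1e-8)
        for n in neighbors
    ])
    weights = softmax(sims)
    aggregated = sum(w * n for w, n in zip(weights, neighbors))
    # Blend ratio is an assumption; a learned gate would replace it in practice.
    return 0.5 * current + 0.5 * aggregated
```

A learned attention module would replace the hand-set similarity and blend here; the sketch only shows why multi-frame context can stabilize per-frame detection features.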
arXiv Detail & Related papers (2020-12-07T00:31:42Z) - End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.