Human Instance Segmentation and Tracking via Data Association and
Single-stage Detector
- URL: http://arxiv.org/abs/2203.16966v1
- Date: Thu, 31 Mar 2022 11:36:09 GMT
- Title: Human Instance Segmentation and Tracking via Data Association and
Single-stage Detector
- Authors: Lu Cheng and Mingbo Zhao
- Abstract summary: Human video instance segmentation plays an important role in computer understanding of human activities.
Most current VIS methods are based on the Mask R-CNN framework.
We develop a new method for human video instance segmentation based on a single-stage detector.
- Score: 17.46922710432633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human video instance segmentation plays an important role in computer
understanding of human activities and is widely used in video processing, video
surveillance, and human modeling in virtual reality. Most current VIS methods
are based on the Mask R-CNN framework, where the appearance and motion
information used for data matching increases the computational cost and harms
real-time segmentation performance; on the other hand, existing VIS datasets
pay little attention to all the people appearing in a video. In this paper, to
solve these problems, we develop a new method for human video instance
segmentation based on a single-stage detector. To track instances across the
video, we adopt a data association strategy that matches the same instance
across the video sequence, jointly learning target instance appearances and
their affinities in a pair of video frames in an end-to-end fashion. We also
adopt a centroid sampling strategy to enhance instance embedding extraction,
which biases the sampled instance position toward the inside of each instance
mask under heavy overlap. As a result, even if there is a sudden change in a
person's activity, the sampled position does not move out of the mask, so the
problem of the same instance being represented by two different instances is
alleviated. Finally, we assemble several video instance segmentation datasets
into the PVIS dataset to fill the current lack of datasets dedicated to human
video segmentation. Extensive experiments on this dataset verify the
effectiveness and efficiency of the proposed method.
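The data association step described in the abstract, matching the same instance between a pair of frames via learned appearance embeddings and their affinities, can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes each frame's instances are already represented as embedding vectors, builds a cosine-similarity affinity matrix, and matches greedily above a threshold (`match_instances` and `sim_thresh` are illustrative names, not from the paper):

```python
import numpy as np

def match_instances(prev_emb, curr_emb, sim_thresh=0.5):
    """Greedily match instance embeddings between two frames.

    prev_emb: (N, D) embeddings from the previous frame
    curr_emb: (M, D) embeddings from the current frame
    Returns a list of (prev_idx, curr_idx) pairs; unmatched current
    instances would start new tracks (handled by the caller).
    """
    # Cosine-similarity affinity matrix between the two frames.
    a = prev_emb / np.linalg.norm(prev_emb, axis=1, keepdims=True)
    b = curr_emb / np.linalg.norm(curr_emb, axis=1, keepdims=True)
    affinity = a @ b.T  # shape (N, M)

    matches = []
    used_prev, used_curr = set(), set()
    # Consume the highest-affinity pairs first (greedy assignment).
    for idx in np.argsort(affinity, axis=None)[::-1]:
        i, j = divmod(int(idx), affinity.shape[1])
        if i in used_prev or j in used_curr:
            continue
        if affinity[i, j] < sim_thresh:
            break  # remaining pairs are all weaker than the threshold
        matches.append((i, j))
        used_prev.add(i)
        used_curr.add(j)
    return matches
```

A production tracker would typically replace the greedy loop with optimal assignment (e.g. the Hungarian algorithm) and learn the affinities end-to-end as the paper describes; the greedy version keeps the sketch dependency-free.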
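The centroid sampling idea, biasing the instance position to the inside of its mask so it cannot jump onto an overlapping neighbor, can also be sketched. The snippet below is an illustration under assumed inputs (a boolean mask per instance), not the paper's exact procedure: it takes the plain centroid, and if that point falls outside the mask (as happens for concave or heavily occluded shapes), snaps it to the nearest mask pixel:

```python
import numpy as np

def sample_center(mask):
    """Pick a representative point guaranteed to lie inside the mask.

    mask: (H, W) boolean array for one instance.
    The plain centroid of a concave or heavily occluded mask can fall
    outside it; in that case we snap to the closest mask pixel, so the
    sampled position never leaves the instance.
    """
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()  # plain centroid
    ry, rx = int(round(cy)), int(round(cx))
    if mask[ry, rx]:
        return ry, rx
    # Centroid lies outside the mask: snap to the nearest mask pixel.
    d2 = (ys - cy) ** 2 + (xs - cx) ** 2
    k = int(np.argmin(d2))
    return int(ys[k]), int(xs[k])
```

For a U-shaped mask the plain centroid lands in the gap between the two arms; the snap step moves it onto the mask, which is exactly the failure mode the centroid sampling strategy guards against.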
Related papers
- Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection [12.417754433715903]
We present FAIM, a new VOD method that enhances temporal Feature Aggregation by leveraging Instance Mask features.
Using YOLOX as a base detector, FAIM achieves 87.9% mAP on the ImageNet VID dataset at 33 FPS on a single 2080Ti GPU.
arXiv Detail & Related papers (2024-12-06T10:12:10Z) - Multi-Granularity Video Object Segmentation [36.06127939037613]
We propose a large-scale, densely annotated multi-granularity video object segmentation (MUG-VOS) dataset.
We automatically collected a training set that assists in tracking both salient and non-salient objects, and we also curated a human-annotated test set for reliable evaluation.
In addition, we present a memory-based mask propagation model (MMPM), trained and evaluated on the MUG-VOS dataset.
arXiv Detail & Related papers (2024-12-02T13:17:41Z) - Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT).
We jointly adopt sparse points and boxes tracking, filtering out unstable points and capturing object-wise information.
Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
arXiv Detail & Related papers (2024-06-08T14:25:57Z) - Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z) - Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z) - Tag-Based Attention Guided Bottom-Up Approach for Video Instance
Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple, end-to-end trainable bottom-up approach that produces instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method achieves competitive results on the YouTube-VIS and DAVIS-19 datasets with the lowest run-time among contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z) - Reliable Shot Identification for Complex Event Detection via
Visual-Semantic Embedding [72.9370352430965]
We propose a visual-semantic guided loss method for event detection in videos.
Motivated by curriculum learning, we introduce a negative elastic regularization term to start training the classifier with instances of high reliability.
An alternating optimization algorithm is developed to solve the proposed challenging non-convex regularization problem.
arXiv Detail & Related papers (2021-10-12T11:46:56Z) - 1st Place Solution for YouTubeVOS Challenge 2021:Video Instance
Segmentation [0.39146761527401414]
Video Instance Segmentation (VIS) is a multi-task problem performing detection, segmentation, and tracking simultaneously.
We propose two modules, named Temporally Correlated Instance Segmentation (TCIS) and Bidirectional Tracking (BiTrack).
By combining these techniques with a bag of tricks, the network performance is significantly boosted compared to the baseline.
arXiv Detail & Related papers (2021-06-12T00:20:38Z) - CompFeat: Comprehensive Feature Aggregation for Video Instance
Segmentation [67.17625278621134]
Video instance segmentation is a complex task in which we need to detect, segment, and track each object for any given video.
Previous approaches only utilize single-frame features for the detection, segmentation, and tracking of objects.
We propose a novel comprehensive feature aggregation approach (CompFeat) to refine features at both frame-level and object-level with temporal and spatial context information.
arXiv Detail & Related papers (2020-12-07T00:31:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.