Human Instance Segmentation and Tracking via Data Association and
Single-stage Detector
- URL: http://arxiv.org/abs/2203.16966v1
- Date: Thu, 31 Mar 2022 11:36:09 GMT
- Title: Human Instance Segmentation and Tracking via Data Association and
Single-stage Detector
- Authors: Lu Cheng and Mingbo Zhao
- Abstract summary: Human video instance segmentation plays an important role in computer understanding of human activities.
Most current VIS methods are based on the Mask R-CNN framework.
We develop a new method for human video instance segmentation based on a single-stage detector.
- Score: 17.46922710432633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human video instance segmentation plays an important role in computer
understanding of human activities and is widely used in video processing, video
surveillance, and human modeling in virtual reality. Most current VIS methods
are based on the Mask R-CNN framework, where the appearance and motion
information used for data matching increases the computational cost and harms
real-time segmentation performance; on the other hand, existing VIS datasets
pay little attention to all the people appearing in a video. In this paper, to
solve these problems, we develop a new method for human video instance
segmentation based on a single-stage detector. To track instances across the
video, we adopt a data association strategy that matches the same instance
across the video sequence, jointly learning target instance appearances and
their affinities in a pair of video frames in an end-to-end fashion. We also
adopt a centroid sampling strategy to enhance instance embedding extraction,
which biases the sampled instance position toward the inside of each instance
mask under heavy overlap. As a result, even if there is a sudden change in a
person's activity, the sampled position does not move out of the mask, so the
problem of the same instance being represented by two different instances is
alleviated. Finally, we assemble several video instance segmentation datasets
into the PVIS dataset to fill the current lack of datasets dedicated to human
video segmentation. Extensive experiments on this dataset verify the
effectiveness and efficiency of the proposed method.
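The data association step described in the abstract, matching the same instance between a pair of frames via learned appearance embeddings and their affinities, can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes each frame's instances are already represented as embedding vectors, builds a cosine-similarity affinity matrix, and matches greedily above a threshold (`match_instances` and `sim_thresh` are illustrative names, not from the paper):

```python
import numpy as np

def match_instances(prev_emb, curr_emb, sim_thresh=0.5):
    """Greedily match instance embeddings between two frames.

    prev_emb: (N, D) embeddings from the previous frame
    curr_emb: (M, D) embeddings from the current frame
    Returns a list of (prev_idx, curr_idx) pairs; unmatched current
    instances would start new tracks (handled by the caller).
    """
    # Cosine-similarity affinity matrix between the two frames.
    a = prev_emb / np.linalg.norm(prev_emb, axis=1, keepdims=True)
    b = curr_emb / np.linalg.norm(curr_emb, axis=1, keepdims=True)
    affinity = a @ b.T  # shape (N, M)

    matches = []
    used_prev, used_curr = set(), set()
    # Consume the highest-affinity pairs first (greedy assignment).
    for idx in np.argsort(affinity, axis=None)[::-1]:
        i, j = divmod(int(idx), affinity.shape[1])
        if i in used_prev or j in used_curr:
            continue
        if affinity[i, j] < sim_thresh:
            break  # remaining pairs are all weaker than the threshold
        matches.append((i, j))
        used_prev.add(i)
        used_curr.add(j)
    return matches
```

A production tracker would typically replace the greedy loop with optimal assignment (e.g. the Hungarian algorithm) and learn the affinities end-to-end as the paper describes; the greedy version keeps the sketch dependency-free.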
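The centroid sampling idea, biasing the instance position to the inside of its mask so it cannot jump onto an overlapping neighbor, can also be sketched. The snippet below is an illustration under assumed inputs (a boolean mask per instance), not the paper's exact procedure: it takes the plain centroid, and if that point falls outside the mask (as happens for concave or heavily occluded shapes), snaps it to the nearest mask pixel:

```python
import numpy as np

def sample_center(mask):
    """Pick a representative point guaranteed to lie inside the mask.

    mask: (H, W) boolean array for one instance.
    The plain centroid of a concave or heavily occluded mask can fall
    outside it; in that case we snap to the closest mask pixel, so the
    sampled position never leaves the instance.
    """
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()  # plain centroid
    ry, rx = int(round(cy)), int(round(cx))
    if mask[ry, rx]:
        return ry, rx
    # Centroid lies outside the mask: snap to the nearest mask pixel.
    d2 = (ys - cy) ** 2 + (xs - cx) ** 2
    k = int(np.argmin(d2))
    return int(ys[k]), int(xs[k])
```

For a U-shaped mask the plain centroid lands in the gap between the two arms; the snap step moves it onto the mask, which is exactly the failure mode the centroid sampling strategy guards against.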
Related papers
- Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection [12.417754433715903]
We present FAIM, a new VOD method that enhances temporal Feature Aggregation by leveraging Instance Mask features.
Using YOLOX as a base detector, FAIM achieves 87.9% mAP on the ImageNet VID dataset at 33 FPS on a single 2080Ti GPU.
arXiv Detail & Related papers (2024-12-06T10:12:10Z) - Multi-Granularity Video Object Segmentation [36.06127939037613]
We propose a large-scale, densely annotated multi-granularity video object segmentation (MUG-VOS) dataset.
We automatically collected a training set that assists in tracking both salient and non-salient objects, and we also curated a human-annotated test set for reliable evaluation.
In addition, we present a memory-based mask propagation model (MMPM), trained and evaluated on the MUG-VOS dataset.
arXiv Detail & Related papers (2024-12-02T13:17:41Z) - Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT).
We jointly adopt sparse points and boxes tracking, filtering out unstable points and capturing object-wise information.
Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
arXiv Detail & Related papers (2024-06-08T14:25:57Z) - Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z) - Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z) - Tag-Based Attention Guided Bottom-Up Approach for Video Instance
Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple, end-to-end trainable bottom-up approach that produces instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method achieves competitive results on the YouTube-VIS and DAVIS-19 datasets with the lowest run-time among contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z) - Reliable Shot Identification for Complex Event Detection via
Visual-Semantic Embedding [72.9370352430965]
We propose a visual-semantic guided loss method for event detection in videos.
Motivated by curriculum learning, we introduce a negative elastic regularization term to start training the classifier with instances of high reliability.
An alternating optimization algorithm is developed to solve the proposed challenging non-convex regularization problem.
arXiv Detail & Related papers (2021-10-12T11:46:56Z) - 1st Place Solution for YouTubeVOS Challenge 2021:Video Instance
Segmentation [0.39146761527401414]
Video Instance Segmentation (VIS) is a multi-task problem performing detection, segmentation, and tracking simultaneously.
We propose two modules, named Temporally Correlated Instance Segmentation (TCIS) and Bidirectional Tracking (BiTrack).
By combining these techniques with a bag of tricks, the network performance is significantly boosted compared to the baseline.
arXiv Detail & Related papers (2021-06-12T00:20:38Z) - CompFeat: Comprehensive Feature Aggregation for Video Instance
Segmentation [67.17625278621134]
Video instance segmentation is a complex task in which we need to detect, segment, and track each object for any given video.
Previous approaches only utilize single-frame features for the detection, segmentation, and tracking of objects.
We propose a novel comprehensive feature aggregation approach (CompFeat) to refine features at both frame-level and object-level with temporal and spatial context information.
arXiv Detail & Related papers (2020-12-07T00:31:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.