Real-time Human-Centric Segmentation for Complex Video Scenes
- URL: http://arxiv.org/abs/2108.07199v1
- Date: Mon, 16 Aug 2021 16:07:51 GMT
- Title: Real-time Human-Centric Segmentation for Complex Video Scenes
- Authors: Ran Yu, Chenyu Tian, Weihao Xia, Xinyuan Zhao, Haoqian Wang, Yujiu Yang
- Abstract summary: Most existing video tasks related to "human" focus on the segmentation of salient humans, ignoring the unspecified others in the video.
Few studies have focused on segmenting and tracking all humans in a complex video, including pedestrians and humans in other states.
We propose a novel framework, abbreviated as HVISNet, that segments and tracks all people present in a given video based on a one-stage detector.
- Score: 16.57620683425904
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing video tasks related to "human" focus on the segmentation of
salient humans, ignoring the unspecified others in the video. Few studies have
focused on segmenting and tracking all humans in a complex video, including
pedestrians and humans in other states (e.g., seated, riding, or occluded). In
this paper, we propose a novel framework, abbreviated as HVISNet, that segments
and tracks all people present in a given video based on a one-stage detector.
To better evaluate complex scenes, we offer a new benchmark called HVIS (Human
Video Instance Segmentation), which comprises 1447 human instance masks in 805
high-resolution videos in diverse scenes. Extensive experiments show that our
proposed HVISNet outperforms the state-of-the-art methods in terms of accuracy
at a real-time inference speed (30 FPS), especially on complex video scenes. We
also notice that using the center of the bounding box to distinguish different
individuals severely deteriorates the segmentation accuracy, especially in
heavily occluded conditions. This common phenomenon is referred to as the
ambiguous positive samples problem. To alleviate this problem, we propose a
mechanism named Inner Center Sampling to improve the accuracy of instance
segmentation. This plug-and-play inner center sampling mechanism can be
incorporated into any instance segmentation model based on a one-stage detector
to improve performance. In particular, it yields a 4.1 mAP improvement over the
state-of-the-art method on occluded humans. Code and data are
available at https://github.com/IIGROUP/HVISNet.
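The abstract does not spell out the sampling rule, but the core idea of Inner Center Sampling, taking positive samples only at locations that fall on the instance's own pixels instead of at the (possibly occluded) box center, can be sketched as below. The function name and the FCOS-style location grid are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def inner_center_samples(mask: np.ndarray, locations: np.ndarray) -> np.ndarray:
    """Keep only candidate locations that land on the instance's own pixels.

    mask:      (H, W) binary instance mask.
    locations: (N, 2) integer (y, x) grid points, e.g. FCOS-style centers.
    Returns a boolean array marking locations usable as positive samples.
    """
    ys, xs = locations[:, 0], locations[:, 1]
    return mask[ys, xs] > 0  # a box center lying on another person is rejected

# Toy example: an occluder covers the box center of the person behind it.
mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:7, 1:3] = 1                       # visible part of the occluded person
locations = np.array([[4, 4],            # box center, lies on the occluder
                      [4, 2]])           # point inside the visible region
print(inner_center_samples(mask, locations))  # [False  True]
```

The toy example reproduces the ambiguous positive samples problem described above: the box center of the occluded person falls on the occluder and would mislead the mask head if kept as a positive.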
Related papers
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
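The summary names a sequence-level selection mechanism but not its criterion. A minimal sketch, assuming agreement with neighboring frames as the score (the paper would warp masks by optical flow before comparing; plain IoU is used here only to keep the sketch short):

```python
import numpy as np

def select_exemplars(masks: list, k: int = 3) -> list:
    """Pick the k flow-predicted masks most consistent with their neighbors,
    to serve as exemplars for refining the rest of the sequence."""
    def iou(a, b):
        union = np.logical_or(a, b).sum()
        return np.logical_and(a, b).sum() / union if union else 0.0

    scores = []
    for t, m in enumerate(masks):
        neighbors = [masks[s] for s in (t - 1, t + 1) if 0 <= s < len(masks)]
        scores.append(np.mean([iou(m, n) for n in neighbors]))
    return sorted(np.argsort(scores)[::-1][:k].tolist())
```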
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Scene Summarization: Clustering Scene Videos into Spatially Diverse Frames [24.614476456145255]
We propose summarization as a new video-based scene understanding task.
It aims to summarize a long video walkthrough of a scene into a small set of frames that are spatially diverse in the scene.
Our solution is a two-stage self-supervised pipeline named SceneSum.
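The two stages of SceneSum are not described in the summary; one standard way to realize spatially diverse selection, sketched here as an assumption, is greedy farthest-point sampling over per-frame embeddings or estimated camera positions:

```python
import numpy as np

def farthest_point_frames(embeddings: np.ndarray, k: int) -> list:
    """Greedily pick k frames that are maximally spread out in embedding
    (or camera-position) space: each pick is the frame farthest from all
    frames chosen so far."""
    chosen = [0]                                     # seed with the first frame
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())
        chosen.append(nxt)
        dists = np.minimum(dists,
                           np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return chosen
```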
arXiv Detail & Related papers (2023-11-28T22:18:26Z)
- SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation [11.198172694893927]
SportsSloMo is a benchmark consisting of more than 130K video clips and 1M video frames of high-resolution (≥720p) slow-motion sports videos crawled from YouTube.
We re-train several state-of-the-art methods on our benchmark, and the results show a drop in their accuracy relative to other datasets.
We introduce two loss terms that incorporate human-aware priors, adding auxiliary supervision from panoptic segmentation and human keypoint detection.
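A minimal sketch of how the two auxiliary terms could be combined with the interpolation objective; the weights and argument names are assumptions, as the summary gives neither:

```python
import torch

def human_aware_loss(l_interp: torch.Tensor,
                     l_panoptic: torch.Tensor,
                     l_keypoint: torch.Tensor,
                     w_seg: float = 0.1,
                     w_kpt: float = 0.1) -> torch.Tensor:
    """Frame-interpolation loss plus auxiliary supervision from panoptic
    segmentation and human keypoint detection; the weights are illustrative."""
    return l_interp + w_seg * l_panoptic + w_kpt * l_keypoint
```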
arXiv Detail & Related papers (2023-08-31T17:23:50Z)
- VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation [87.13210748484217]
VideoCutLER is a simple method for unsupervised multi-instance video segmentation without using motion-based learning signals like optical flow or training on natural videos.
We show the first competitive unsupervised learning results on the challenging YouTubeVIS 2019 benchmark, achieving 50.7% AP50 (video).
VideoCutLER can also serve as a strong pretrained model for supervised video instance segmentation tasks, exceeding DINO by 15.9% on YouTubeVIS 2019 in terms of video AP.
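The summary does not say where supervision comes from without optical flow or natural videos. One way such unsupervised pipelines synthesize it, sketched here as an assumption rather than VideoCutLER's exact procedure, is to cut an instance out of a still image and paste it along a scripted trajectory, so per-frame masks and identities are known by construction:

```python
import numpy as np

def synthesize_clip(instance: np.ndarray, mask: np.ndarray,
                    background: np.ndarray, length: int = 8, step: int = 4):
    """Fabricate a pseudo-video with free instance masks.

    instance, background: (H, W, 3) images; mask: (H, W) binary cut-out.
    The instance slides right by `step` pixels per frame, so ground-truth
    masks and identities come for free, with no motion signal needed.
    """
    frames, gts = [], []
    for t in range(length):
        dx = t * step
        shifted = np.roll(mask, dx, axis=1)
        shifted[:, :dx] = 0                          # no wrap-around
        frame = background.copy()
        frame[shifted > 0] = np.roll(instance, dx, axis=1)[shifted > 0]
        frames.append(frame)
        gts.append((shifted > 0).astype(np.uint8))
    return frames, gts
```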
arXiv Detail & Related papers (2023-08-28T17:10:12Z)
- Segment Anything Meets Point Tracking [116.44931239508578]
This paper presents a novel method for point-centric interactive video segmentation, empowered by SAM and long-term point tracking.
We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark.
Our experiments on popular video object segmentation and multi-object segmentation tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and efficient interactions.
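A rough sketch of the point-tracking-plus-SAM loop: a long-term point tracker propagates the user's query points and SAM converts them into a mask per frame. `track_points` is a hypothetical stand-in for a real tracker, and the checkpoint path is a placeholder; the `segment_anything` calls follow that library's public API.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path
predictor = SamPredictor(sam)

def segment_with_tracked_points(frames, query_points, track_points):
    """frames: list of HxWx3 uint8 RGB images; query_points: (N, 2) xy
    prompts on frame 0; track_points: hypothetical long-term point tracker
    returning (T, N, 2) xy positions per frame."""
    tracks = track_points(frames, query_points)
    masks = []
    for frame, pts in zip(frames, tracks):
        predictor.set_image(frame)
        m, scores, _ = predictor.predict(
            point_coords=pts.astype(np.float32),
            point_labels=np.ones(len(pts), dtype=np.int32))
        masks.append(m[int(scores.argmax())])        # keep the best proposal
    return masks
```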
arXiv Detail & Related papers (2023-07-03T17:58:01Z)
- Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach to achieve instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has minimal run-time compared to other contemporary state-of-the-art methods.
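The tag-based attention mechanism itself is not detailed in the summary; the sketch below only illustrates the generic bottom-up idea of grouping pixels around instance seeds by embedding similarity, with the threshold `tau` as an assumed hyperparameter:

```python
import numpy as np

def group_pixels(embeddings: np.ndarray, seeds: list, tau: float = 0.5) -> np.ndarray:
    """Assign each pixel to the nearest seed in embedding space.

    embeddings: (H, W, D) per-pixel embeddings from the network.
    seeds:      one (y, x) location per instance.
    Returns an (H, W) instance-id map (0 = background/unassigned).
    """
    h, w, d = embeddings.shape
    flat = embeddings.reshape(-1, d)
    seed_vecs = np.stack([embeddings[y, x] for y, x in seeds])       # (K, D)
    dist = np.linalg.norm(flat[:, None] - seed_vecs[None], axis=-1)  # (HW, K)
    ids = dist.argmin(axis=1) + 1
    ids[dist.min(axis=1) > tau] = 0                  # too far from every seed
    return ids.reshape(h, w)
```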
arXiv Detail & Related papers (2022-04-22T15:32:46Z)
- Human Instance Segmentation and Tracking via Data Association and Single-stage Detector [17.46922710432633]
Human video instance segmentation plays an important role in computer understanding of human activities.
Most current VIS methods are based on the Mask R-CNN framework.
We develop a new method for human video instance segmentation based on a single-stage detector.
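The summary does not specify the association step; a common choice, shown here as an assumption, is Hungarian matching on mask IoU between consecutive frames:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(prev_masks: list, curr_masks: list, iou_thresh: float = 0.3) -> dict:
    """Match current-frame detections to previous-frame tracks by mask IoU;
    pairs below iou_thresh stay unmatched and would start new tracks."""
    iou = np.zeros((len(prev_masks), len(curr_masks)))
    for i, p in enumerate(prev_masks):
        for j, c in enumerate(curr_masks):
            union = np.logical_or(p, c).sum()
            iou[i, j] = np.logical_and(p, c).sum() / union if union else 0.0
    rows, cols = linear_sum_assignment(-iou)          # maximize total IoU
    return {j: i for i, j in zip(rows, cols) if iou[i, j] >= iou_thresh}
```

Hungarian matching is a natural fit here because it resolves the one-to-one assignment globally rather than greedily, which matters when people overlap.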
arXiv Detail & Related papers (2022-03-31T11:36:09Z)
- CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation [67.17625278621134]
Video instance segmentation is a complex task in which we need to detect, segment, and track each object for any given video.
Previous approaches only utilize single-frame features for the detection, segmentation, and tracking of objects.
We propose a novel comprehensive feature aggregation approach (CompFeat) to refine features at both frame-level and object-level with temporal and spatial context information.
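A simplified sketch of attention-weighted temporal aggregation; CompFeat's actual frame-level and object-level design is richer than this single-step version:

```python
import torch
import torch.nn.functional as F

def aggregate_temporal(key_feat: torch.Tensor,
                       neighbor_feats: torch.Tensor) -> torch.Tensor:
    """Each spatial location of the key frame attends over the same location
    in neighboring frames and takes a similarity-weighted average.

    key_feat:       (C, H, W) features of the key frame.
    neighbor_feats: (T, C, H, W) features of nearby frames.
    """
    sim = (neighbor_feats * key_feat.unsqueeze(0)).sum(dim=1)   # (T, H, W)
    w = F.softmax(sim, dim=0).unsqueeze(1)                      # (T, 1, H, W)
    return (w * neighbor_feats).sum(dim=0)                      # (C, H, W)
```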
arXiv Detail & Related papers (2020-12-07T00:31:42Z)
- Coherent Loss: A Generic Framework for Stable Video Segmentation [103.78087255807482]
We investigate how a jittering artifact degrades the visual quality of video segmentation results.
We propose a Coherent Loss with a generic framework to enhance the performance of a neural network against jittering artifacts.
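The exact form of the Coherent Loss is not given in the summary; a minimal temporal-consistency penalty, ignoring the motion compensation a full treatment would add, could look like:

```python
import torch

def temporal_consistency_loss(preds: torch.Tensor) -> torch.Tensor:
    """Penalize frame-to-frame jitter in segmentation predictions.

    preds: (T, C, H, W) per-frame class probabilities. A fuller version
    would warp frame t-1 onto frame t with optical flow before differencing.
    """
    return (preds[1:] - preds[:-1]).abs().mean()
```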
arXiv Detail & Related papers (2020-10-25T10:48:28Z)