Occluded Video Instance Segmentation
- URL: http://arxiv.org/abs/2102.01558v2
- Date: Wed, 3 Feb 2021 08:10:55 GMT
- Title: Occluded Video Instance Segmentation
- Authors: Jiyang Qi, Yan Gao, Yao Hu, Xinggang Wang, Xiaoyu Liu, Xiang Bai,
  Serge Belongie, Alan Yuille, Philip H.S. Torr, Song Bai
- Abstract summary: We collect a large-scale dataset called OVIS for occluded video instance segmentation.
OVIS consists of 296k high-quality instance masks from 25 semantic categories.
The highest AP achieved by state-of-the-art algorithms is only 14.4, which reveals that we are still at a nascent stage for understanding objects, instances, and videos in a real-world scenario.
- Score: 133.80567761430584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   Can our video understanding systems perceive objects when a heavy occlusion
exists in a scene?
  To answer this question, we collect a large-scale dataset called OVIS for
occluded video instance segmentation, that is, to simultaneously detect,
segment, and track instances in occluded scenes. OVIS consists of 296k
high-quality instance masks from 25 semantic categories, where object
occlusions usually occur. While our human vision systems can understand those
occluded instances by contextual reasoning and association, our experiments
suggest that current video understanding systems are not satisfying. On the
OVIS dataset, the highest AP achieved by state-of-the-art algorithms is only
14.4, which reveals that we are still at a nascent stage for understanding
objects, instances, and videos in a real-world scenario. Moreover, to
complement missing object cues caused by occlusion, we propose a plug-and-play
module called temporal feature calibration. Built upon MaskTrack R-CNN and
SipMask, we report an AP of 15.2 and 15.0 respectively. The OVIS dataset is
released at http://songbai.site/ovis, and the project code will be available
soon.
 
      
Related papers
- Temporally-Constrained Video Reasoning Segmentation and Automated Benchmark Construction [8.214041057237491]
 We introduce temporally-constrained video reasoning segmentation, a novel task formulation that requires models to implicitly infer when target objects become contextually relevant.
We also present TCVideoRS, a temporally-constrained video RS dataset containing 52 samples using the videos from the MVOR dataset.
 arXiv (2025-07-22T15:59:21Z)
- Disentangling spatio-temporal knowledge for weakly supervised object detection and segmentation in surgical video [10.287675722826028]
 This paper introduces Video Spatio-Temporal Disentanglement Networks (VDST-Net) to disentangle information using semi-decoupled temporal knowledge distillation to predict high-quality class activation maps (CAMs).
We demonstrate the efficacy of our framework on a public reference dataset and on a more challenging surgical video dataset where objects are, on average, present in less than 60% of annotated frames.
 arXiv (2024-07-22T16:52:32Z)
- VISA: Reasoning Video Object Segmentation via Large Language Models [64.33167989521357]
 We introduce a new task, Reasoning Video Object Segmentation (ReasonVOS).
This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities.
We introduce VISA (Video-based large language Instructed Assistant) to tackle ReasonVOS.
 arXiv (2024-07-16T02:29:29Z)
- OW-VISCap: Open-World Video Instance Segmentation and Captioning [95.6696714640357]
 We propose an approach to jointly segment, track, and caption previously seen or unseen objects in a video.
We generate rich descriptive and object-centric captions for each detected object via a masked attention augmented LLM input.
Our approach matches or surpasses state-of-the-art on three tasks.
 arXiv (2024-04-04T17:59:58Z)
- MDQE: Mining Discriminative Query Embeddings to Segment Occluded Instances on Challenging Videos [18.041697331616948]
 We propose to mine discriminative query embeddings (MDQE) to segment occluded instances on challenging videos.
The proposed MDQE is the first VIS method with per-clip input that achieves state-of-the-art results on challenging videos and competitive performance on simple videos.
 arXiv (2023-03-25T08:13:36Z)
- MOSE: A New Dataset for Video Object Segmentation in Complex Scenes [106.64327718262764]
 Video object segmentation (VOS) aims at segmenting a particular object throughout the entire video clip sequence.
The state-of-the-art VOS methods have achieved excellent performance (e.g., 90+% J&F) on existing datasets.
We collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to study tracking and segmenting objects in complex environments.
 arXiv (2023-02-03T17:20:03Z)
- VITA: Video Instance Segmentation via Object Token Association [56.17453513956142]
 VITA is a simple structure built on top of an off-the-shelf Transformer-based image instance segmentation model.
It accomplishes video-level understanding by associating frame-level object tokens without using spatio-temporal backbone features.
VITA achieves the state-of-the-art on VIS benchmarks with a ResNet-50 backbone: 4 AP, 49.8 AP-VIS 2019 & 2021 and 19.6 AP on OVIS.
 arXiv (2022-06-09T10:33:18Z)
- Human Instance Segmentation and Tracking via Data Association and Single-stage Detector [17.46922710432633]
 Human video instance segmentation plays an important role in computer understanding of human activities.
Most current VIS methods are based on the Mask R-CNN framework.
We develop a new method for human video instance segmentation based on a single-stage detector.
 arXiv (2022-03-31T11:36:09Z)
- Occluded Video Instance Segmentation: Dataset and ICCV 2021 Challenge [133.80567761430584]
 We collect a large-scale dataset called OVIS for video instance segmentation in the occluded scenario.
OVIS consists of 296k high-quality instance masks and 901 occluded scenes.
All baseline methods encounter a significant performance degradation of about 80% in the heavily occluded object group.
 arXiv (2021-11-15T17:59:03Z)
- Object Propagation via Inter-Frame Attentions for Temporally Stable Video Instance Segmentation [51.68840525174265]
 Video instance segmentation aims to detect, segment, and track objects in a video.
Current approaches extend image-level segmentation algorithms to the temporal domain.
We propose a video instance segmentation method that alleviates the problem due to missing detections.
 arXiv (2021-11-15T04:15:57Z)
- 1st Place Solution for YouTubeVOS Challenge 2021: Video Instance Segmentation [0.39146761527401414]
 Video Instance Segmentation (VIS) is a multi-task problem performing detection, segmentation, and tracking simultaneously.
We propose two modules, named Temporally Correlated Instance Segmentation (TCIS) and Bidirectional Tracking (BiTrack).
By combining these techniques with a bag of tricks, the network performance is significantly boosted compared to the baseline.
 arXiv (2021-06-12T00:20:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     