Track Anything Behind Everything: Zero-Shot Amodal Video Object Segmentation
- URL: http://arxiv.org/abs/2411.19210v1
- Date: Thu, 28 Nov 2024 15:30:56 GMT
- Title: Track Anything Behind Everything: Zero-Shot Amodal Video Object Segmentation
- Authors: Finlay G. C. Hudson, William A. P. Smith,
- Abstract summary: Track Anything Behind Everything (TABE) is a novel dataset, pipeline, and evaluation framework for zero-shot amodal completion from visible masks.
Unlike existing methods that require pretrained class labels, our approach uses a single query mask from the first frame where the object is visible.
Our dataset, TABE-51 provides highly accurate ground truth amodal segmentation masks without the need for human estimation or 3D reconstruction.
- Score: 15.272149101494005
- License:
- Abstract: We present Track Anything Behind Everything (TABE), a novel dataset, pipeline, and evaluation framework for zero-shot amodal completion from visible masks. Unlike existing methods that require pretrained class labels, our approach uses a single query mask from the first frame where the object is visible, enabling flexible, zero-shot inference. Our dataset, TABE-51 provides highly accurate ground truth amodal segmentation masks without the need for human estimation or 3D reconstruction. Our TABE pipeline is specifically designed to handle amodal completion, even in scenarios where objects are completely occluded. We also introduce a specialised evaluation framework that isolates amodal completion performance, free from the influence of traditional visual segmentation metrics.
Related papers
- LAC-Net: Linear-Fusion Attention-Guided Convolutional Network for Accurate Robotic Grasping Under the Occlusion [79.22197702626542]
This paper introduces a framework that explores amodal segmentation for robotic grasping in cluttered scenes.
We propose a Linear-fusion Attention-guided Convolutional Network (LAC-Net)
The results on different datasets show that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-08-06T14:50:48Z) - Object-level Scene Deocclusion [92.39886029550286]
We present a new self-supervised PArallel visible-to-COmplete diffusion framework, named PACO, for object-level scene deocclusion.
To train PACO, we create a large-scale dataset with 500k samples to enable self-supervised learning.
Experiments on COCOA and various real-world scenes demonstrate the superior capability of PACO for scene deocclusion, surpassing the state of the arts by a large margin.
arXiv Detail & Related papers (2024-06-11T20:34:10Z) - SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in
Videos by Prompt Denoising [37.216493829454706]
We explore the potential of applying the Segment Anything Model to track and segment objects in videos.
Specifically, we iteratively propagate the bounding box of each object's mask in the preceding frame as the prompt for the next frame.
To enhance SAM's denoising capability against position and size variations, we propose a multi-prompt strategy.
arXiv Detail & Related papers (2024-03-07T03:52:59Z) - Amodal Ground Truth and Completion in the Wild [84.54972153436466]
We use 3D data to establish an automatic pipeline to determine authentic ground truth amodal masks for partially occluded objects in real images.
This pipeline is used to construct an amodal completion evaluation benchmark, MP3D-Amodal, consisting of a variety of object categories and labels.
arXiv Detail & Related papers (2023-12-28T18:59:41Z) - TAO-Amodal: A Benchmark for Tracking Any Object Amodally [41.5396827282691]
We introduce TAO-Amodal, featuring 833 diverse categories in thousands of video sequences.
Our dataset includes textitamodal and modal bounding boxes for visible and partially or fully occluded objects, including those that are partially out of the camera frame.
arXiv Detail & Related papers (2023-12-19T18:58:40Z) - ALSO: Automotive Lidar Self-supervision by Occupancy estimation [70.70557577874155]
We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds.
The core idea is to train the model on a pretext task which is the reconstruction of the surface on which the 3D points are sampled.
The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information.
arXiv Detail & Related papers (2022-12-12T13:10:19Z) - Robust Visual Tracking by Segmentation [103.87369380021441]
Estimating the target extent poses a fundamental challenge in visual object tracking.
We propose a segmentation-centric tracking pipeline that produces a highly accurate segmentation mask.
Our tracker is able to better learn a target representation that clearly differentiates the target in the scene from background content.
arXiv Detail & Related papers (2022-03-21T17:59:19Z) - Self-Supervised Scene De-occlusion [186.89979151728636]
This paper investigates the problem of scene de-occlusion, which aims to recover the underlying occlusion ordering and complete the invisible parts of occluded objects.
We make the first attempt to address the problem through a novel and unified framework that recovers hidden scene structures without ordering and amodal annotations as supervisions.
Based on PCNet-M and PCNet-C, we devise a novel inference scheme to accomplish scene de-occlusion, via progressive ordering recovery, amodal completion and content completion.
arXiv Detail & Related papers (2020-04-06T16:31:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.