HODOR: High-level Object Descriptors for Object Re-segmentation in Video
Learned from Static Images
- URL: http://arxiv.org/abs/2112.09131v1
- Date: Thu, 16 Dec 2021 18:59:53 GMT
- Title: HODOR: High-level Object Descriptors for Object Re-segmentation in Video
Learned from Static Images
- Authors: Ali Athar, Jonathon Luiten, Alexander Hermans, Deva Ramanan, Bastian
Leibe
- Abstract summary: We propose HODOR: a novel method that effectively leverages annotated static images for understanding object appearance and scene context.
As a result, HODOR achieves state-of-the-art performance on the DAVIS and YouTube-VOS benchmarks.
Without any architectural modification, HODOR can also learn from video context around single annotated video frames.
- Score: 123.65233334380251
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing state-of-the-art methods for Video Object Segmentation (VOS) learn
low-level pixel-to-pixel correspondences between frames to propagate object
masks across video. This requires a large amount of densely annotated video
data, which is costly to annotate, and largely redundant since frames within a
video are highly correlated. In light of this, we propose HODOR: a novel method
that tackles VOS by effectively leveraging annotated static images for
understanding object appearance and scene context. We encode object instances
and scene information from an image frame into robust high-level descriptors
which can then be used to re-segment those objects in different frames. As a
result, HODOR achieves state-of-the-art performance on the DAVIS and
YouTube-VOS benchmarks compared to existing methods trained without video
annotations. Without any architectural modification, HODOR can also learn from
video context around single annotated video frames by utilizing cyclic
consistency, whereas other methods rely on dense, temporally consistent
annotations.
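The core idea, encoding each annotated object into a compact descriptor and then re-segmenting it in other frames, can be illustrated with a toy sketch. This is not the paper's architecture (HODOR's encoder and decoder are learned networks); it is only a minimal, assumed stand-in that mimics the descriptor-and-resegment flow using masked average pooling and cosine similarity on synthetic features:

```python
import numpy as np

def encode_descriptor(feats, mask):
    """Pool frame features over an object mask into one high-level descriptor.
    feats: (H, W, C) feature map; mask: (H, W) boolean object mask."""
    return feats[mask].mean(axis=0)   # (C,) object descriptor

def resegment(feats, descriptors):
    """Re-segment objects in a (possibly different) frame: assign each pixel
    to the descriptor with the highest cosine similarity.
    descriptors: (K, C) stacked descriptors (e.g. background + objects)."""
    f = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    d = descriptors / (np.linalg.norm(descriptors, axis=-1, keepdims=True) + 1e-8)
    sims = f @ d.T                    # (H, W, K) pixel-descriptor similarities
    return sims.argmax(axis=-1)       # (H, W) per-pixel descriptor labels

# Toy frames: object pixels carry a distinct feature direction.
rng = np.random.default_rng(0)
H, W, C = 8, 8, 4
bg = np.array([1.0, 0.0, 0.0, 0.0])
obj = np.array([0.0, 0.0, 1.0, 0.0])
frame1 = np.tile(bg, (H, W, 1)) + 0.05 * rng.standard_normal((H, W, C))
mask1 = np.zeros((H, W), bool); mask1[2:5, 2:5] = True
frame1[mask1] += obj

# The same object appears at a different location in frame 2.
frame2 = np.tile(bg, (H, W, 1)) + 0.05 * rng.standard_normal((H, W, C))
mask2 = np.zeros((H, W), bool); mask2[4:7, 1:4] = True
frame2[mask2] += obj

descs = np.stack([encode_descriptor(frame1, ~mask1),   # background descriptor
                  encode_descriptor(frame1, mask1)])   # object descriptor
pred = resegment(frame2, descs)                        # (H, W) labels in frame 2
```

Because the descriptor summarizes the object's appearance rather than pixel-to-pixel matches, re-segmentation here needs no dense correspondence between frames, which is the intuition behind training from static images alone.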
Related papers
- Rethinking Image-to-Video Adaptation: An Object-centric Perspective [61.833533295978484]
We propose a novel and efficient image-to-video adaptation strategy from the object-centric perspective.
Inspired by human perception, we integrate a proxy task of object discovery into image-to-video transfer learning.
arXiv Detail & Related papers (2024-07-09T13:58:10Z)
- PM-VIS+: High-Performance Video Instance Segmentation without Video Annotation [15.9587266448337]
Video instance segmentation requires detecting, segmenting, and tracking objects in videos.
This paper introduces a method that eliminates video annotations by utilizing image datasets.
arXiv Detail & Related papers (2024-06-28T05:22:39Z)
- Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT).
We jointly adopt sparse point and box tracking, filtering out unstable points and capturing object-wise information.
Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
arXiv Detail & Related papers (2024-06-08T14:25:57Z)
- Learning the What and How of Annotation in Video Object Segmentation [11.012995995497029]
Video Object Segmentation (VOS) is crucial for several applications, from video editing to video data generation.
Traditionally, annotating objects requires humans to draw detailed segmentation masks on the target objects at each video frame.
We propose EVA-VOS, a human-in-the-loop annotation framework for video object segmentation.
arXiv Detail & Related papers (2023-11-08T00:56:31Z)
- Learning Referring Video Object Segmentation from Weak Annotation [78.45828085350936]
Referring video object segmentation (RVOS) is a task that aims to segment the target object in all video frames based on a sentence describing the object.
We propose a new annotation scheme that reduces the annotation effort by 8 times, while providing sufficient supervision for RVOS.
Our scheme only requires a mask for the frame where the object first appears and bounding boxes for the rest of the frames.
arXiv Detail & Related papers (2023-08-04T06:50:52Z)
- Breaking the "Object" in Video Object Segmentation [36.20167854011788]
We present a dataset for Video Object Segmentation under Transformations (VOST).
It consists of more than 700 high-resolution videos, captured in diverse environments, which are 21 seconds long on average and densely labeled with instance masks.
A careful, multi-step approach is adopted to ensure that these videos focus on complex object transformations, capturing their full temporal extent.
We show that existing methods struggle when applied to this novel task and that their main limitation lies in over-reliance on static appearance cues.
arXiv Detail & Related papers (2022-12-12T19:22:17Z)
- Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens [93.98605636451806]
StructureViT shows how utilizing the structure of a small number of images only available during training can improve a video model.
SViT shows strong performance improvements on multiple video understanding tasks and datasets.
arXiv Detail & Related papers (2022-06-13T17:45:05Z)
- Learning Video Object Segmentation from Unlabeled Videos [158.18207922363783]
We propose a new method for video object segmentation (VOS) that addresses object pattern learning from unlabeled videos.
We introduce a unified unsupervised/weakly supervised learning framework, called MuG, that comprehensively captures properties of VOS at multiple granularities.
arXiv Detail & Related papers (2020-03-10T22:12:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.