Breaking the "Object" in Video Object Segmentation
- URL: http://arxiv.org/abs/2212.06200v2
- Date: Tue, 28 Mar 2023 16:51:28 GMT
- Title: Breaking the "Object" in Video Object Segmentation
- Authors: Pavel Tokmakov, Jie Li, Adrien Gaidon
- Abstract summary: We present a dataset for Video Object Segmentation under Transformations (VOST).
It consists of more than 700 high-resolution videos, captured in diverse environments, which are 21 seconds long on average and densely labeled with instance masks.
A careful, multi-step approach is adopted to ensure that these videos focus on complex object transformations, capturing their full temporal extent.
We show that existing methods struggle when applied to this novel task and that their main limitation lies in over-reliance on static appearance cues.
- Score: 36.20167854011788
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The appearance of an object can be fleeting when it transforms. As eggs are
broken or paper is torn, their color, shape and texture can change
dramatically, preserving virtually nothing of the original except for the
identity itself. Yet, this important phenomenon is largely absent from existing
video object segmentation (VOS) benchmarks. In this work, we close the gap by
collecting a new dataset for Video Object Segmentation under Transformations
(VOST). It consists of more than 700 high-resolution videos, captured in
diverse environments, which are 21 seconds long on average and densely labeled
with instance masks. A careful, multi-step approach is adopted to ensure that
these videos focus on complex object transformations, capturing their full
temporal extent. We then extensively evaluate state-of-the-art VOS methods and
make a number of important discoveries. In particular, we show that existing
methods struggle when applied to this novel task and that their main limitation
lies in over-reliance on static appearance cues. This motivates us to propose a
few modifications for the top-performing baseline that improve its capabilities
by better modeling spatio-temporal information. But more broadly, the hope is
to stimulate discussion on learning more robust video object representations.
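Since the paper's evaluation centers on mask quality over time, a minimal sketch of the standard VOS region-similarity metric (J, i.e., per-frame mask IoU averaged over a video) may help make the protocol concrete. The function names and toy masks below are illustrative and not taken from the VOST codebase:

```python
import numpy as np

def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """Per-frame J score: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:  # both masks empty: count as a perfect match
        return 1.0
    return np.logical_and(pred, gt).sum() / union

def video_j_score(preds, gts) -> float:
    """Mean J over all annotated frames of one video."""
    return float(np.mean([region_similarity(p, g) for p, g in zip(preds, gts)]))

# Toy example: an object that shrinks after a "transformation".
gt0 = np.zeros((8, 8), int); gt0[2:6, 2:6] = 1
gt1 = np.zeros((8, 8), int); gt1[3:5, 3:5] = 1
preds = [gt0, gt0]  # a static-appearance model that never updates its mask
print(video_j_score(preds, [gt0, gt1]))  # drops on the frame where the object changed
```

The toy prediction mimics the failure mode the paper identifies: a model that copies the initial appearance scores perfectly before the transformation and poorly after it.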
Related papers
- Self-supervised Video Object Segmentation with Distillation Learning of Deformable Attention [29.62044843067169]
Video object segmentation is a fundamental research problem in computer vision.
We propose a new method for self-supervised video object segmentation based on distillation learning of deformable attention.
arXiv Detail & Related papers (2024-01-25T04:39:48Z)
- TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking [33.75267864844047]
Video Object Segmentation (VOS) has emerged as an increasingly important problem with the availability of larger datasets and more complex and realistic settings.
We propose a novel, clip-based DETR-style encoder-decoder architecture, which systematically analyzes and addresses the aforementioned challenges.
Specifically, we propose a novel transformation-aware loss that focuses learning on portions of the video where an object undergoes significant deformations (a toy illustration of such a weighting follows this entry).
arXiv Detail & Related papers (2023-12-13T21:02:03Z)
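The transformation-aware loss above is only described at a high level; as a rough, hypothetical illustration (not the paper's actual formulation), one could up-weight the per-frame loss wherever consecutive ground-truth masks disagree, so frames with large deformations dominate training:

```python
import numpy as np

def deformation_weight(prev_mask, cur_mask, floor=1.0, scale=4.0):
    """Weight a frame by how much the GT mask changed since the previous frame.

    1 - IoU(prev, cur) is 0 for a static object and approaches 1 when the
    object deforms heavily; `floor` keeps every frame contributing.
    """
    inter = np.logical_and(prev_mask, cur_mask).sum()
    union = np.logical_or(prev_mask, cur_mask).sum()
    change = 1.0 - (inter / union if union else 1.0)
    return floor + scale * change

def weighted_video_loss(per_frame_losses, gt_masks):
    """Average per-frame losses, emphasizing frames with large deformations."""
    weights = [1.0]  # no previous frame for t = 0
    weights += [deformation_weight(gt_masks[t - 1], gt_masks[t])
                for t in range(1, len(gt_masks))]
    weights = np.asarray(weights)
    return float((weights * np.asarray(per_frame_losses)).sum() / weights.sum())
```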
- Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods, and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
- InstMove: Instance Motion for Object-centric Video Segmentation [70.16915119724757]
In this work, we study instance-level motion and present InstMove, which stands for Instance Motion for Object-centric Video Segmentation.
In comparison to pixel-wise motion, InstMove mainly relies on instance-level motion information that is free from image feature embeddings.
With only a few lines of code, InstMove can be integrated into current SOTA methods for three different video segmentation tasks (a simplified motion-propagation sketch follows this entry).
arXiv Detail & Related papers (2023-03-14T17:58:44Z)
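To show why instance-level motion is independent of image appearance, here is a deliberately crude stand-in (not the learned motion model InstMove actually uses) that propagates a mask by replaying its most recent centroid translation; all names are hypothetical:

```python
import numpy as np

def centroid(mask):
    """Center of mass of a non-empty binary mask."""
    ys, xs = np.nonzero(mask)
    return ys.mean(), xs.mean()

def propagate_by_motion(mask_t1, mask_t2):
    """Predict the mask at t+1 by replaying the instance's last translation.

    Uses only the masks themselves, never the image pixels, so it is
    unaffected by appearance changes such as lighting or texture.
    """
    dy, dx = np.subtract(centroid(mask_t2), centroid(mask_t1))
    pred = np.zeros_like(mask_t2)
    ys, xs = np.nonzero(mask_t2)
    ys = np.clip(np.round(ys + dy).astype(int), 0, mask_t2.shape[0] - 1)
    xs = np.clip(np.round(xs + dx).astype(int), 0, mask_t2.shape[1] - 1)
    pred[ys, xs] = 1
    return pred
```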
- Towards Robust Video Object Segmentation with Adaptive Object Calibration [18.094698623128146]
Video object segmentation (VOS) aims at segmenting objects in all target frames of a video, given annotated object masks of reference frames.
We propose a new deep network, which can adaptively construct object representations and calibrate object masks to achieve stronger robustness.
Our model achieves the state-of-the-art performance among existing published works, and also exhibits superior robustness against perturbations.
arXiv Detail & Related papers (2022-07-02T17:51:29Z)
- The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation [18.630453674396534]
ReferFormer aims to segment object instances in a given video referred by a language expression in all video frames.
This work proposes several tricks to further boost performance, including cyclical learning rates, a semi-supervised approach, and test-time augmentation at inference (a minimal cyclical schedule is sketched after this entry).
The improved ReferFormer ranks 2nd in the CVPR 2022 Referring Youtube-VOS Challenge.
arXiv Detail & Related papers (2022-06-24T02:15:06Z)
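Of the tricks listed above, the cyclical learning rate is the easiest to make concrete. A minimal triangular schedule in the style of Smith (2017) looks like the following; the hyper-parameter values are illustrative, not the team's actual settings:

```python
def triangular_lr(step, base_lr=1e-5, max_lr=1e-4, cycle_len=2000):
    """Triangular cyclical learning rate.

    The LR ramps linearly from base_lr to max_lr over half a cycle,
    then back down, and repeats; cycling can help escape sharp minima.
    """
    pos = (step % cycle_len) / cycle_len  # position in [0, 1)
    tri = 1.0 - abs(2.0 * pos - 1.0)      # 0 -> 1 -> 0 over the cycle
    return base_lr + (max_lr - base_lr) * tri

# e.g. inside a training loop:
# for step in range(total_steps):
#     lr = triangular_lr(step)
#     ...
```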
- HODOR: High-level Object Descriptors for Object Re-segmentation in Video Learned from Static Images [123.65233334380251]
We propose HODOR: a novel method that effectively leverages annotated static images for understanding object appearance and scene context.
As a result, HODOR achieves state-of-the-art performance on the DAVIS and YouTube-VOS benchmarks.
Without any architectural modification, HODOR can also learn from video context around single annotated video frames.
arXiv Detail & Related papers (2021-12-16T18:59:53Z)
- Occlusion-Aware Video Object Inpainting [72.38919601150175]
This paper presents occlusion-aware video object inpainting, which recovers both the complete shape and appearance for occluded objects in videos.
Our technical contribution VOIN jointly performs video object shape completion and occluded texture generation.
For more realistic results, VOIN is optimized using both T-PatchGAN and a new spatio-temporal attention-based multi-class discriminator.
arXiv Detail & Related papers (2021-08-15T15:46:57Z)
- Few-Shot Learning for Video Object Detection in a Transfer-Learning Scheme [70.45901040613015]
We study the new problem of few-shot learning for video object detection.
We employ a transfer-learning framework to effectively train the video object detector on a large number of base-class objects and a few video clips of novel-class objects.
arXiv Detail & Related papers (2021-03-26T20:37:55Z)
- Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos (a simplified box-prior sketch follows this entry).
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization performance leading to state-of-the-art results in both the VOS and more challenging tracking domain.
arXiv Detail & Related papers (2021-01-06T18:56:24Z)
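The summary does not spell out the mining procedure, so as a hypothetical illustration of the starting point for such a pipeline, per-frame boxes can be turned into coarse mask priors that a VOS network then refines under temporal-consistency constraints. All helper names here are made up for the sketch:

```python
import numpy as np

def box_to_prior(box, shape):
    """Turn one (y0, x0, y1, x1) box into a coarse soft mask prior.

    Pixels inside the box get weight 1, everything else 0; a real pipeline
    would refine these priors rather than use them directly.
    """
    prior = np.zeros(shape, dtype=np.float32)
    y0, x0, y1, x1 = box
    prior[y0:y1, x0:x1] = 1.0
    return prior

def temporal_consistency(mask_a, mask_b):
    """IoU of masks in adjacent frames -- one simple consistency signal."""
    inter = np.logical_and(mask_a > 0.5, mask_b > 0.5).sum()
    union = np.logical_or(mask_a > 0.5, mask_b > 0.5).sum()
    return inter / union if union else 1.0
```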