M$^3$-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation
- URL: http://arxiv.org/abs/2412.13803v2
- Date: Thu, 19 Dec 2024 12:31:34 GMT
- Title: M$^3$-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation
- Authors: Zixuan Chen, Jiaxin Li, Liming Tan, Yejie Guo, Junxuan Liang, Cewu Lu, Yong-Lu Li
- Abstract summary: We introduce the concept of phase in segmentation, which categorizes real-world objects based on their visual characteristics and potential morphological and appearance changes.
We present a new benchmark, Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation (M$^3$-VOS), to verify the ability of models to understand object phases.
We propose ReVOS, a new plug-and-play model that improves its performance by reversal refinement.
- Score: 51.82272563578793
- Abstract: Intelligent robots need to interact with diverse objects across various environments. The appearance and state of objects frequently undergo complex transformations depending on the object properties, e.g., phase transitions. However, in the vision community, segmenting dynamic objects with phase transitions is overlooked. In light of this, we introduce the concept of phase in segmentation, which categorizes real-world objects based on their visual characteristics and potential morphological and appearance changes. Then, we present a new benchmark, Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation (M$^3$-VOS), to verify the ability of models to understand object phases, which consists of 479 high-resolution videos spanning over 10 distinct everyday scenarios. It provides dense instance mask annotations that capture both object phases and their transitions. We evaluate state-of-the-art methods on M$^3$-VOS, yielding several key insights. Notably, current appearance-based approaches show significant room for improvement when handling objects with phase transitions. The inherent changes in disorder suggest that the predictive performance of the forward entropy-increasing process can be improved through a reverse entropy-reducing process. These findings lead us to propose ReVOS, a new plug-and-play model that improves its performance by reversal refinement. Our data and code will be publicly available at https://zixuan-chen.github.io/M-cubeVOS.github.io/.
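The abstract only hints at how reversal refinement works. The sketch below is a minimal, hypothetical Python illustration of the forward/backward idea it describes: a mask propagated forward through a phase transition is cross-checked against a second pass propagated backward in time from the final prediction. The `model` callable, the `propagate` helper, and the averaging fusion are assumptions for illustration, not the authors' ReVOS implementation.

```python
def propagate(model, frames, init_mask):
    """Propagate a mask through frames with a generic VOS backbone.

    `model(frame, prev_mask)` is a hypothetical callable that returns a
    soft foreground-probability map for `frame`; any mask-propagation
    VOS model could stand in here.
    """
    masks, prev = [], init_mask
    for frame in frames:
        prev = model(frame, prev)
        masks.append(prev)
    return masks


def reversal_refine(model, frames, init_mask, threshold=0.5):
    """Illustrative forward/backward refinement (not the official ReVOS).

    The forward pass follows the entropy-increasing transition; a second
    pass re-propagates backward from the final prediction, and the two
    soft masks are averaged per frame before thresholding.
    """
    fwd = propagate(model, frames, init_mask)
    # Re-propagate in reverse time order, seeded with the last forward mask.
    bwd = propagate(model, frames[::-1], fwd[-1])[::-1]
    return [((f + b) / 2) > threshold for f, b in zip(fwd, bwd)]
```

Averaging is the simplest possible fusion; the actual ReVOS refinement may weight, gate, or iterate the two passes differently.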
Related papers
- OSCaR: Object State Captioning and State Change Representation [52.13461424520107]
This paper introduces the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark.
OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections.
It sets a new testbed for evaluating multimodal large language models (MLLMs).
arXiv Detail & Related papers (2024-02-27T01:48:19Z)
- ROAM: Robust and Object-Aware Motion Generation Using Neural Pose Descriptors [73.26004792375556]
This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object.
We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object.
We demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects.
arXiv Detail & Related papers (2023-08-24T17:59:51Z)
- Breaking the "Object" in Video Object Segmentation [36.20167854011788]
We present a dataset for Video Object Segmentation under Transformations (VOST).
It consists of more than 700 high-resolution videos, captured in diverse environments, which are 21 seconds long on average and densely labeled with instance masks.
A careful, multi-step approach is adopted to ensure that these videos focus on complex object transformations, capturing their full temporal extent.
We show that existing methods struggle when applied to this novel task and that their main limitation lies in over-reliance on static appearance cues.
arXiv Detail & Related papers (2022-12-12T19:22:17Z)
- Learning Robust Dynamics through Variational Sparse Gating [18.476155786474358]
In environments with many objects, often only a small number of them are moving or interacting at the same time.
In this paper, we investigate integrating this inductive bias of sparse interactions into the latent dynamics of world models trained from pixels.
arXiv Detail & Related papers (2022-10-21T02:56:51Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Scalable Video Object Segmentation with Identification Mechanism [125.4229430216776]
This paper explores the challenges of achieving scalable and effective multi-object modeling for semi-supervised Video Object Segmentation (VOS).
We present two innovative approaches, Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST).
Our approaches surpass the state-of-the-art competitors and display exceptional efficiency and scalability consistently across all six benchmarks.
arXiv Detail & Related papers (2022-03-22T03:33:27Z)
- Object-Region Video Transformers [100.23380634952083]
We present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with object representations.
Our ORViT block consists of two object-level streams: appearance and dynamics.
We show strong improvement in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture.
arXiv Detail & Related papers (2021-10-13T17:51:46Z)