Rethinking Amodal Video Segmentation from Learning Supervised Signals
with Object-centric Representation
- URL: http://arxiv.org/abs/2309.13248v1
- Date: Sat, 23 Sep 2023 04:12:02 GMT
- Title: Rethinking Amodal Video Segmentation from Learning Supervised Signals
with Object-centric Representation
- Authors: Ke Fan, Jingshi Lei, Xuelin Qian, Miaopeng Yu, Tianjun Xiao, Tong He,
Zheng Zhang, Yanwei Fu
- Abstract summary: Video amodal segmentation is a challenging task in computer vision.
Recent studies have achieved promising performance by using motion flow to integrate information across frames under a self-supervised setting.
This paper presents a rethinking of previous works. In particular, we leverage supervised signals with object-centric representations.
- Score: 47.39455910191075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video amodal segmentation is a particularly challenging task in computer
vision, which requires deducing the full shape of an object from its visible
parts. Recently, some studies have achieved promising performance by using
motion flow to integrate information across frames under a self-supervised
setting. However, motion flow is clearly limited by two factors: moving
cameras and object deformation. This paper presents a rethinking of previous
works. In particular, we leverage supervised signals with object-centric
representation in \textit{real-world scenarios}. The underlying idea is that
the supervision signal of a specific object and the features from different
views can mutually benefit the deduction of its full mask in any specific
frame. We thus propose an Efficient object-centric Representation amodal
Segmentation (EoRaS) method. Specifically, beyond solely relying on
supervision signals, we design a translation module that projects image
features into the Bird's-Eye View (BEV), which introduces 3D information to
improve the quality of the current features. Furthermore, we propose a
temporal module built on multi-view fusion layers, equipped with a set of
object slots that interact with features from different views via an
attention mechanism to achieve sufficient completion of the object
representation. As a result, the full mask of the object can be decoded from
the image features updated by the object slots. Extensive experiments on both
real-world and synthetic benchmarks demonstrate the superiority of our
proposed method, which achieves state-of-the-art performance. Our code will
be released at \url{https://github.com/kfan21/EoRaS}.
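Since only the abstract is available here, the following is a minimal PyTorch sketch of the two components it names: a translation module that lifts image features into a BEV grid, and a fusion step in which learnable object slots attend over features from both views before the full mask is decoded. All class names, shapes, and hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the ideas named in the abstract; module names, shapes,
# and hyperparameters are assumptions, not the official EoRaS code.
import torch
import torch.nn as nn

class BEVTranslation(nn.Module):
    """Lift front-view image features into a Bird's-Eye-View grid via
    learned BEV queries (one plausible realisation of the translation
    module described in the abstract)."""
    def __init__(self, dim, bev_cells=256):
        super().__init__()
        self.bev_queries = nn.Parameter(torch.randn(bev_cells, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, img_feat):                   # img_feat: (B, HW, C)
        q = self.bev_queries.unsqueeze(0).expand(img_feat.size(0), -1, -1)
        bev, _ = self.attn(q, img_feat, img_feat)  # (B, bev_cells, C)
        return bev

class MultiViewFusion(nn.Module):
    """Object slots read from the concatenated front-view and BEV features,
    then write the completed object information back into the image
    features, from which the full (amodal) mask is decoded."""
    def __init__(self, dim, num_slots=8):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim))
        self.read = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.write = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.to_mask = nn.Linear(dim, 1)

    def forward(self, img_feat, bev_feat, hw):     # (B, HW, C), (B, G, C)
        slots = self.slots.unsqueeze(0).expand(img_feat.size(0), -1, -1)
        ctx = torch.cat([img_feat, bev_feat], dim=1)
        slots, _ = self.read(slots, ctx, ctx)      # slots gather both views
        upd, _ = self.write(img_feat, slots, slots)
        logits = self.to_mask(img_feat + upd)      # (B, HW, 1)
        return logits.transpose(1, 2).reshape(-1, 1, *hw)  # full-mask logits

# Example wiring for one 16x16 feature map with dim=64:
#   f = torch.randn(2, 256, 64)
#   bev = BEVTranslation(64)(f)
#   mask_logits = MultiViewFusion(64)(f, bev, (16, 16))  # supervise with GT
```

A real temporal module would run this over several frames and carry the slots forward across time; that recurrence is omitted here for brevity.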
Related papers
- ROAM: Robust and Object-Aware Motion Generation Using Neural Pose
Descriptors [73.26004792375556]
This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object.
We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object.
We demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects.
arXiv Detail & Related papers (2023-08-24T17:59:51Z)
- Unsupervised Video Object Segmentation via Prototype Memory Network [5.612292166628669]
Unsupervised video object segmentation aims to segment a target object in the video without a ground truth mask in the initial frame.
This challenge requires extracting features for the most salient common objects within a video sequence.
We propose a novel prototype memory network architecture to solve this problem (a generic illustration follows this entry).
arXiv Detail & Related papers (2022-09-08T11:08:58Z)
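The summary above names a prototype memory but gives no internals. Purely as a generic illustration of prototype matching for segmentation (an assumption, not the paper's architecture), one can pool features under a mask into prototype vectors, store them in a memory, and score query pixels by cosine similarity:

```python
# Generic prototype-matching sketch (hypothetical; not the paper's code).
import torch
import torch.nn.functional as F

def masked_avg_pool(feat, mask):
    """Pool a (C, H, W) feature map under a (H, W) soft mask into one
    prototype vector via masked average pooling."""
    w = mask.flatten()                                    # (HW,)
    f = feat.flatten(1)                                   # (C, HW)
    return (f * w).sum(-1) / w.sum().clamp(min=1e-6)      # (C,)

class PrototypeMemory:
    """Stores object prototypes and scores query pixels against them."""
    def __init__(self):
        self.protos = []                                  # list of (C,) vectors

    def write(self, feat, mask):
        self.protos.append(masked_avg_pool(feat, mask))

    def read(self, feat):                                 # feat: (C, H, W)
        p = F.normalize(torch.stack(self.protos), dim=-1) # (N, C)
        f = F.normalize(feat.flatten(1), dim=0)           # (C, HW)
        sim = p @ f                                       # cosine, (N, HW)
        return sim.max(dim=0).values.view(feat.shape[1:]) # (H, W) saliency

# mem = PrototypeMemory()
# mem.write(torch.randn(64, 32, 32), torch.rand(32, 32))  # reference frame
# score = mem.read(torch.randn(64, 32, 32))               # query frame
```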
- A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for close hierarchical interactions between object motion and appearance (a minimal sketch of such an asymmetric block follows this entry).
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
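The MAT block itself is only named above. The sketch below shows one plausible reading (an illustrative assumption, not MATNet's published implementation) in which motion features gate and spatially re-weight the appearance features but not the other way round, which is what makes the block asymmetric:

```python
# One plausible reading of an asymmetric motion->appearance attention block
# (an illustrative assumption, not MATNet's published code).
import torch
import torch.nn as nn

class MotionAttentiveTransition(nn.Module):
    """Motion features gate and spatially re-weight appearance features;
    the motion stream passes through unchanged, making the block
    asymmetric."""
    def __init__(self, dim):
        super().__init__()
        self.to_attn = nn.Conv2d(dim, dim, kernel_size=1)
        self.gate = nn.Sequential(nn.Conv2d(dim, dim, kernel_size=1),
                                  nn.Sigmoid())

    def forward(self, app, mot):                 # both (B, C, H, W)
        # Spatial attention map derived from the motion stream.
        attn = torch.softmax(self.to_attn(mot).flatten(2), dim=-1).view_as(mot)
        app = app * self.gate(mot) + app * attn  # motion-guided appearance
        return app, mot                          # motion left untouched

# block = MotionAttentiveTransition(64)
# a, m = block(torch.randn(1, 64, 28, 28), torch.randn(1, 64, 28, 28))
```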