Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction
- URL: http://arxiv.org/abs/2301.08951v4
- Date: Thu, 26 Oct 2023 10:07:02 GMT
- Title: Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction
- Authors: Chengmin Gao and Bin Li
- Abstract summary: We introduce a time-conditioned generative model for videos.
We show that the model can perform object-centric video decomposition, reconstruct the complete shapes of occluded objects, and predict novel views.
- Score: 4.79974591281424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When perceiving the world from multiple viewpoints, humans can
reason about complete objects in a compositional manner, even when an object is
completely occluded from certain viewpoints. Meanwhile, humans are able to
imagine novel views after observing a scene from multiple viewpoints. Despite
recent remarkable advances, multi-view object-centric learning still leaves
some problems unresolved: 1) the shapes of partially or completely occluded
objects cannot be reconstructed well; 2) novel-viewpoint prediction depends on
expensive viewpoint annotations rather than on implicit rules in view
representations. In this paper, we introduce a time-conditioned generative
model for videos. To reconstruct the complete shape of an object accurately, we
enhance the disentanglement between the latent representations of objects and
views: the latent representations of time-conditioned views are jointly
inferred with a Transformer and then fed into a sequential extension of Slot
Attention to learn object-centric representations. In addition, Gaussian
processes are employed as priors over the view latent variables for video
generation and novel-view prediction without viewpoint annotations. Experiments
on multiple datasets demonstrate that the proposed model can perform
object-centric video decomposition, reconstruct the complete shapes of occluded
objects, and predict novel views.
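As a rough illustration of the pipeline the abstract describes, the sketch below wires together its three ingredients: a Transformer that jointly infers time-conditioned view latents across frames, a sequential Slot Attention step that carries slots from frame to frame, and an RBF-kernel Gaussian process used as the prior over view latents. All module names, dimensions, kernels, and hyperparameters here are illustrative assumptions, not the authors' implementation (which the abstract does not specify).

```python
# Minimal sketch (not the authors' code): Transformer-inferred view latents
# -> sequential Slot Attention -> Gaussian-process prior over view latents.
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Bare-bones Slot Attention (Locatello et al., 2020), without the
    residual MLP, applied once per frame."""
    def __init__(self, num_slots=5, dim=64, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim, bias=False) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, inputs, slots=None):
        b, n, d = inputs.shape
        inputs = self.norm_in(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        if slots is None:  # sample initial slots; reusing them across frames gives the sequential extension
            slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(b, self.num_slots, d)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)  # slots compete per input location
            attn = attn / attn.sum(dim=-1, keepdim=True)
            slots = self.gru((attn @ v).reshape(-1, d), slots.reshape(-1, d)).reshape(b, self.num_slots, d)
        return slots

def gp_prior_sample(num_frames, dim, lengthscale=2.0):
    """Draw view latents z_1..z_T from a zero-mean GP with an RBF kernel over
    time, so nearby frames receive smoothly varying view representations."""
    t = torch.arange(num_frames, dtype=torch.float32)
    K = torch.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / lengthscale ** 2)
    L = torch.linalg.cholesky(K + 1e-4 * torch.eye(num_frames))  # jitter for stability
    return L @ torch.randn(num_frames, dim)  # one independent GP per latent dimension

# Toy forward pass: 8 frames, each a 16x16 feature map with 64 channels.
T, B, HW, D = 8, 1, 256, 64
frame_feats = torch.randn(T, B, HW, D)
view_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
time_embed = nn.Embedding(T, D)
tokens = frame_feats.mean(dim=2).transpose(0, 1) + time_embed(torch.arange(T))[None]
view_latents = view_encoder(tokens)  # (B, T, D): view latents jointly inferred over time

slot_attn, slots = SlotAttention(dim=D), None
for step in range(T):  # sequential extension: slots persist between frames
    conditioned = frame_feats[step] + view_latents[:, step, None, :]
    slots = slot_attn(conditioned, slots)
print(slots.shape)                  # torch.Size([1, 5, 64])
print(gp_prior_sample(T, D).shape)  # prior draw used for generation / prediction
```

The appeal of a GP prior in this role is that it is defined for any time index, so view latents at unobserved times can be sampled or interpolated, which is what would permit novel-view prediction without viewpoint annotations.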
Related papers
- Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases [69.46487306858789]
Conditional Autoregressive Slot Attention (CA-SA) is a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks.
We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks.
arXiv Detail & Related papers (2024-10-21T07:44:44Z)
- Unsupervised Object-Centric Learning from Multiple Unspecified Viewpoints [45.88397367354284]
We consider a novel problem of learning compositional scene representations from multiple unspecified viewpoints without using any supervision.
We propose a deep generative model which separates latent representations into a viewpoint-independent part and a viewpoint-dependent part to solve this problem.
Experiments on several specifically designed synthetic datasets have shown that the proposed method can effectively learn from multiple unspecified viewpoints.
arXiv Detail & Related papers (2024-01-03T15:09:25Z)
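The viewpoint-independent/viewpoint-dependent split described in the entry above (a near-identical summary appears in the 2021 entry below) can be pictured with a toy decoder: the same object latents are rendered under different view latents. Everything here, names and sizes included, is a hypothetical sketch, not that paper's model.

```python
# Toy illustration (assumed architecture) of separating latents into a
# viewpoint-independent part (objects) and a viewpoint-dependent part (view).
import torch
import torch.nn as nn

class SeparatedDecoder(nn.Module):
    def __init__(self, obj_dim=32, view_dim=8, out_dim=3 * 32 * 32):
        super().__init__()
        self.render = nn.Sequential(
            nn.Linear(obj_dim + view_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

    def forward(self, obj_latents, view_latent):
        # obj_latents: (K, obj_dim) -- shared across every viewpoint of the scene
        # view_latent: (view_dim,)  -- specific to one viewpoint
        k = obj_latents.shape[0]
        joint = torch.cat([obj_latents, view_latent.expand(k, -1)], dim=-1)
        return self.render(joint).sum(dim=0)  # naive additive composition of the K objects

decoder = SeparatedDecoder()
objects = torch.randn(4, 32)              # the same four objects...
img_a = decoder(objects, torch.randn(8))  # ...rendered under view latent A
img_b = decoder(objects, torch.randn(8))  # ...and under view latent B
```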
- UpFusion: Novel View Diffusion from Unposed Sparse View Observations [66.36092764694502]
UpFusion can perform novel view synthesis and infer 3D representations for an object given a sparse set of reference images.
We show that this mechanism allows generating high-fidelity novel views while improving the synthesis quality given additional (unposed) images.
arXiv Detail & Related papers (2023-12-11T18:59:55Z)
- Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
- Partial-View Object View Synthesis via Filtered Inversion [77.282967562509]
FINV learns shape priors by training a 3D generative model.
We show that FINV successfully synthesizes novel views of real-world objects.
arXiv Detail & Related papers (2023-04-03T00:59:31Z)
- AutoRF: Learning 3D Object Radiance Fields from Single View Observations [17.289819674602295]
AutoRF is a new approach for learning neural 3D object representations in which each object in the training set is observed from only a single view.
We show that our method generalizes well to unseen objects, even across different datasets of challenging real-world street scenes.
arXiv Detail & Related papers (2022-04-07T17:13:39Z)
- Unsupervised Learning of Compositional Scene Representations from Multiple Unspecified Viewpoints [41.07379505694274]
We consider a novel problem of learning compositional scene representations from multiple unspecified viewpoints without using any supervision.
We propose a deep generative model which separates latent representations into a viewpoint-independent part and a viewpoint-dependent part to solve this problem.
Experiments on several specifically designed synthetic datasets have shown that the proposed method is able to effectively learn from multiple unspecified viewpoints.
arXiv Detail & Related papers (2021-12-07T08:45:21Z)
- Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views [9.556376932449187]
Multi-View and Multi-Object Network (MulMON) is a method for learning accurate, object-centric representations of multi-object scenes by leveraging multiple views.
We show that MulMON better resolves spatial ambiguities than single-view methods.
arXiv Detail & Related papers (2021-11-13T13:54:28Z)
- Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z)
- Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to improved layout fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric that is better suited for multi-object images.
arXiv Detail & Related papers (2020-03-16T21:40:09Z)
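For the SceneFID entry above: the metric applies the standard Fréchet distance to features of per-object crops rather than whole images. The Fréchet computation below is the standard formula; the crop extraction and Inception feature step are omitted, and the random arrays merely stand in for such features (an assumption for illustration).

```python
# Standard Frechet distance, as used by FID and, per the summary above,
# applied to per-object crops in the spirit of SceneFID.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """Frechet distance between Gaussian fits to two feature sets of shape (N, D)."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2.0 * covmean))

# Random arrays standing in for Inception features of per-object crops.
real_crop_feats = np.random.randn(500, 64)
fake_crop_feats = np.random.randn(500, 64) + 0.5
print(frechet_distance(real_crop_feats, fake_crop_feats))
```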