Towards Visual Foundational Models of Physical Scenes
- URL: http://arxiv.org/abs/2306.03727v1
- Date: Tue, 6 Jun 2023 14:45:44 GMT
- Title: Towards Visual Foundational Models of Physical Scenes
- Authors: Chethan Parameshwara, Alessandro Achille, Matthew Trager, Xiaolong Li,
Jiawei Mo, Ashwin Swaminathan, CJ Taylor, Dheera Venkatraman,
Xiaohan Fei, Stefano Soatto
- Abstract summary: We describe a first step towards learning general-purpose visual representations of physical scenes using only image prediction as a training criterion.
We first define "physical scene" and show that, even though different agents may maintain different representations of the same scene, the underlying physical scene that can be inferred is unique.
- Score: 107.40546386739422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe a first step towards learning general-purpose visual
representations of physical scenes using only image prediction as a training
criterion. To do so, we first define "physical scene" and show that, even
though different agents may maintain different representations of the same
scene, the underlying physical scene that can be inferred is unique. Then, we
show that NeRFs cannot represent the physical scene, as they lack extrapolation
mechanisms. Those, however, could be provided by Diffusion Models, at least in
theory. To test this hypothesis empirically, we combine NeRFs with Diffusion
Models, a process we refer to as NeRF Diffusion, and use the result as an
unsupervised representation of the physical scene. Our analysis is limited to
visual data, without external grounding mechanisms that can be provided by
independent sensory modalities.
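The NeRF-plus-diffusion idea in the abstract can be caricatured in a short, heavily hedged sketch: a differentiable renderer (standing in for a NeRF) is optimized against a score supplied by a generative prior (standing in for a diffusion model), in the spirit of score distillation. Every name below (`render`, `diffusion_score`, `sds_step`) is an illustrative stand-in, not the paper's actual method.

```python
# Toy sketch (hypothetical, not the paper's implementation): a diffusion-style
# prior supplies an extrapolation signal that a NeRF alone lacks, by providing
# a score that the rendered images are pushed along.

def render(params):
    # Stand-in "NeRF": maps scene parameters to a flat image (list of pixels).
    return [p * 2.0 for p in params]

def diffusion_score(image, prior_mean=1.0):
    # Stand-in diffusion prior: a score pointing toward higher-likelihood
    # images. A real model would be a trained denoiser at a noised input.
    return [prior_mean - px for px in image]

def sds_step(params, lr=0.1):
    # Score-distillation-style update: push rendered pixels along the score,
    # then back through the (here, linear) renderer: d(pixel)/d(param) = 2.0.
    image = render(params)
    score = diffusion_score(image)
    return [p + lr * s * 2.0 for p, s in zip(params, score)]

params = [0.0, 0.5, 1.0]
for _ in range(100):
    params = sds_step(params)
# Rendered pixels converge toward the prior mean (1.0).
```

The design point is only that the prior, not the renderer, supplies the training signal; any differentiable renderer could take the place of `render` in this sketch.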
Related papers
- Toward a Diffusion-Based Generalist for Dense Vision Tasks [141.03236279493686]
Recent works have shown image itself can be used as a natural interface for general-purpose visual perception.
We propose to perform diffusion in pixel space and provide a recipe for finetuning pre-trained text-to-image diffusion models for dense vision tasks.
In experiments, we evaluate our method on four different types of tasks and show competitive performance to the other vision generalists.
arXiv Detail & Related papers (2024-06-29T17:57:22Z) - DiffusionAct: Controllable Diffusion Autoencoder for One-shot Face Reenactment [34.821255203019554]
Video-driven neural face reenactment aims to synthesize realistic facial images that successfully preserve the identity and appearance of a source face.
Recent advances in Diffusion Probabilistic Models (DPMs) enable the generation of high-quality realistic images.
We present DiffusionAct, a novel method that leverages the photo-realistic image generation of diffusion models to perform neural face reenactment.
arXiv Detail & Related papers (2024-03-25T21:46:53Z) - Diffusion Priors for Dynamic View Synthesis from Monocular Videos [59.42406064983643]
Dynamic novel view synthesis aims to capture the temporal evolution of visual content within videos.
We first finetune a pretrained RGB-D diffusion model on the video frames using a customization technique.
We distill the knowledge from the finetuned model to a 4D representation encompassing both dynamic and static Neural Radiance Fields.
arXiv Detail & Related papers (2024-01-10T23:26:41Z) - Prediction of Scene Plausibility [11.641785968519114]
Plausibility can be defined both in terms of physical properties and in terms of functional and typical arrangements.
We build a dataset of synthetic images containing both plausible and implausible scenes.
We test the success of various vision models in the task of recognizing and understanding plausibility.
arXiv Detail & Related papers (2022-12-02T22:22:16Z) - Neural Groundplans: Persistent Neural Scene Representations from a Single Image [90.04272671464238]
We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations.
arXiv Detail & Related papers (2022-07-22T17:41:24Z) - Neural Implicit Representations for Physical Parameter Inference from a Single Video [49.766574469284485]
We propose to combine neural implicit representations for appearance modeling with neural ordinary differential equations (ODEs) for modelling physical phenomena.
Our proposed model combines several unique advantages: (i) Contrary to existing approaches that require large training datasets, we are able to identify physical parameters from only a single video.
The use of neural implicit representations enables the processing of high-resolution videos and the synthesis of photo-realistic images.
arXiv Detail & Related papers (2022-04-29T11:55:35Z) - A model for full local image interpretation [8.048166434189522]
We describe a computational model of humans' ability to provide a detailed interpretation of components in a scene.
Our model suggests that this is a fundamental limitation, related to the fact that existing models rely on feed-forward but limited top-down processing.
We discuss implications of the model for visual interpretation by humans and by computer vision models.
arXiv Detail & Related papers (2021-10-17T07:20:53Z) - Learning to Identify Physical Parameters from Video Using Differentiable Physics [2.15242029196761]
We propose a differentiable physics engine within an action-conditional video representation network to learn a physical latent representation.
We demonstrate that our network can learn to encode images and identify physical properties like mass and friction from videos and action sequences.
arXiv Detail & Related papers (2020-09-17T13:36:57Z) - Visual Grounding of Learned Physical Models [66.04898704928517]
Humans intuitively recognize objects' physical properties and predict their motion, even when the objects are engaged in complicated interactions.
We present a neural model that simultaneously reasons about physics and makes future predictions based on visual and dynamics priors.
Experiments show that our model can infer the physical properties within a few observations, which allows the model to quickly adapt to unseen scenarios and make accurate predictions into the future.
arXiv Detail & Related papers (2020-04-28T17:06:38Z)
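The physical-parameter identification theme running through the last few entries (differentiable physics, inferring mass or friction from video) can be illustrated with a minimal toy, assuming a sliding-block model and a grid search; it is not any listed paper's actual pipeline, and all names and constants are illustrative.

```python
# Hedged toy: recover a friction coefficient from an observed trajectory
# using a forward physics model, analogous to fitting a differentiable
# physics engine to positions extracted from video frames.
g_accel, dt = 9.81, 0.1  # gravity (m/s^2) and integration step (s)

def rollout(mu, v0=5.0, steps=20):
    # Forward "physics engine": a block decelerating under kinetic friction.
    x, v, xs = 0.0, v0, []
    for _ in range(steps):
        xs.append(x)
        x += v * dt
        v = max(0.0, v - mu * g_accel * dt)
    return xs

observed = rollout(0.3)  # pretend: positions extracted from video frames

# The forward model is (piecewise) differentiable, so gradients would work;
# a grid search over candidate coefficients keeps the sketch dependency-free.
best_mu = min(range(1, 100),
              key=lambda m: sum((a - b) ** 2
                                for a, b in zip(rollout(m / 100), observed))) / 100
```

Replacing the grid search with gradient descent through `rollout` is what makes the "differentiable physics" framing of the papers above more than a search problem.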
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.