CoSIm: Commonsense Reasoning for Counterfactual Scene Imagination
- URL: http://arxiv.org/abs/2207.03961v1
- Date: Fri, 8 Jul 2022 15:28:23 GMT
- Title: CoSIm: Commonsense Reasoning for Counterfactual Scene Imagination
- Authors: Hyounghun Kim, Abhay Zala, Mohit Bansal
- Abstract summary: We introduce a new task/dataset called Commonsense Reasoning for Counterfactual Scene Imagination (CoSIm)
CoSIm is designed to evaluate the ability of AI systems to reason about scene change imagination.
- Score: 87.4797527628459
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As humans, we can modify our assumptions about a scene by imagining
alternative objects or concepts in our minds. For example, we can easily
anticipate the implications of the sun being overcast by rain clouds (e.g., the
street will get wet) and accordingly prepare for that. In this paper, we
introduce a new task/dataset called Commonsense Reasoning for Counterfactual
Scene Imagination (CoSIm) which is designed to evaluate the ability of AI
systems to reason about scene change imagination. In this task/dataset, models
are given an image and an initial question-response pair about the image. Next,
a counterfactual imagined scene change (in textual form) is applied, and the
model has to predict the new response to the initial question based on this
scene change. We collect 3.5K high-quality and challenging data instances, with
each instance consisting of an image, a commonsense question with a response, a
description of a counterfactual change, a new response to the question, and
three distractor responses. Our dataset contains various complex scene change
types (such as object addition/removal/state change, event description,
environment change, etc.) that require models to imagine many different
scenarios and reason about the changed scenes. We present a baseline model
based on a vision-language Transformer (i.e., LXMERT) and ablation studies.
Through human evaluation, we demonstrate a large human-model performance gap,
suggesting room for promising future work on this challenging counterfactual,
scene imagination task. Our code and dataset are publicly available at:
https://github.com/hyounghk/CoSIm
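As a concrete illustration of the task format described above (an image, a commonsense question with an initial response, a counterfactual change description, and four candidate new responses of which one is correct), here is a minimal Python sketch of how an instance and multiple-choice evaluation could be represented. The field names and the `score_candidate` stub are hypothetical assumptions for illustration, not the dataset's actual schema; the released data and LXMERT baseline are in the repository linked above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CoSImInstance:
    # Hypothetical field names; the real schema is defined in the CoSIm repository.
    image_path: str          # scene image
    question: str            # commonsense question about the image
    initial_response: str    # response before the imagined change
    change_description: str  # textual counterfactual scene change
    candidates: List[str]    # new response plus three distractors (4 total)
    answer_index: int        # index of the correct new response in `candidates`


def evaluate(instances: List[CoSImInstance],
             score_candidate: Callable[[CoSImInstance, str], float]) -> float:
    """Multiple-choice accuracy: pick the highest-scoring candidate per instance.

    `score_candidate` stands in for any model (e.g., a vision-language
    Transformer such as the LXMERT baseline) that scores a candidate new
    response given the image, question, initial response, and change text.
    """
    correct = 0
    for inst in instances:
        scores = [score_candidate(inst, cand) for cand in inst.candidates]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == inst.answer_index)
    return correct / max(len(instances), 1)
```

Under this framing, the paper's baseline would implement `score_candidate` with a vision-language Transformer; the sketch only fixes the evaluation protocol implied by the abstract (selecting one correct response among four candidates).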
Related papers
- The Change You Want to See (Now in 3D) [65.61789642291636]
The goal of this paper is to detect what has changed, if anything, between two "in the wild" images of the same 3D scene.
We contribute a change detection model that is trained entirely on synthetic data and is class-agnostic.
We release a new evaluation dataset consisting of real-world image pairs with human-annotated differences.
arXiv Detail & Related papers (2023-08-21T01:59:45Z)
- Neural Scene Chronology [79.51094408119148]
We aim to reconstruct a time-varying 3D model, capable of producing photo-realistic renderings with independent control of viewpoint, illumination, and time.
In this work, we represent the scene as a space-time radiance field with a per-image illumination embedding, where temporally-varying scene changes are encoded using a set of learned step functions.
arXiv Detail & Related papers (2023-06-13T17:59:58Z)
- Structured Generative Models for Scene Understanding [4.5053219193867395]
This paper argues for the use of structured generative models (SGMs) for the understanding of static scenes.
The SGM approach has the merits that it is compositional and generative, which lead to interpretability and editability.
Perhaps the most challenging problem for SGMs is inference of the objects, lighting and camera parameters, and scene inter-relationships from input consisting of a single or multiple images.
arXiv Detail & Related papers (2023-02-07T15:23:52Z)
- Finding Differences Between Transformers and ConvNets Using Counterfactual Simulation Testing [82.67716657524251]
We present a counterfactual framework that allows us to study the robustness of neural networks with respect to naturalistic variations.
Our method allows for a fair comparison of the robustness of recently released, state-of-the-art Convolutional Neural Networks and Vision Transformers.
arXiv Detail & Related papers (2022-11-29T18:59:23Z)
- RUST: Latent Neural Scene Representations from Unposed Imagery [21.433079925439234]
Inferring structure of 3D scenes from 2D observations is a fundamental challenge in computer vision.
Recently popularized approaches based on neural scene representations have achieved tremendous impact.
RUST (Really Unposed Scene representation Transformer) is a pose-free approach to novel view synthesis trained on RGB images alone.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- One-Shot Neural Fields for 3D Object Understanding [112.32255680399399]
We present a unified and compact scene representation for robotics.
Each object in the scene is depicted by a latent code capturing geometry and appearance.
This representation can be decoded for various tasks such as novel view rendering, 3D reconstruction, and stable grasp prediction.
arXiv Detail & Related papers (2022-10-21T17:33:14Z)
- Hallucinating Pose-Compatible Scenes [55.064949607528405]
We present a large-scale generative adversarial network for pose-conditioned scene generation.
We curate a massive meta-dataset containing over 19 million frames of humans in everyday environments.
We leverage our trained model for various applications: hallucinating pose-compatible scene(s) with or without humans, visualizing incompatible scenes and poses, placing a person from one generated image into another scene, and animating pose.
arXiv Detail & Related papers (2021-12-13T18:59:26Z)
- Stochastic Scene-Aware Motion Prediction [41.6104600038666]
We present a novel data-driven, stochastic motion synthesis method that models different styles of performing a given action with a target object.
Our method, called SAMP, for Scene-Aware Motion Prediction, generalizes to target objects of various geometries while enabling the character to navigate in cluttered scenes.
arXiv Detail & Related papers (2021-08-18T17:56:17Z)
- GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields [45.21191307444531]
Deep generative models allow for photorealistic image synthesis at high resolutions.
But for many applications, this is not enough: content creation also needs to be controllable.
Our key hypothesis is that incorporating a compositional 3D scene representation into the generative model leads to more controllable image synthesis.
arXiv Detail & Related papers (2020-11-24T14:14:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.