Grounding Physical Concepts of Objects and Events Through Dynamic Visual
Reasoning
- URL: http://arxiv.org/abs/2103.16564v1
- Date: Tue, 30 Mar 2021 17:59:48 GMT
- Title: Grounding Physical Concepts of Objects and Events Through Dynamic Visual
Reasoning
- Authors: Zhenfang Chen, Jiayuan Mao, Jiajun Wu, Kwan-Yee Kenneth Wong, Joshua
B. Tenenbaum, Chuang Gan
- Abstract summary: We present the Dynamic Concept Learner (DCL), a unified framework that grounds physical objects and events from video and language.
DCL can detect and associate objects across frames, ground visual properties and physical events, understand the causal relationship between events, make future and counterfactual predictions, and leverage these representations for answering queries.
DCL achieves state-of-the-art performance on CLEVRER, a challenging causal video reasoning dataset, even without using ground-truth attributes and collision labels from simulations for training.
- Score: 84.90458333884443
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We study the problem of dynamic visual reasoning on raw videos. This is a
challenging problem; currently, state-of-the-art models often require dense
supervision on physical object properties and events from simulation, which are
impractical to obtain in real life. In this paper, we present the Dynamic
Concept Learner (DCL), a unified framework that grounds physical objects and
events from video and language. DCL first adopts a trajectory extractor to
track each object over time and to represent it as a latent, object-centric
feature vector. Building upon this object-centric representation, DCL learns to
approximate the dynamic interaction among objects using graph networks. DCL
further incorporates a semantic parser to parse questions into semantic
programs and, finally, a program executor to run the program to answer the
question, leveraging the learned dynamics model. After training, DCL can detect
and associate objects across frames, ground visual properties and physical
events, understand the causal relationship between events, make future and
counterfactual predictions, and leverage these extracted representations for
answering queries. DCL achieves state-of-the-art performance on CLEVRER, a
challenging causal video reasoning dataset, even without using ground-truth
attributes and collision labels from simulations for training. We further test
DCL on a newly proposed video-retrieval and event localization dataset derived
from CLEVRER, showing its strong generalization capacity.
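Illustrative sketch (not part of the paper): the abstract outlines a four-stage pipeline in which a trajectory extractor produces latent object-centric features, a graph network approximates object dynamics, a semantic parser maps questions to programs, and a program executor answers the question over the grounded concepts. The minimal Python sketch below shows one way such a neuro-symbolic pipeline could be wired together; all class and function names (DynamicsGNN, execute_program, concept_classifier) are hypothetical placeholders and do not reflect the authors' implementation.

```python
# Hypothetical sketch of a DCL-style neuro-symbolic pipeline (not the authors' code):
# object-centric features -> graph-network dynamics -> program execution.
import torch
import torch.nn as nn


class DynamicsGNN(nn.Module):
    """One message-passing step that approximates pairwise object interactions."""

    def __init__(self, dim: int):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (num_objects, dim) latent features from a trajectory extractor.
        n, d = obj_feats.shape
        senders = obj_feats.unsqueeze(0).expand(n, n, d)    # sender features j for each pair (i, j)
        receivers = obj_feats.unsqueeze(1).expand(n, n, d)  # receiver features i for each pair (i, j)
        messages = self.edge_mlp(torch.cat([receivers, senders], dim=-1)).sum(dim=1)
        return self.node_mlp(torch.cat([obj_feats, messages], dim=-1))  # predicted next-step features


def execute_program(program, obj_feats, concept_classifier):
    """Toy executor: runs a flat filter/count program over per-object features."""
    selected = list(range(obj_feats.shape[0]))  # start from all detected objects
    for op, arg in program:
        if op == "filter":
            # Keep objects whose predicted concept (e.g. "metal", "cube") matches the argument.
            selected = [i for i in selected if concept_classifier(obj_feats[i]) == arg]
        elif op == "count":
            return len(selected)
    return selected
```

For example, a question such as "How many metal objects are there?" would first be parsed into a program like [("filter", "metal"), ("count", None)] and then executed over the per-object features, with the dynamics module supplying predicted future states for predictive and counterfactual questions.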
Related papers
- 1st Place Solution for MOSE Track in CVPR 2024 PVUW Workshop: Complex Video Object Segmentation [72.54357831350762]
We propose a semantic embedding video object segmentation model and use the salient features of objects as query representations.
We trained our model on a large-scale video object segmentation dataset.
Our model achieves first place (84.45%) on the test set of the Complex Video Object Challenge.
arXiv Detail & Related papers (2024-06-07T03:13:46Z) - Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z) - Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in
Clutter [14.489086924126253]
This work focuses on the task of referring grasp synthesis, which predicts a grasp pose for an object referred through natural language in cluttered scenes.
Existing approaches often employ multi-stage pipelines that first segment the referred object and then propose a suitable grasp, and they are evaluated on private datasets or in simulators that do not capture the complexity of natural indoor scenes.
We propose a novel end-to-end model (CROG) that leverages the visual grounding capabilities of CLIP to learn grasp synthesis directly from image-text pairs.
arXiv Detail & Related papers (2023-11-09T22:55:10Z) - Kinematic-aware Prompting for Generalizable Articulated Object
Manipulation with LLMs [53.66070434419739]
Generalizable articulated object manipulation is essential for home-assistant robots.
We propose a kinematic-aware prompting framework that prompts Large Language Models with kinematic knowledge of objects to generate low-level motion waypoints.
Our framework outperforms traditional methods on 8 seen categories and shows powerful zero-shot capability on 8 unseen articulated object categories.
arXiv Detail & Related papers (2023-11-06T03:26:41Z) - MeViS: A Large-scale Benchmark for Video Segmentation with Motion
Expressions [93.35942025232943]
We propose a large-scale dataset called MeViS, which contains numerous motion expressions to indicate target objects in complex environments.
The goal of our benchmark is to provide a platform that enables the development of effective language-guided video segmentation algorithms.
arXiv Detail & Related papers (2023-08-16T17:58:34Z) - Leveraging Next-Active Objects for Context-Aware Anticipation in
Egocentric Videos [31.620555223890626]
We study the problem of short-term object interaction anticipation (STA).
We propose NAOGAT, a multi-modal end-to-end transformer network that leverages next-active objects to predict context-aware future actions.
Our model outperforms existing methods on two separate datasets.
arXiv Detail & Related papers (2023-08-16T12:07:02Z) - SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric
Models [30.313085784715575]
We introduce SlotFormer -- a Transformer-based autoregressive model on learned object-temporal representations.
In this paper, we successfully apply SlotFormer to perform prediction on datasets with complex object interactions.
We also show its ability to serve as a world model for model-based planning, which is competitive with methods designed specifically for such tasks.
arXiv Detail & Related papers (2022-10-12T01:53:58Z) - Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z) - Hierarchical Object-oriented Spatio-Temporal Reasoning for Video
Question Answering [27.979053252431306]
Video Question Answering (Video QA) is a powerful testbed to develop new AI capabilities.
We propose an object-oriented reasoning approach in which video is abstracted as a dynamic stream of interacting objects.
This mechanism is materialized into a family of general-purpose neural units and their multi-level architecture.
arXiv Detail & Related papers (2021-06-25T05:12:42Z)