Learning Action-Effect Dynamics for Hypothetical Vision-Language
Reasoning Task
- URL: http://arxiv.org/abs/2212.03866v1
- Date: Wed, 7 Dec 2022 05:41:58 GMT
- Title: Learning Action-Effect Dynamics for Hypothetical Vision-Language
Reasoning Task
- Authors: Shailaja Keyur Sampat, Pratyay Banerjee, Yezhou Yang and Chitta Baral
- Abstract summary: We propose a novel learning strategy that can improve reasoning about the effects of actions.
We demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
- Score: 50.72283841720014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 'Actions' play a vital role in how humans interact with the world. Thus,
autonomous agents that would assist us in everyday tasks also require the
capability to perform 'Reasoning about Actions & Change' (RAC). This has been
an important research direction in Artificial Intelligence (AI) in general, but
the study of RAC with visual and linguistic inputs is relatively recent. The
CLEVR_HYP (Sampat et al., 2021) is one such testbed for hypothetical
vision-language reasoning with actions as the key focus. In this work, we
propose a novel learning strategy that can improve reasoning about the effects
of actions. We implement an encoder-decoder architecture to learn the
representation of actions as vectors. We combine the aforementioned
encoder-decoder architecture with existing modality parsers and a scene graph
question answering model to evaluate our proposed system on the CLEVR_HYP
dataset. We conduct thorough experiments to demonstrate the effectiveness of
our proposed approach and discuss its advantages over previous baselines in
terms of performance, data efficiency, and generalization capability.
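To make the encoder-decoder idea concrete, here is a minimal sketch: an action encoder maps a tokenized action description to a vector, and an effect decoder is trained to predict the post-action scene representation from the pre-action one plus that vector. The module names, dimensions, GRU/MLP choices, and MSE objective are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: learn action representations as vectors by predicting
# the effect of the action on a scene embedding. All sizes are assumptions.
import torch
import torch.nn as nn

class ActionEncoder(nn.Module):
    """Encodes a tokenized action description into a fixed-size vector."""
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, action_tokens):          # (batch, seq_len)
        _, h = self.rnn(self.embed(action_tokens))
        return h.squeeze(0)                    # (batch, hidden_dim)

class EffectDecoder(nn.Module):
    """Predicts the post-action scene embedding from scene + action vectors."""
    def __init__(self, scene_dim=128, hidden_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(scene_dim + hidden_dim, 256), nn.ReLU(),
            nn.Linear(256, scene_dim),
        )

    def forward(self, scene_vec, action_vec):
        return self.mlp(torch.cat([scene_vec, action_vec], dim=-1))

# One training step: regress the predicted post-action scene embedding onto
# the ground-truth one, so action vectors come to encode effect dynamics.
encoder, decoder = ActionEncoder(), EffectDecoder()
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))
action = torch.randint(0, 1000, (8, 12))       # dummy action token ids
scene_before = torch.randn(8, 128)             # dummy pre-action embeddings
scene_after = torch.randn(8, 128)              # dummy post-action embeddings
loss = nn.functional.mse_loss(decoder(scene_before, encoder(action)), scene_after)
opt.zero_grad(); loss.backward(); opt.step()
```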
Related papers
- Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting Interaction (MPI).
The experimental results demonstrate that MPI improves on the previous state of the art by 10% to 64% on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
- Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration [14.678931157058363]
We propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that effectively integrates knowledge from visual-language models to improve zero-shot HOI detection.
We develop an additive self-attention mechanism to generate more comprehensive visual representations (a generic sketch follows this entry).
Our model outperforms previous methods in various zero-shot and fully-supervised settings.
arXiv Detail & Related papers (2024-03-12T02:07:23Z)
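The summary names an additive self-attention mechanism but gives no details, so the following is a generic additive (Bahdanau-style) self-attention sketch for intuition only; the class name and all dimensions are assumptions, not KI2HOI's actual design.

```python
import torch
import torch.nn as nn

class AdditiveSelfAttention(nn.Module):
    """Generic additive (Bahdanau-style) self-attention over a feature set."""
    def __init__(self, dim=256, attn_dim=128):
        super().__init__()
        self.query = nn.Linear(dim, attn_dim)
        self.key = nn.Linear(dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, x):                       # x: (batch, n, dim)
        q = self.query(x).unsqueeze(2)          # (batch, n, 1, attn_dim)
        k = self.key(x).unsqueeze(1)            # (batch, 1, n, attn_dim)
        # scores from a learned nonlinear sum of query and key projections
        e = self.score(torch.tanh(q + k)).squeeze(-1)   # (batch, n, n)
        return torch.softmax(e, dim=-1) @ x     # re-weighted features

attn = AdditiveSelfAttention()
print(attn(torch.randn(2, 7, 256)).shape)       # torch.Size([2, 7, 256])
```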
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory [64.11870454160614]
We propose an efficient Adaptive HOI Detector with Concept-guided Memory (ADA-CM).
ADA-CM has two operating modes; the first is training-free, adapting without learning any new parameters (a hypothetical sketch follows this entry).
Our proposed method achieves results competitive with the state of the art on the HICO-DET and V-COCO datasets, with much less training time.
arXiv Detail & Related papers (2023-09-07T13:10:06Z)
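A "training-free" mode suggests adapting by caching features rather than by gradient updates. Below is a hypothetical sketch of such a concept memory; the class, its k-NN voting rule, and all sizes are assumptions, since the summary does not describe ADA-CM's internals.

```python
import torch

# Hypothetical "training-free" memory: cache labelled features once, then
# classify new features by similarity to the cached keys (no parameter updates).
class ConceptMemory:
    def __init__(self, keys: torch.Tensor, labels: torch.Tensor):
        self.keys = torch.nn.functional.normalize(keys, dim=-1)  # (m, d)
        self.labels = labels                                     # (m,) class ids

    def predict(self, feats: torch.Tensor, k: int = 5) -> torch.Tensor:
        feats = torch.nn.functional.normalize(feats, dim=-1)     # (n, d)
        sims = feats @ self.keys.T                               # cosine sims (n, m)
        topk = sims.topk(k, dim=-1).indices                      # (n, k)
        # majority vote among the k most similar cached exemplars
        return torch.mode(self.labels[topk], dim=-1).values

memory = ConceptMemory(torch.randn(100, 256), torch.randint(0, 10, (100,)))
print(memory.predict(torch.randn(4, 256)))                       # 4 class ids
```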
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)
- Towards A Unified Agent with Foundation Models [18.558328028366816]
We investigate how to embed and leverage the abilities of foundation models in Reinforcement Learning (RL) agents.
We design a framework that uses language as the core reasoning tool, exploring how this enables an agent to tackle a series of fundamental RL challenges.
We demonstrate substantial performance improvements over baselines in exploration efficiency and ability to reuse data from offline datasets.
arXiv Detail & Related papers (2023-07-18T22:37:30Z)
- Learning Action-Effect Dynamics from Pairs of Scene-graphs [50.72283841720014]
We propose a novel method that leverages the scene-graph representation of images to reason about the effects of actions described in natural language (a toy illustration follows this entry).
Compared to existing models, our approach is effective in terms of performance, data efficiency, and generalization capability.
arXiv Detail & Related papers (2022-12-07T03:36:37Z)
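To make the scene-graph idea concrete, here is a toy, hand-written illustration (not the authors' code) of how an action described in language could rewrite a CLEVR-style scene graph; the object ids and relation names are invented for the example.

```python
# Toy CLEVR-style scene graph as plain dicts, plus one hand-written action
# whose effect is to rewrite the moved object's spatial relations.
scene = {
    "objects": {"o1": {"shape": "cube", "color": "red"},
                "o2": {"shape": "sphere", "color": "blue"}},
    "relations": [("o1", "left_of", "o2")],
}

def apply_move_action(graph, obj_id, target_id):
    """Effect of 'move <obj> onto <target>': replace its spatial relations."""
    graph = {"objects": dict(graph["objects"]),
             "relations": [r for r in graph["relations"] if r[0] != obj_id]}
    graph["relations"].append((obj_id, "on_top_of", target_id))
    return graph

# "Move the red cube onto the blue sphere" -> updated post-action graph
print(apply_move_action(scene, "o1", "o2"))
```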
- Let's Go to the Alien Zoo: Introducing an Experimental Framework to Study Usability of Counterfactual Explanations for Machine Learning [6.883906273999368]
Counterfactual explanations (CFEs) have gained traction as a psychologically grounded approach to generate post-hoc explanations.
We introduce the Alien Zoo, an engaging, web-based and game-inspired experimental framework.
As a proof of concept, we demonstrate the practical efficacy and feasibility of this approach in a user study.
arXiv Detail & Related papers (2022-05-06T17:57:05Z)