Learning Action-Effect Dynamics from Pairs of Scene-graphs
- URL: http://arxiv.org/abs/2212.03433v1
- Date: Wed, 7 Dec 2022 03:36:37 GMT
- Title: Learning Action-Effect Dynamics from Pairs of Scene-graphs
- Authors: Shailaja Keyur Sampat, Pratyay Banerjee, Yezhou Yang and Chitta Baral
- Abstract summary: We propose a novel method that leverages scene-graph representation of images to reason about the effects of actions described in natural language.
Our proposed approach is effective in terms of performance, data efficiency, and generalization capability compared to existing models.
- Score: 50.72283841720014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 'Actions' play a vital role in how humans interact with the world. Thus,
autonomous agents that would assist us in everyday tasks also require the
capability to perform 'Reasoning about Actions & Change' (RAC). Recently, there
has been growing interest in the study of RAC with visual and linguistic
inputs. Graphs are often used to represent the semantic structure of the visual
content (i.e., objects, their attributes, and relationships among objects),
commonly referred to as scene-graphs. In this work, we propose a novel method
that leverages scene-graph representation of images to reason about the effects
of actions described in natural language. We experiment with the existing
CLEVR_HYP (Sampat et al., 2021) dataset and show that our proposed approach is effective
in terms of performance, data efficiency, and generalization capability
compared to existing models.
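The abstract describes a scene-graph as a structure holding objects, their attributes, and the relationships among objects, which an action then updates. As a rough illustration (not the paper's actual model or data format), a minimal scene-graph with a relation-edit "action effect" might be sketched as follows; the class names, attribute keys, and the `apply_effect` helper are all illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    attributes: dict  # e.g. {"color": "red", "shape": "cube"}

@dataclass
class SceneGraph:
    objects: dict = field(default_factory=dict)  # name -> SceneObject
    relations: set = field(default_factory=set)  # (subject, predicate, object) triples

    def apply_effect(self, add=(), remove=()):
        """Apply an action's effect as edits to the relation set."""
        self.relations -= set(remove)
        self.relations |= set(add)

# Initial scene: a red cube to the left of a blue sphere.
g = SceneGraph(
    objects={
        "cube1": SceneObject("cube1", {"color": "red", "shape": "cube"}),
        "sphere1": SceneObject("sphere1", {"color": "blue", "shape": "sphere"}),
    },
    relations={("cube1", "left_of", "sphere1")},
)

# Effect of "put the red cube on the blue sphere": swap one spatial relation
# for another, leaving objects and attributes unchanged.
g.apply_effect(
    add=[("cube1", "on_top_of", "sphere1")],
    remove=[("cube1", "left_of", "sphere1")],
)
print(g.relations)
```

In this toy view, reasoning about action effects amounts to predicting which graph edits an action description induces; the paper learns such dynamics from pairs of scene-graphs rather than hand-coded edit rules.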
Related papers
- Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration [14.678931157058363]
We propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that effectively integrates the knowledge of visual-language model to improve zero-shot HOI detection.
We develop an effective additive self-attention mechanism to generate more comprehensive visual representations.
Our model outperforms the previous methods in various zero-shot and full-supervised settings.
arXiv Detail & Related papers (2024-03-12T02:07:23Z)
- DEVIAS: Learning Disentangled Video Representations of Action and Scene for Holistic Video Understanding [3.336126457178601]
We propose Disentangled VIdeo representations of Action and Scene (DEVIAS) to achieve holistic video understanding.
Our proposed method shows favorable performance across different datasets compared to the baselines.
arXiv Detail & Related papers (2023-11-30T18:58:44Z)
- Free-Form Composition Networks for Egocentric Action Recognition [97.02439848145359]
We propose a free-form composition network (FFCN) that can simultaneously learn disentangled verb, preposition, and noun representations.
The proposed FFCN can directly generate new training data samples for rare classes, hence significantly improve action recognition performance.
arXiv Detail & Related papers (2023-07-13T02:22:09Z)
- Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task [50.72283841720014]
We propose a novel learning strategy that can improve reasoning about the effects of actions.
We demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
arXiv Detail & Related papers (2022-12-07T05:41:58Z)
- Efficient Multi-Modal Embeddings from Structured Data [0.0]
Multi-modal word semantics aims to enhance embeddings with perceptual input.
Visual grounding can contribute to linguistic applications as well.
New embedding conveys complementary information for text based embeddings.
arXiv Detail & Related papers (2021-10-06T08:42:09Z)
- What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations than visual-only ones.
Our experiments show that our "muscly-supervised" representation outperforms a visual-only state-of-the-art method MoCo.
arXiv Detail & Related papers (2020-10-16T17:46:53Z)
- A Graph-based Interactive Reasoning for Human-Object Interaction Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.