Learning Action-Effect Dynamics from Pairs of Scene-graphs
- URL: http://arxiv.org/abs/2212.03433v1
- Date: Wed, 7 Dec 2022 03:36:37 GMT
- Title: Learning Action-Effect Dynamics from Pairs of Scene-graphs
- Authors: Shailaja Keyur Sampat, Pratyay Banerjee, Yezhou Yang and Chitta Baral
- Abstract summary: We propose a novel method that leverages scene-graph representation of images to reason about the effects of actions described in natural language.
Our proposed approach is effective in terms of performance, data efficiency, and generalization capability compared to existing models.
- Score: 50.72283841720014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 'Actions' play a vital role in how humans interact with the world. Thus,
autonomous agents that would assist us in everyday tasks also require the
capability to perform 'Reasoning about Actions & Change' (RAC). Recently, there
has been growing interest in the study of RAC with visual and linguistic
inputs. Graphs are often used to represent semantic structure of the visual
content (i.e. objects, their attributes and relationships among objects),
commonly referred to as scene-graphs. In this work, we propose a novel method
that leverages scene-graph representation of images to reason about the effects
of actions described in natural language. We experiment with the existing CLEVR_HYP
(Sampat et al., 2021) dataset and show that our proposed approach is effective
in terms of performance, data efficiency, and generalization capability
compared to existing models.
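As a rough illustration of the scene-graph idea described in the abstract (objects, their attributes, and the relationships among objects), below is a minimal Python sketch of a toy scene-graph and a hand-coded action-effect update. The node/edge schema and the `apply_action` rules are assumptions made purely for illustration; they are not the paper's model or the CLEVR_HYP data format.

```python
# Illustrative sketch only: a toy scene-graph and a symbolic action-effect
# update. The schema and rules below are assumptions for illustration,
# not the paper's actual method or dataset format.
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    # objects: object id -> attribute dict (e.g. color, shape, material)
    objects: dict = field(default_factory=dict)
    # relations: (subject_id, predicate, object_id) triples, e.g. ("o1", "left_of", "o2")
    relations: set = field(default_factory=set)


def apply_action(graph: SceneGraph, action: dict) -> SceneGraph:
    """Return a new scene-graph reflecting a simple, hand-written action effect.

    `action` is a hypothetical dict such as
    {"op": "paint", "target": "o1", "attribute": "color", "value": "red"}.
    """
    new_objects = {oid: dict(attrs) for oid, attrs in graph.objects.items()}
    new_relations = set(graph.relations)
    if action["op"] == "paint":
        # Change one attribute of the target object.
        new_objects[action["target"]][action["attribute"]] = action["value"]
    elif action["op"] == "remove":
        # Drop the object and every relation that mentions it.
        new_objects.pop(action["target"], None)
        new_relations = {r for r in new_relations
                         if action["target"] not in (r[0], r[2])}
    return SceneGraph(new_objects, new_relations)


if __name__ == "__main__":
    before = SceneGraph(
        objects={"o1": {"shape": "cube", "color": "blue"},
                 "o2": {"shape": "sphere", "color": "green"}},
        relations={("o1", "left_of", "o2")},
    )
    after = apply_action(before, {"op": "paint", "target": "o1",
                                  "attribute": "color", "value": "red"})
    print(after.objects["o1"])  # {'shape': 'cube', 'color': 'red'}
```

Note that the paper's approach, as the title indicates, learns action-effect dynamics from pairs of scene-graphs rather than hand-coding effect rules; the sketch only shows the kind of pre- and post-action states such a model would relate.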
Related papers
- Situational Scene Graph for Structured Human-centric Situation Understanding [15.91717913059569]
We propose a graph-based representation called Situational Scene Graph (SSG) to encode both human-object relationships and the corresponding semantic properties.
The semantic details are represented as predefined roles and values inspired by the situation frame, which was originally designed to represent a single action.
We will release the code and the dataset soon.
arXiv Detail & Related papers (2024-10-30T09:11:25Z)
- Text-Enhanced Zero-Shot Action Recognition: A training-free approach [13.074211474150914]
We propose Text-Enhanced Action Recognition (TEAR) for zero-shot video action recognition.
TEAR is training-free and does not require the availability of training data or extensive computational resources.
arXiv Detail & Related papers (2024-08-29T10:20:05Z)
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
- Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration [14.678931157058363]
We propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that effectively integrates the knowledge of a visual-language model to improve zero-shot HOI detection.
We develop an effective additive self-attention mechanism to generate more comprehensive visual representations.
Our model outperforms the previous methods in various zero-shot and fully-supervised settings.
arXiv Detail & Related papers (2024-03-12T02:07:23Z)
- Free-Form Composition Networks for Egocentric Action Recognition [97.02439848145359]
We propose a free-form composition network (FFCN) that can simultaneously learn disentangled verb, preposition, and noun representations.
The proposed FFCN can directly generate new training data samples for rare classes and hence significantly improves action recognition performance.
arXiv Detail & Related papers (2023-07-13T02:22:09Z)
- Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task [50.72283841720014]
We propose a novel learning strategy that can improve reasoning about the effects of actions.
We demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
arXiv Detail & Related papers (2022-12-07T05:41:58Z)
- What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations.
Our experiments show that our "muscly-supervised" representation outperforms MoCo, a visual-only state-of-the-art method.
arXiv Detail & Related papers (2020-10-16T17:46:53Z)
- A Graph-based Interactive Reasoning for Human-Object Interaction Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all of the above) and is not responsible for any consequences of its use.