Learning to Visually Connect Actions and their Effects
- URL: http://arxiv.org/abs/2401.10805v3
- Date: Fri, 26 Jul 2024 16:00:07 GMT
- Title: Learning to Visually Connect Actions and their Effects
- Authors: Paritosh Parmar, Eric Peh, Basura Fernando
- Abstract summary: We introduce the novel concept of visually Connecting Actions and Their Effects (CATE) in video understanding.
CATE can have applications in areas like task planning and learning from demonstration.
We demonstrate that CATE can be an effective self-supervised task for learning video representations from unlabeled videos.
- Score: 14.733204402684215
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce the novel concept of visually Connecting Actions and Their Effects (CATE) in video understanding. CATE can have applications in areas like task planning and learning from demonstration. We identify and explore two different aspects of the concept of CATE: Action Selection (AS) and Effect-Affinity Assessment (EAA), where video understanding models connect actions and effects at semantic and fine-grained levels, respectively. We design various baseline models for AS and EAA. Despite the intuitive nature of the task, we observe that models struggle, and humans outperform them by a large margin. Our experiments show that in solving AS and EAA, models learn intuitive properties like object tracking and pose encoding without explicit supervision. We demonstrate that CATE can be an effective self-supervised task for learning video representations from unlabeled videos. The study aims to showcase the fundamental nature and versatility of CATE, with the hope of inspiring advanced formulations and models.
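To make the two CATE tasks more concrete, below is a minimal, hedged sketch of how the Action Selection (AS) setup could be posed as a self-supervised objective: given features of an initial and a final state, score candidate action clips and train the model to pick the clip that plausibly caused the transition. The encoder layout, feature dimensions, and InfoNCE-style loss are illustrative assumptions, not the paper's actual baseline architecture.

```python
# Illustrative sketch of an Action Selection (AS) objective: connect an observed
# state transition (initial -> final) to the candidate action clip that caused it.
# All module names, dimensions, and the contrastive loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActionSelector(nn.Module):
    """Scores candidate action clips against an observed state transition."""

    def __init__(self, feat_dim: int = 512, embed_dim: int = 128):
        super().__init__()
        # Placeholder encoders; in practice these would be video/image backbones.
        self.effect_encoder = nn.Linear(feat_dim * 2, embed_dim)  # (initial, final) states
        self.action_encoder = nn.Linear(feat_dim, embed_dim)      # candidate action clips

    def forward(self, initial_state, final_state, candidate_actions):
        # initial_state, final_state: (B, feat_dim) pre-extracted frame features
        # candidate_actions: (B, K, feat_dim) features of K candidate action clips
        effect = self.effect_encoder(torch.cat([initial_state, final_state], dim=-1))
        actions = self.action_encoder(candidate_actions)
        effect = F.normalize(effect, dim=-1)
        actions = F.normalize(actions, dim=-1)
        # Cosine similarity between the state transition and each candidate action.
        return torch.einsum("bd,bkd->bk", effect, actions)  # (B, K) logits


# Self-supervised training signal: the clip that actually produced the transition
# (index 0 here, by convention) is the positive; the other candidates are negatives.
model = ActionSelector()
logits = model(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 8, 512))
loss = F.cross_entropy(logits, torch.zeros(4, dtype=torch.long))
```

In principle, the same embedding space could also support Effect-Affinity Assessment (EAA) by ranking candidate effects against a given action rather than the reverse; this is only a design sketch, not the paper's formulation.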
Related papers
- ActionCOMET: A Zero-shot Approach to Learn Image-specific Commonsense Concepts about Actions [66.20773952864802]
We develop a dataset consisting of 8.5k images and 59.3k inferences about actions grounded in those images.
We propose ActionCOMET, a framework to discern knowledge present in language models specific to the provided visual input.
arXiv Detail & Related papers (2024-10-17T15:22:57Z) - Implicit Affordance Acquisition via Causal Action-Effect Modeling in the Video Domain [5.188825486231326]
Recent findings indicate that world knowledge emerges through large-scale self-supervised pretraining.
We propose two novel pretraining tasks promoting the acquisition of two affordance properties in models.
We empirically demonstrate the effectiveness of our proposed methods in learning affordance properties.
arXiv Detail & Related papers (2023-12-18T16:51:26Z) - Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
A decoder then aggregates the features from all the clips in an online fashion for the final class prediction.
arXiv Detail & Related papers (2023-12-11T18:31:13Z) - Learning Action-Effect Dynamics from Pairs of Scene-graphs [50.72283841720014]
We propose a novel method that leverages scene-graph representation of images to reason about the effects of actions described in natural language.
Our proposed approach is effective in terms of performance, data efficiency, and generalization capability compared to existing models.
arXiv Detail & Related papers (2022-12-07T03:36:37Z) - Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) consecutively regulates the intermediate representation to produce a representation that emphasizes the novel information in the frame at the current time stamp.
SRL sharply outperforms the existing state of the art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z) - CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z) - Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models spatio-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z) - Modular Action Concept Grounding in Semantic Video Prediction [28.917125574895422]
We introduce the task of semantic action-conditional video prediction, which uses semantic action labels to describe interactions.
Inspired by the idea of Mixture of Experts, we embody each abstract label by a structured combination of various visual concept learners.
Our method is evaluated on two newly designed synthetic datasets and one real-world dataset.
arXiv Detail & Related papers (2020-11-23T04:12:22Z) - MS$^2$L: Multi-Task Self-Supervised Learning for Skeleton Based Action Recognition [36.74293548921099]
We integrate motion prediction, jigsaw puzzle recognition, and contrastive learning to learn skeleton features from different aspects (a rough multi-task loss sketch follows this list).
Our experiments on the NW-UCLA, NTU RGB+D, and PKUMMD datasets show strong performance on action recognition.
arXiv Detail & Related papers (2020-10-12T11:09:44Z)
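As referenced in the MS$^2$L entry above, the following is a rough sketch of what a multi-task self-supervised objective combining motion prediction, jigsaw-permutation recognition, and contrastive learning might look like for a shared skeleton encoder. All module choices, dimensions, and loss weights are illustrative assumptions rather than the paper's implementation.

```python
# Rough sketch of an MS^2L-style multi-task self-supervised objective for
# skeleton sequences: one shared encoder, three pretext heads, summed losses.
# All names, dimensions, and weights here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.GRU(input_size=75, hidden_size=256, batch_first=True)  # e.g. 25 joints x 3 coords
motion_head = nn.Linear(256, 75)   # regress next-frame joint coordinates
jigsaw_head = nn.Linear(256, 6)    # classify which of 3! segment permutations was applied
projector = nn.Linear(256, 128)    # projection for the contrastive term


def ms2l_style_loss(seq, permuted_seq, perm_label, seq_aug):
    """seq, seq_aug: (B, T, 75) skeleton sequences (seq_aug is an augmented view);
    permuted_seq: temporally shuffled copy of seq; perm_label: (B,) permutation ids."""
    _, h_full = encoder(seq)            # full sequence, for the contrastive view
    _, h_prefix = encoder(seq[:, :-1])  # all but the last frame, for motion prediction
    _, h_perm = encoder(permuted_seq)
    _, h_aug = encoder(seq_aug)
    # Take the final hidden state of each pass: (B, 256).
    h_full, h_prefix, h_perm, h_aug = (x[-1] for x in (h_full, h_prefix, h_perm, h_aug))

    # 1) Motion prediction: predict the held-out final frame from the prefix.
    l_motion = F.mse_loss(motion_head(h_prefix), seq[:, -1])
    # 2) Jigsaw puzzle: recognise which temporal permutation was applied.
    l_jigsaw = F.cross_entropy(jigsaw_head(h_perm), perm_label)
    # 3) Contrastive: pull the two views of the same sequence together (simplified InfoNCE).
    z = F.normalize(projector(h_full), dim=-1)
    z_aug = F.normalize(projector(h_aug), dim=-1)
    logits = z @ z_aug.t() / 0.1
    l_contrast = F.cross_entropy(logits, torch.arange(z.size(0)))
    return l_motion + l_jigsaw + l_contrast  # equal weights, purely illustrative
```

A shared encoder with lightweight task-specific heads keeps the pretext tasks from requiring separate backbones; the encoder is what would then be transferred to downstream action recognition.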