Implicit Affordance Acquisition via Causal Action-Effect Modeling in the Video Domain
- URL: http://arxiv.org/abs/2312.11345v1
- Date: Mon, 18 Dec 2023 16:51:26 GMT
- Title: Implicit Affordance Acquisition via Causal Action-Effect Modeling in the Video Domain
- Authors: Hsiu-Yu Yang and Carina Silberer
- Abstract summary: Recent findings indicate that world knowledge emerges through large-scale self-supervised pretraining.
We propose two novel pretraining tasks promoting the acquisition of two affordance properties in models.
We empirically demonstrate the effectiveness of our proposed methods in learning affordance properties.
- Score: 5.188825486231326
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Affordance knowledge is a fundamental aspect of commonsense knowledge. Recent
findings indicate that world knowledge emerges through large-scale
self-supervised pretraining, motivating our exploration of acquiring affordance
knowledge from the visual domain. To this end, we augment an existing
instructional video resource to create the new Causal Action-Effect (CAE)
dataset and design two novel pretraining tasks -- Masked Action Modeling (MAM)
and Masked Effect Modeling (MEM) -- promoting the acquisition of two affordance
properties in models: behavior and entity equivalence, respectively. We
empirically demonstrate the effectiveness of our proposed methods in learning
affordance properties. Furthermore, we show that a model pretrained on both
tasks outperforms a strong image-based visual-linguistic foundation model
(FLAVA) as well as pure linguistic models on a zero-shot physical reasoning
probing task.
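The abstract leaves the two pretraining objectives at a high level, so the sketch below is only an illustration of the general idea: a shared transformer encoder consumes (precondition clip, action text, effect clip) triples, MAM masks and recovers the action tokens, and MEM masks and reconstructs the effect-clip features. All module names, feature dimensions, masking choices, and losses here are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch of Masked Action Modeling (MAM) and Masked Effect
# Modeling (MEM) over (precondition clip, action text, effect clip) triples.
# All shapes, modules, and losses are assumptions for exposition only.
import torch
import torch.nn as nn

class CausalActionEffectModel(nn.Module):
    def __init__(self, vocab_size=30522, dim=256, n_layers=4, n_heads=8):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)      # action tokens
        self.video_proj = nn.Linear(1024, dim)                # clip features -> dim
        self.mask_text = nn.Parameter(torch.zeros(dim))       # [MASK] for MAM
        self.mask_video = nn.Parameter(torch.zeros(dim))      # [MASK] for MEM
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.text_head = nn.Linear(dim, vocab_size)           # recover action words
        self.video_head = nn.Linear(dim, 1024)                # recover effect features

    def forward(self, pre_clip, action_ids, eff_clip, task):
        """pre_clip/eff_clip: (B, T, 1024) clip features; action_ids: (B, L) token ids."""
        pre = self.video_proj(pre_clip)
        act = self.text_embed(action_ids)
        eff = self.video_proj(eff_clip)
        if task == "MAM":          # hide the action, keep pre-state and effect
            act = self.mask_text.expand_as(act)
        elif task == "MEM":        # hide the effect clip, keep pre-state and action
            eff = self.mask_video.expand_as(eff)
        hidden = self.encoder(torch.cat([pre, act, eff], dim=1))
        L = act.size(1)
        act_h = hidden[:, pre.size(1):pre.size(1) + L]
        eff_h = hidden[:, pre.size(1) + L:]
        return self.text_head(act_h), self.video_head(eff_h)

# One hypothetical training step for each objective, using random stand-ins
# for CAE video features and action token ids.
model = CausalActionEffectModel()
pre, eff = torch.randn(2, 8, 1024), torch.randn(2, 8, 1024)
action = torch.randint(0, 30522, (2, 6))
logits, _ = model(pre, action, eff, task="MAM")
mam_loss = nn.functional.cross_entropy(logits.flatten(0, 1), action.flatten())
_, recon = model(pre, action, eff, task="MEM")
mem_loss = nn.functional.mse_loss(recon, eff)
(mam_loss + mem_loss).backward()
```

In an actual pipeline, features extracted from the CAE dataset's instructional videos and a real tokenizer vocabulary would replace the random tensors used above.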
Related papers
- Are Visual-Language Models Effective in Action Recognition? A Comparative Study [22.97135293252601]
This paper provides a large-scale study of, and insights into, state-of-the-art vision foundation models.
It compares their ability to transfer to zero-shot and frame-wise action recognition tasks.
Experiments are conducted on recent fine-grained, human-centric action recognition datasets.
arXiv Detail & Related papers (2024-10-22T16:28:21Z)
- Learning to Visually Connect Actions and their Effects [14.733204402684215]
We introduce the novel concept of visually Connecting Actions and Their Effects (CATE) in video understanding.
CATE can have applications in areas like task planning and learning from demonstration.
We demonstrate that CATE can be an effective self-supervised task for learning video representations from unlabeled videos.
arXiv Detail & Related papers (2024-01-19T16:48:49Z)
- Masked Modeling for Self-supervised Representation Learning on Vision and Beyond [69.64364187449773]
Masked modeling has emerged as a distinctive approach that involves predicting parts of the original data that are proportionally masked during training.
We elaborate on the details of techniques within masked modeling, including diverse masking strategies, recovering targets, network architectures, and more.
We conclude by discussing the limitations of current techniques and pointing out several potential avenues for advancing masked modeling research. (A minimal sketch of the masked-and-recover recipe appears after the related-papers list below.)
arXiv Detail & Related papers (2023-12-31T12:03:21Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- Generative Model-based Feature Knowledge Distillation for Action Recognition [11.31068233536815]
Our paper introduces an innovative knowledge distillation framework that uses a generative model to train a lightweight student model.
The efficacy of our approach is demonstrated through comprehensive experiments on diverse popular datasets.
arXiv Detail & Related papers (2023-12-14T03:55:29Z)
- One-Shot Open Affordance Learning with Foundation Models [54.15857111929812]
We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category.
We propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings.
Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data.
arXiv Detail & Related papers (2023-11-29T16:23:06Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes [91.24112204588353]
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks.
In contrast to previous models, UViM has the same functional form for all tasks.
We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks.
arXiv Detail & Related papers (2022-05-20T17:47:59Z)
- Learning Task Informed Abstractions [10.920599910769276]
We propose learning Task Informed Abstractions (TIA) that explicitly separates reward-correlated visual features from distractors.
TIA leads to significant performance gains over state-of-the-art methods on many visual control tasks.
arXiv Detail & Related papers (2021-06-29T17:56:11Z)
- Goal-Aware Prediction: Learning to Model What Matters [105.43098326577434]
One of the fundamental challenges in using a learned forward dynamics model is the mismatch between the objective of the learned model and that of the downstream planner or policy.
We propose to direct prediction towards task relevant information, enabling the model to be aware of the current task and encouraging it to only model relevant quantities of the state space.
We find that our method more effectively models the relevant parts of the scene conditioned on the goal, and as a result outperforms standard task-agnostic dynamics models and model-free reinforcement learning.
arXiv Detail & Related papers (2020-07-14T16:42:59Z)
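As a companion to the masked-modeling survey entry above, the following is a minimal, generic sketch of the masked-and-recover recipe it describes: a proportion of the input patches is hidden and the network is trained to reconstruct them from the visible context. The patch size, masking ratio, and architecture are illustrative assumptions rather than any particular paper's design.

```python
# Minimal sketch of the generic masked-modeling recipe: randomly mask a
# proportion of input patches and train the network to recover them from
# the visible context. All choices below are assumptions for illustration.
import torch
import torch.nn as nn

class TinyMaskedModel(nn.Module):
    def __init__(self, patch_dim=768, dim=256, n_layers=2, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.decoder = nn.Linear(dim, patch_dim)   # recovering target: raw patches

    def forward(self, patches, mask):
        x = self.proj(patches)                                            # (B, N, dim)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        return self.decoder(self.encoder(x))                              # (B, N, patch_dim)

patches = torch.randn(4, 196, 768)     # e.g. 14x14 image or video patches
mask = torch.rand(4, 196) < 0.75       # mask 75% of positions (assumed ratio)
model = TinyMaskedModel()
recon = model(patches, mask)
loss = nn.functional.mse_loss(recon[mask], patches[mask])   # loss on masked patches only
loss.backward()
```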