Modular Action Concept Grounding in Semantic Video Prediction
- URL: http://arxiv.org/abs/2011.11201v4
- Date: Tue, 26 Apr 2022 13:31:26 GMT
- Title: Modular Action Concept Grounding in Semantic Video Prediction
- Authors: Wei Yu, Wenxin Chen, Songheng Yin, Steve Easterbrook, Animesh Garg
- Abstract summary: We introduce the task of semantic action-conditional video prediction, which uses semantic action labels to describe interactions.
Inspired by the idea of Mixture of Experts, we embody each abstract label by a structured combination of various visual concept learners.
Our method is evaluated on two newly designed synthetic datasets and one real-world dataset.
- Score: 28.917125574895422
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent works in video prediction have mainly focused on passive forecasting
and low-level action-conditional prediction, which sidesteps the learning of
interaction between agents and objects. We introduce the task of semantic
action-conditional video prediction, which uses semantic action labels to
describe those interactions and can be regarded as an inverse problem of action
recognition. The challenge of this new task primarily lies in how to
effectively inform the model of semantic action information. Inspired by the
idea of Mixture of Experts, we embody each abstract label by a structured
combination of various visual concept learners and propose a novel video
prediction model, Modular Action Concept Network (MAC). Our method is evaluated
on two newly designed synthetic datasets, CLEVR-Building-Blocks and
Sapien-Kitchen, and one real-world dataset called Tower-Creation. Extensive
experiments demonstrate that MAC can correctly condition on given instructions
and generate corresponding future frames without the need for bounding boxes. We
further show that the trained model can generalize out of distribution, be
quickly adapted to new object categories, and exploit its learned features for
object detection, showing progression towards higher-level cognitive abilities.
More visualizations can be found at
http://www.pair.toronto.edu/mac/.
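To make the label-conditioning idea concrete, the following is a minimal, hypothetical sketch of how a semantic action label could be grounded by a gated, Mixture-of-Experts-style combination of visual concept learners that modulate a video encoder's features. The module names (ConceptExpert, ActionConceptGrounding), the fixed (verb, subject, object) label decomposition, and the FiLM-style scale/shift modulation are illustrative assumptions, not the authors' MAC implementation.
```python
# Hypothetical sketch: Mixture-of-Experts-style grounding of a semantic action label.
# The (verb, subject, object) decomposition and FiLM-style modulation are assumptions.
import torch
import torch.nn as nn


class ConceptExpert(nn.Module):
    """One visual concept learner: maps a concept embedding to per-channel
    scale/shift parameters for the video encoder's feature map."""

    def __init__(self, concept_dim: int, feat_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(concept_dim, feat_channels),
            nn.ReLU(),
            nn.Linear(feat_channels, 2 * feat_channels),  # scale and shift
        )

    def forward(self, concept_emb: torch.Tensor) -> torch.Tensor:
        return self.net(concept_emb)  # (B, 2*C)


class ActionConceptGrounding(nn.Module):
    """Grounds a semantic action label, given as (verb, subject, object) ids,
    through a gated combination of concept experts (Mixture-of-Experts style)."""

    def __init__(self, vocab_size: int, concept_dim: int,
                 feat_channels: int, num_experts: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, concept_dim)
        self.experts = nn.ModuleList(
            [ConceptExpert(concept_dim, feat_channels) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(concept_dim, num_experts)

    def forward(self, label_ids: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
        # label_ids: (B, 3) concept ids; feat: (B, C, H, W) encoder features.
        B, C = feat.shape[:2]
        out = feat
        for k in range(label_ids.shape[1]):              # one pass per concept slot
            emb = self.embed(label_ids[:, k])            # (B, D)
            gates = torch.softmax(self.gate(emb), dim=-1)             # (B, E)
            params = torch.stack([e(emb) for e in self.experts], 1)   # (B, E, 2C)
            params = (gates.unsqueeze(-1) * params).sum(dim=1)        # (B, 2C)
            scale, shift = params.chunk(2, dim=-1)
            out = out * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        return out  # conditioned features, passed on to the frame decoder


# Usage: condition encoder features on a label like ("stack", "red_cube", "blue_block").
grounding = ActionConceptGrounding(vocab_size=50, concept_dim=32, feat_channels=64)
feat = torch.randn(2, 64, 16, 16)
label = torch.tensor([[3, 11, 27], [3, 12, 28]])
print(grounding(label, feat).shape)  # torch.Size([2, 64, 16, 16])
```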
Related papers
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
- CoProNN: Concept-based Prototypical Nearest Neighbors for Explaining Vision Models [1.0855602842179624]
We present a novel approach that enables domain experts to quickly create concept-based explanations for computer vision tasks intuitively via natural language.
The modular design of CoProNN is simple to implement, it is straightforward to adapt to novel tasks and allows for replacing the classification and text-to-image models.
We show that our strategy competes very well with other concept-based XAI approaches on coarse-grained image classification tasks and may even outperform those methods on more demanding fine-grained tasks.
arXiv Detail & Related papers (2024-04-23T08:32:38Z)
- Advancing Ante-Hoc Explainable Models through Generative Adversarial Networks [24.45212348373868]
This paper presents a novel concept learning framework for enhancing model interpretability and performance in visual classification tasks.
Our approach appends an unsupervised explanation generator to the primary classifier network and makes use of adversarial training.
This work presents a significant step towards building inherently interpretable deep vision models with task-aligned concept representations.
arXiv Detail & Related papers (2024-01-09T16:16:16Z)
- VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation [33.41226268323332]
Egocentric action anticipation is a challenging task that aims to make advanced predictions of future actions in the first-person view.
Most existing methods focus on improving the model architecture and loss function based on the visual input and recurrent neural network.
We propose a novel action anticipation framework that is enhanced by visual-semantic fusion and built on a Transformer-GRU backbone.
arXiv Detail & Related papers (2023-07-08T06:49:54Z)
- Visual Affordance Prediction for Guiding Robot Exploration [56.17795036091848]
We develop an approach for learning visual affordances for guiding robot exploration.
We use a Transformer-based model to learn a conditional distribution in the latent embedding space of a VQ-VAE.
We show how the trained affordance model can be used to guide exploration by acting as a goal-sampling distribution during visual goal-conditioned policy learning in robotic manipulation (a hedged sketch of this latent-prior pattern appears after this list).
arXiv Detail & Related papers (2023-05-28T17:53:09Z)
- H-SAUR: Hypothesize, Simulate, Act, Update, and Repeat for Understanding Object Articulations from Interactions [62.510951695174604]
"Hypothesize, Simulate, Act, Update, and Repeat" (H-SAUR) is a probabilistic generative framework that generates hypotheses about how objects articulate given input observations.
We show that the proposed model significantly outperforms the current state-of-the-art articulated object manipulation framework.
We further improve the test-time efficiency of H-SAUR by integrating a learned prior from learning-based vision models.
arXiv Detail & Related papers (2022-10-22T18:39:33Z)
- Reinforcement Learning with Action-Free Pre-Training from Videos [95.25074614579646]
We introduce a framework that learns representations useful for understanding the dynamics via generative pre-training on videos.
Our framework significantly improves both the final performance and the sample efficiency of vision-based reinforcement learning.
arXiv Detail & Related papers (2022-03-25T19:44:09Z)
- Spot What Matters: Learning Context Using Graph Convolutional Networks for Weakly-Supervised Action Detection [0.0]
We introduce an architecture based on self-attention and Convolutional Networks to improve human action detection in video.
Our model aids explainability by visualizing the learned context as an attention map, even for actions and objects unseen during training.
Experimental results show that our contextualized approach outperforms a baseline action detection approach by more than 2 points in Video-mAP.
arXiv Detail & Related papers (2021-07-28T21:37:18Z)
- Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z)
- Learning Long-term Visual Dynamics with Region Proposal Interaction Networks [75.06423516419862]
We build object representations that can capture inter-object and object-environment interactions over a long range.
Thanks to the simple yet effective object representation, our approach outperforms prior methods by a significant margin.
arXiv Detail & Related papers (2020-08-05T17:48:00Z)
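As referenced in the Visual Affordance Prediction entry above, one common way to learn a conditional distribution over a VQ-VAE's latent space is an autoregressive Transformer prior over the discrete code indices; sampled codes are then decoded into candidate goal images. The sketch below is a hedged illustration of that general pattern under assumed interfaces and sizes (the LatentCodePrior name, additive context conditioning, an 8x8 code grid); it is not the paper's implementation, and the VQ-VAE encoder/decoder are assumed to exist separately.
```python
# Hypothetical sketch: autoregressive Transformer prior over discrete VQ-VAE code
# indices, conditioned on a context embedding (e.g., the current observation).
import torch
import torch.nn as nn


class LatentCodePrior(nn.Module):
    """Conditional distribution over a sequence of VQ-VAE codebook indices."""

    def __init__(self, codebook_size: int = 512, seq_len: int = 64, d_model: int = 256):
        super().__init__()
        self.seq_len = seq_len
        self.bos = codebook_size                              # extra BOS token id
        self.tok = nn.Embedding(codebook_size + 1, d_model)
        self.pos = nn.Embedding(seq_len + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, codebook_size)

    def forward(self, codes: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # codes: (B, T) indices generated so far; context: (B, d_model).
        B, T = codes.shape
        bos = torch.full((B, 1), self.bos, dtype=torch.long, device=codes.device)
        x = self.tok(torch.cat([bos, codes], dim=1))
        x = x + self.pos(torch.arange(T + 1, device=codes.device))
        x = x + context.unsqueeze(1)                          # additive conditioning
        mask = nn.Transformer.generate_square_subsequent_mask(T + 1).to(codes.device)
        h = self.transformer(x, mask=mask)
        return self.head(h)                                   # next-code logits

    @torch.no_grad()
    def sample(self, context: torch.Tensor) -> torch.Tensor:
        codes = torch.zeros(context.shape[0], 0, dtype=torch.long)
        for _ in range(self.seq_len):
            logits = self.forward(codes, context)[:, -1]
            nxt = torch.multinomial(torch.softmax(logits, dim=-1), 1)
            codes = torch.cat([codes, nxt], dim=1)
        return codes  # decode with the VQ-VAE decoder to render a candidate goal image


prior = LatentCodePrior()
context = torch.randn(2, 256)
goal_codes = prior.sample(context)   # (2, 64) discrete latent indices
print(goal_codes.shape)
```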
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.