Learning Long-term Visual Dynamics with Region Proposal Interaction Networks
- URL: http://arxiv.org/abs/2008.02265v5
- Date: Fri, 2 Apr 2021 20:12:04 GMT
- Title: Learning Long-term Visual Dynamics with Region Proposal Interaction Networks
- Authors: Haozhi Qi, Xiaolong Wang, Deepak Pathak, Yi Ma, Jitendra Malik
- Abstract summary: We build object representations that can capture inter-object and object-environment interactions over a long range.
Thanks to this simple yet effective object representation, our approach outperforms prior methods by a significant margin.
- Score: 75.06423516419862
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning long-term dynamics models is key to understanding physical
common sense. Most existing approaches to learning dynamics from visual input
sidestep long-term prediction by resorting to rapid re-planning with
short-term models. This not only requires such models to be highly accurate
but also limits them to tasks where an agent can continuously obtain feedback
and act at every step until completion. In this paper, we aim to leverage
ideas from success stories in visual recognition to build object
representations that capture inter-object and object-environment interactions
over a long range. To this end, we propose Region Proposal Interaction
Networks (RPIN), which reason about each object's trajectory in a latent
region-proposal feature space. Thanks to this simple yet effective object
representation, our approach outperforms prior methods by a significant
margin, both in prediction quality and in the ability to plan for downstream
tasks, and it also generalizes well to novel environments. Code, pre-trained
models, and more visualization results are available at https://haozhi.io/RPIN.
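As a concrete illustration of the architecture sketched in the abstract, here is a minimal PyTorch version of the RPIN idea: pool one feature vector per object with RoIAlign, apply a pairwise interaction step, and roll the latent object states forward to decode future boxes. This is not the authors' released code; the stand-in backbone, layer sizes, and box-decoding head are illustrative assumptions (the official implementation is linked above).

```python
# Minimal sketch of the RPIN idea (not the authors' code): RoIAlign pools a
# feature per object, an interaction step models pairwise effects, and the
# latent object states are rolled forward to predict future boxes.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class InteractionStep(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.pair = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, obj):                      # obj: (N, dim) object states
        n = obj.size(0)
        pairs = torch.cat([obj.unsqueeze(1).expand(n, n, -1),
                           obj.unsqueeze(0).expand(n, n, -1)], dim=-1)
        effect = self.pair(pairs).sum(dim=1)     # aggregate effects on each object
        return self.update(torch.cat([obj, effect], dim=-1))

class RPINSketch(nn.Module):
    def __init__(self, dim=64, roi=5):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, 3, stride=4, padding=1)  # stand-in CNN
        self.embed = nn.Linear(dim * roi * roi, dim)
        self.step = InteractionStep(dim)
        self.decode_box = nn.Linear(dim, 4)      # illustrative box head
        self.roi = roi

    def forward(self, image, boxes, horizon=10):
        feat = self.backbone(image)              # (1, dim, H/4, W/4)
        rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)
        obj = roi_align(feat, rois, (self.roi, self.roi), spatial_scale=0.25)
        obj = self.embed(obj.flatten(1))         # (N, dim) latent object states
        preds = []
        for _ in range(horizon):                 # long-range rollout in latent space
            obj = self.step(obj)
            preds.append(self.decode_box(obj))
        return torch.stack(preds)                # (horizon, N, 4)

img = torch.randn(1, 3, 128, 128)
boxes = torch.tensor([[10., 10., 40., 40.], [60., 60., 100., 100.]])
print(RPINSketch()(img, boxes).shape)            # torch.Size([10, 2, 4])
```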
Related papers
- Object-centric Video Representation for Long-term Action Anticipation [33.115854386196126]
The key motivation is that objects provide important cues for recognizing and predicting human-object interactions.
We propose to build object-centric video representations by leveraging visual-language pretrained models.
To recognize and predict human-object interactions, we use a Transformer-based neural architecture.
arXiv Detail & Related papers (2023-10-31T22:54:31Z)
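A hedged sketch of the pipeline the summary above describes: per-object token features (in practice from the frozen visual-language encoder the paper leverages; random tensors stand in here) are contextualized by a Transformer encoder that predicts a distribution over future actions. All module names and sizes are assumptions.

```python
# Hedged sketch of object-centric action anticipation: object tokens in,
# Transformer context, action logits out. Sizes are illustrative.
import torch
import torch.nn as nn

class ObjectAnticipator(nn.Module):
    def __init__(self, feat_dim=512, num_actions=20):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(feat_dim, num_actions)

    def forward(self, object_tokens):         # (B, num_objects * T, feat_dim)
        ctx = self.encoder(object_tokens)
        return self.head(ctx.mean(dim=1))     # pooled future-action logits

tokens = torch.randn(2, 12, 512)              # e.g. 4 objects over 3 frames
print(ObjectAnticipator()(tokens).shape)      # torch.Size([2, 20])
```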
- A Threefold Review on Deep Semantic Segmentation: Efficiency-oriented, Temporal and Depth-aware design [77.34726150561087]
We conduct a survey of the most relevant and recent advances in deep semantic segmentation in the context of vision for autonomous vehicles.
Our main objective is to provide a comprehensive discussion of the main methods, their advantages, limitations, results, and the challenges faced from each perspective.
arXiv Detail & Related papers (2023-03-08T01:29:55Z)
- Goal-driven Self-Attentive Recurrent Networks for Trajectory Prediction [31.02081143697431]
Human trajectory forecasting is a key component of autonomous vehicles, social-aware robots and video-surveillance applications.
We propose a lightweight attention-based recurrent backbone that acts solely on past observed positions.
We employ a common goal module, based on a U-Net architecture, which additionally extracts semantic information to predict scene-compliant destinations.
arXiv Detail & Related papers (2022-04-25T11:12:37Z)
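A minimal sketch of the goal-driven design summarized above: a lightweight GRU encodes past observed positions, a small convolutional encoder-decoder stands in for the paper's U-Net goal module to score candidate destinations on a scene map, and both condition the decoded future trajectory. All architectural details here are assumptions.

```python
# Hedged sketch of goal-driven trajectory prediction: recurrent encoding of
# past positions plus a goal heatmap (U-Net stand-in) conditioning the output.
import torch
import torch.nn as nn

class GoalDrivenPredictor(nn.Module):
    def __init__(self, hidden=64, horizon=12):
        super().__init__()
        self.encoder = nn.GRU(2, hidden, batch_first=True)
        self.goal_net = nn.Sequential(                  # U-Net stand-in
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 1, 3, padding=1))
        self.decoder = nn.Linear(hidden + 2, horizon * 2)
        self.horizon = horizon

    def forward(self, past_xy, scene):        # past_xy: (B, T, 2); scene: (B, 1, H, W)
        _, h = self.encoder(past_xy)          # h: (1, B, hidden)
        heat = self.goal_net(scene).flatten(1)
        idx = heat.argmax(dim=1)              # most likely goal cell
        w = scene.size(-1)
        goal = torch.stack([(idx % w).float(), (idx // w).float()], dim=1)
        out = self.decoder(torch.cat([h[0], goal], dim=1))
        return out.view(-1, self.horizon, 2)  # future (x, y) positions

past = torch.randn(2, 8, 2)
scene = torch.randn(2, 1, 32, 32)
print(GoalDrivenPredictor()(past, scene).shape)   # torch.Size([2, 12, 2])
```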
- Learning Dual Dynamic Representations on Time-Sliced User-Item Interaction Graphs for Sequential Recommendation [62.30552176649873]
We devise a novel Dynamic Representation Learning model for Sequential Recommendation (DRL-SRe).
To better model user-item interactions and characterize the dynamics on both sides, the proposed model builds a global user-item interaction graph for each time slice.
To enable the model to capture fine-grained temporal information, we propose an auxiliary temporal prediction task over consecutive time slices.
arXiv Detail & Related papers (2021-09-24T07:44:27Z)
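A hedged sketch of the time-slicing idea summarized above: interactions are bucketed into time slices, each slice yields its own user-item interaction graph, and one bipartite message-passing hop per slice refreshes both embedding tables. The dense adjacency and single-hop propagation rule are simplifying assumptions, not the paper's exact formulation.

```python
# Hedged sketch of time-sliced user-item graphs: one adjacency matrix per
# slice, one bipartite propagation hop per slice over both embedding tables.
import torch

def slice_graphs(interactions, num_users, num_items, num_slices):
    """interactions: (user, item, slice) triples -> dense per-slice adjacency."""
    adj = torch.zeros(num_slices, num_users, num_items)
    for u, i, t in interactions:
        adj[t, u, i] = 1.0
    return adj

def propagate(adj, user_emb, item_emb):
    """One bipartite message-passing hop per time slice."""
    users, items = [], []
    for a in adj:                                    # a: (num_users, num_items)
        users.append(user_emb + a @ item_emb)        # items -> users
        items.append(item_emb + a.t() @ user_emb)    # users -> items
    return torch.stack(users), torch.stack(items)    # per-slice states

inter = [(0, 1, 0), (0, 2, 1), (1, 2, 1)]
adj = slice_graphs(inter, num_users=2, num_items=3, num_slices=2)
u, v = torch.randn(2, 8), torch.randn(3, 8)
u_t, v_t = propagate(adj, u, v)
print(u_t.shape, v_t.shape)    # torch.Size([2, 2, 8]) torch.Size([2, 3, 8])
```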
- Model-Based Reinforcement Learning via Latent-Space Collocation [110.04005442935828]
We argue that it is easier to solve long-horizon tasks by planning sequences of states rather than just actions.
We adapt the idea of collocation, which has shown good results on long-horizon tasks in the optimal control literature, to the image-based setting by utilizing learned latent state-space models.
arXiv Detail & Related papers (2021-06-24T17:59:18Z)
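A toy sketch of latent-space collocation as summarized above: rather than optimizing an action sequence, optimize the full sequence of latent states directly, penalizing violations of the learned dynamics while pulling the final state toward a goal encoding. The quadratic costs and the linear stand-in dynamics are illustrative assumptions.

```python
# Hedged sketch of collocation in a learned latent space: the decision
# variable is the whole state sequence z_1..z_T, not the actions.
import torch

dim, T = 16, 10
dynamics = torch.nn.Linear(dim, dim)       # stand-in for a learned dynamics model
z = torch.randn(T, dim, requires_grad=True)
z_goal = torch.randn(dim)
opt = torch.optim.Adam([z], lr=0.05)

for step in range(200):
    opt.zero_grad()
    defect = z[1:] - dynamics(z[:-1])      # dynamics violation at each step
    loss = defect.pow(2).sum() + (z[-1] - z_goal).pow(2).sum()
    loss.backward()
    opt.step()

with torch.no_grad():
    defect = (z[1:] - dynamics(z[:-1])).norm()
print(f"final dynamics defect: {defect.item():.4f}")
```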
- Variational Structured Attention Networks for Deep Visual Representation Learning [49.80498066480928]
We propose a unified deep framework to jointly learn both spatial attention maps and channel attention in a principled manner.
Specifically, we integrate the estimation and the interaction of the attentions within a probabilistic representation learning framework.
We implement the inference rules within the neural network, thus allowing for end-to-end learning of both the probabilistic parameters and the CNN front-end parameters.
arXiv Detail & Related papers (2021-03-05T07:37:24Z)
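A simplified sketch of jointly applied spatial and channel attention in the spirit of the summary above; the paper's variational, probabilistic treatment is omitted here, leaving only deterministic attention maps, so this shows the structure rather than the method itself.

```python
# Hedged sketch of joint spatial + channel attention (deterministic only;
# the paper's probabilistic inference machinery is not reproduced here).
import torch
import torch.nn as nn

class JointAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv2d(channels, 1, kernel_size=1)
        self.channel = nn.Sequential(
            nn.Linear(channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, channels))

    def forward(self, x):                       # x: (B, C, H, W)
        s = torch.sigmoid(self.spatial(x))      # (B, 1, H, W) spatial map
        c = torch.sigmoid(self.channel(x.mean(dim=(2, 3))))  # (B, C) gates
        return x * s * c[:, :, None, None]      # jointly attended features

x = torch.randn(2, 32, 16, 16)
print(JointAttention(32)(x).shape)              # torch.Size([2, 32, 16, 16])
```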
- Where2Act: From Pixels to Actions for Articulated 3D Objects [54.19638599501286]
We extract highly localized actionable information related to elementary actions such as pushing or pulling for articulated objects with movable parts.
We propose a learning-from-interaction framework with an online data sampling strategy that allows us to train the network in simulation.
Our learned models even transfer to real-world data.
arXiv Detail & Related papers (2021-01-07T18:56:38Z)
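A hedged sketch of per-pixel actionability prediction in the spirit of Where2Act: a small convolutional network scores, for each pixel and each action primitive (e.g., push or pull), how promising an interaction at that location is. The architecture and the two primitives are assumptions; the paper additionally predicts per-pixel gripper orientations and interaction success.

```python
# Hedged sketch of per-pixel actionability scoring for action primitives.
import torch
import torch.nn as nn

class Actionability(nn.Module):
    def __init__(self, primitives=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, primitives, 1))

    def forward(self, image):                    # (B, 3, H, W)
        return torch.sigmoid(self.net(image))    # (B, primitives, H, W) scores

img = torch.randn(1, 3, 64, 64)
scores = Actionability()(img)
best = scores[0, 0].flatten().argmax()           # most actionable pixel for "push"
print(scores.shape, divmod(best.item(), 64))     # (row, col) of best pixel
```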
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.