KINet: Unsupervised Forward Models for Robotic Pushing Manipulation
- URL: http://arxiv.org/abs/2202.09006v3
- Date: Sat, 5 Aug 2023 21:39:42 GMT
- Title: KINet: Unsupervised Forward Models for Robotic Pushing Manipulation
- Authors: Alireza Rezazadeh, Changhyun Choi
- Abstract summary: We introduce KINet -- an unsupervised framework to reason about object interactions based on a keypoint representation.
Our model learns to associate objects with keypoint coordinates and discovers a graph representation of the system.
By learning to perform physical reasoning in the keypoint space, our model automatically generalizes to scenarios with a different number of objects.
- Score: 8.572983995175909
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Object-centric representation is an essential abstraction for forward
prediction. Most existing forward models learn this representation through
extensive supervision (e.g., object class and bounding box) although such
ground-truth information is not readily accessible in reality. To address this,
we introduce KINet (Keypoint Interaction Network) -- an end-to-end unsupervised
framework to reason about object interactions based on a keypoint
representation. Using visual observations, our model learns to associate
objects with keypoint coordinates and discovers a graph representation of the
system as a set of keypoint embeddings and their relations. It then learns an
action-conditioned forward model using contrastive estimation to predict future
keypoint states. By learning to perform physical reasoning in the keypoint
space, our model automatically generalizes to scenarios with a different number
of objects, novel backgrounds, and unseen object geometries. Experiments
demonstrate the effectiveness of our model in accurately performing forward
prediction and learning plannable object-centric representations for downstream
robotic pushing manipulation tasks.
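As a rough illustration of the pipeline the abstract describes, the sketch below (not the authors' code; module names and sizes are illustrative assumptions) encodes keypoint coordinates into graph node embeddings, passes messages between nodes, conditions the update on the action, and trains with an InfoNCE-style contrastive loss between predicted and observed next keypoint states.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeypointForwardModel(nn.Module):
    """Hypothetical KINet-style forward model: keypoints -> graph -> next state."""
    def __init__(self, d_embed=64, d_action=4):
        super().__init__()
        # Node encoder: map each 2D keypoint coordinate to an embedding.
        self.node_enc = nn.Sequential(nn.Linear(2, d_embed), nn.ReLU(),
                                      nn.Linear(d_embed, d_embed))
        # Edge function over every ordered pair of node embeddings.
        self.edge_fn = nn.Sequential(nn.Linear(2 * d_embed, d_embed), nn.ReLU())
        # Node update conditioned on aggregated messages and the action.
        self.update = nn.Sequential(nn.Linear(2 * d_embed + d_action, d_embed),
                                    nn.ReLU(), nn.Linear(d_embed, d_embed))

    def forward(self, kp, action):
        # kp: (B, K, 2) keypoint coordinates; action: (B, d_action).
        h = self.node_enc(kp)                                  # (B, K, D)
        K = h.shape[1]
        src = h.unsqueeze(2).expand(-1, -1, K, -1)             # sender nodes
        dst = h.unsqueeze(1).expand(-1, K, -1, -1)             # receiver nodes
        msg = self.edge_fn(torch.cat([src, dst], -1)).sum(2)   # aggregate messages
        a = action.unsqueeze(1).expand(-1, K, -1)
        return self.update(torch.cat([h, msg, a], -1))         # predicted embeddings

def contrastive_loss(pred, target, temperature=0.1):
    # InfoNCE over the batch: each prediction should match the embedding of
    # its own next observation rather than other samples' embeddings.
    pred = F.normalize(pred.flatten(1), dim=-1)
    target = F.normalize(target.flatten(1), dim=-1)
    logits = pred @ target.t() / temperature
    return F.cross_entropy(logits, torch.arange(len(pred), device=logits.device))
```

Because the graph is built over keypoints rather than object identities, the same model applies unchanged when the number of keypoints (and hence objects) varies between scenes.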
Related papers
- Keypoint Abstraction using Large Models for Object-Relative Imitation Learning [78.92043196054071]
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics.
Keypoint-based representations have proven effective as a succinct way to capture essential object features.
We propose KALM, a framework that leverages large pre-trained vision-language models to automatically generate task-relevant and cross-instance consistent keypoints.
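One ingredient of this idea, sketched loosely below, is scoring candidate keypoints for cross-instance consistency. KALM itself queries large pre-trained vision-language models, so the simple descriptor-similarity voting here (function names and shapes are illustrative assumptions) is only a stand-in.

```python
import torch
import torch.nn.functional as F

def consistent_keypoints(feats, k=5):
    # feats: (I, C, D) descriptors for C candidate keypoints on I instances.
    f = F.normalize(feats, dim=-1)
    # Similarity of every candidate to every candidate on every instance.
    sim = torch.einsum("icd,jed->ijce", f, f)      # (I, I, C, C)
    # Score a candidate by its best match per instance, averaged over instances.
    score = sim.max(dim=-1).values.mean(dim=1)     # (I, C)
    return score.topk(k, dim=-1).indices           # k most consistent per instance
```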
arXiv Detail & Related papers (2024-10-30T17:37:31Z)
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- Object-centric Video Representation for Long-term Action Anticipation [33.115854386196126]
The key motivation is that objects provide important cues for recognizing and predicting human-object interactions.
We propose to build object-centric video representations by leveraging visual-language pretrained models.
To recognize and predict human-object interactions, we use a Transformer-based neural architecture.
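The stated recipe can be pictured with a generic sketch like the following (layer sizes, token layout, and action vocabulary are assumptions, not the paper's configuration): object tokens derived from pretrained visual features feed a Transformer encoder whose pooled output anticipates a future action.

```python
import torch
import torch.nn as nn

class ObjectCentricAnticipator(nn.Module):
    def __init__(self, d=256, n_actions=100):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, n_actions)

    def forward(self, obj_tokens):
        # obj_tokens: (B, T*O, d) object features over T frames, O objects each.
        h = self.encoder(obj_tokens)
        return self.head(h.mean(dim=1))  # logits over anticipated actions
```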
arXiv Detail & Related papers (2023-10-31T22:54:31Z)
- H-SAUR: Hypothesize, Simulate, Act, Update, and Repeat for Understanding Object Articulations from Interactions [62.510951695174604]
"Hypothesize, Simulate, Act, Update, and Repeat" (H-SAUR) is a probabilistic generative framework that generates hypotheses about how objects articulate given input observations.
We show that the proposed model significantly outperforms the current state-of-the-art articulated object manipulation framework.
We further improve the test-time efficiency of H-SAUR by integrating a learned prior from learning-based vision models.
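The loop in the name can be made concrete with a toy sketch (the stand-in simulator, hypotheses, and all numbers below are invented for illustration): keep a belief over articulation hypotheses, simulate what each would predict for an action, act, and reweight the belief by the observed outcome.

```python
import numpy as np

def simulate(hypothesis, action):
    # Stand-in for a physics simulator: predicted handle displacement if the
    # joint were prismatic (slides) vs. revolute (rotates) under a pull.
    return 1.0 if hypothesis == "prismatic" else 0.3

def likelihood(observed, predicted, sigma=0.2):
    return np.exp(-0.5 * ((observed - predicted) / sigma) ** 2)

def h_saur_step(belief, action, observed_motion):
    # Bayesian update: weight each hypothesis by how well its simulated
    # outcome explains what the interaction actually produced.
    posterior = {h: p * likelihood(observed_motion, simulate(h, action))
                 for h, p in belief.items()}
    z = sum(posterior.values())
    return {h: p / z for h, p in posterior.items()}

belief = {"revolute": 0.5, "prismatic": 0.5}      # uniform prior
for observed in [0.9, 1.1, 0.95]:                  # repeated interactions
    belief = h_saur_step(belief, "pull", observed)
print(belief)  # probability mass concentrates on 'prismatic'
```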
arXiv Detail & Related papers (2022-10-22T18:39:33Z)
- Suspected Object Matters: Rethinking Model's Prediction for One-stage Visual Grounding [93.82542533426766]
We propose a Suspected Object Transformation mechanism (SOT) to encourage selection of the target object from among the suspected ones.
SOT can be seamlessly integrated into existing CNN and Transformer-based one-stage visual grounders.
Extensive experiments demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2022-03-10T06:41:07Z)
- Flexible Networks for Learning Physical Dynamics of Deformable Objects [2.567499374977917]
We propose a model named time-wise PointNet (TP-Net) to infer the future state of a deformable object with particle-based representation.
TP-Net consists of a shared feature extractor that extracts global features from each input point set in parallel and a prediction network that aggregates and reasons on these features for future prediction.
Experiments demonstrate that our model achieves state-of-the-art performance on both synthetic and real-world datasets, with real-time prediction speed.
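The two-part structure described above might look roughly like the sketch below (layer sizes and the displacement-style output are assumptions): a PointNet-style shared MLP extracts a global feature from each time step's point set in parallel, and a prediction head combines per-point and global features to move every particle forward.

```python
import torch
import torch.nn as nn

class TPNetSketch(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        # Shared feature extractor applied to every point independently.
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                       nn.Linear(64, d))
        # Prediction network over per-point + per-step global features.
        self.pred = nn.Sequential(nn.Linear(3 * d, 128), nn.ReLU(),
                                  nn.Linear(128, 3))

    def forward(self, pts_t0, pts_t1):
        # pts_*: (B, N, 3) particle positions at consecutive time steps.
        f1 = self.point_mlp(pts_t1)                      # per-point features
        g0 = self.point_mlp(pts_t0).max(dim=1).values    # global feature, step t0
        g1 = f1.max(dim=1).values                        # global feature, step t1
        ctx = torch.cat([g0, g1], -1).unsqueeze(1).expand(-1, f1.shape[1], -1)
        delta = self.pred(torch.cat([f1, ctx], -1))      # per-point displacement
        return pts_t1 + delta                            # predicted next positions
```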
arXiv Detail & Related papers (2021-12-07T14:34:52Z)
- Learning Models as Functionals of Signed-Distance Fields for Manipulation Planning [51.74463056899926]
This work proposes an optimization-based manipulation planning framework where the objectives are learned functionals of signed-distance fields that represent objects in the scene.
We show that representing objects as signed-distance fields enables learning and representing a variety of models with higher accuracy than point-cloud and occupancy-measure representations.
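A toy example of the underlying idea (the sphere SDF and contact objective below are hand-coded stand-ins; the paper learns such functionals from data): because a signed-distance field is differentiable in the query point, a manipulation objective written over it can be optimized directly with gradients.

```python
import torch

def sphere_sdf(p, center, radius=0.05):
    # Signed distance to a sphere: negative inside, zero on the surface.
    return torch.linalg.norm(p - center, dim=-1) - radius

center = torch.tensor([0.3, 0.0, 0.1])
gripper = torch.tensor([0.5, 0.2, 0.3], requires_grad=True)
opt = torch.optim.Adam([gripper], lr=1e-2)

for _ in range(300):
    opt.zero_grad()
    loss = sphere_sdf(gripper, center) ** 2   # objective: reach the surface
    loss.backward()
    opt.step()

print(sphere_sdf(gripper, center).item())     # ~0: gripper touches the object
```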
arXiv Detail & Related papers (2021-10-02T12:36:58Z)
- SORNet: Spatial Object-Centric Representations for Sequential Manipulation [39.88239245446054]
Sequential manipulation tasks require a robot to perceive the state of an environment and plan a sequence of actions leading to a desired goal state.
We propose SORNet, which extracts object-centric representations from RGB images conditioned on canonical views of the objects of interest.
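A schematic version of that conditioning (patch sizes, dimensions, and module names below are assumptions, not the released SORNet architecture): embeddings of the canonical object views act as queries that cross-attend into scene features, yielding one embedding per object of interest.

```python
import torch
import torch.nn as nn

class CanonicalViewReadout(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.scene_enc = nn.Conv2d(3, d, kernel_size=16, stride=16)  # patch tokens
        self.obj_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, d))
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

    def forward(self, scene_rgb, canonical_views):
        # scene_rgb: (B, 3, 128, 128); canonical_views: (B, M, 3, 32, 32).
        tokens = self.scene_enc(scene_rgb).flatten(2).transpose(1, 2)  # (B, 64, d)
        B, M = canonical_views.shape[:2]
        queries = self.obj_enc(canonical_views.flatten(0, 1)).view(B, M, -1)
        # Cross-attention grounds each object query in the current scene.
        obj_emb, _ = self.attn(queries, tokens, tokens)
        return obj_emb  # (B, M, d): one object-centric embedding per query
```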
arXiv Detail & Related papers (2021-09-08T19:36:29Z)
- Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
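A minimal sketch of prediction as the sole training signal (the slot layout and plain linear modules here are simplifying assumptions): the frame is encoded into a few object components, rolled forward by a dynamics module, and trained only to reconstruct the next frame, with no labels or pretrained networks involved.

```python
import torch
import torch.nn as nn

class PredictiveSlots(nn.Module):
    def __init__(self, n_slots=4, d=64, hw=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * hw * hw, n_slots * d))
        self.dyn = nn.Linear(d, d)                      # per-slot dynamics
        self.dec = nn.Linear(n_slots * d, 3 * hw * hw)  # frame decoder
        self.n_slots, self.d, self.hw = n_slots, d, hw

    def forward(self, frame):
        slots = self.enc(frame).view(-1, self.n_slots, self.d)  # object slots
        next_slots = self.dyn(slots)                            # roll forward
        out = self.dec(next_slots.flatten(1))
        return out.view(-1, 3, self.hw, self.hw)                # predicted frame

model = PredictiveSlots()
f_t, f_t1 = torch.rand(2, 8, 3, 32, 32)            # consecutive frame batches
loss = nn.functional.mse_loss(model(f_t), f_t1)    # prediction is the only signal
```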
arXiv Detail & Related papers (2021-03-09T19:14:33Z)
- Multi-Modal Learning of Keypoint Predictive Models for Visual Object Manipulation [6.853826783413853]
Humans have impressive generalization capabilities when it comes to manipulating objects in novel environments, in part because they maintain internal models (body schemas) of their bodies and of any grasped object.
How to learn such body schemas for robots remains an open problem.
We develop a self-supervised approach that can extend a robot's kinematic model when grasping an object, using visual latent representations.
arXiv Detail & Related papers (2020-11-08T01:04:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.