Evaluating Robustness of Visual Representations for Object Assembly Task Requiring Spatio-Geometrical Reasoning
- URL: http://arxiv.org/abs/2310.09943v3
- Date: Tue, 6 Feb 2024 20:17:13 GMT
- Title: Evaluating Robustness of Visual Representations for Object Assembly Task Requiring Spatio-Geometrical Reasoning
- Authors: Chahyon Ku, Carl Winge, Ryan Diaz, Wentao Yuan, Karthik Desingh
- Abstract summary: This paper focuses on evaluating and benchmarking the robustness of visual representations in the context of object assembly tasks.
We employ a general framework in visuomotor policy learning that utilizes visual pretraining models as vision encoders.
Our study investigates the robustness of this framework when applied to a dual-arm manipulation setup, specifically to grasp variations.
- Score: 8.626019848533707
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper primarily focuses on evaluating and benchmarking the robustness of
visual representations in the context of object assembly tasks. Specifically,
it investigates the alignment and insertion of objects with geometrical
extrusions and intrusions, commonly referred to as a peg-in-hole task. The
accuracy required to detect and orient the peg and the hole geometry in SE(3)
space for successful assembly poses significant challenges. Addressing this, we
employ a general framework in visuomotor policy learning that utilizes visual
pretraining models as vision encoders. Our study investigates the robustness of
this framework when applied to a dual-arm manipulation setup, specifically to
grasp variations. Our quantitative analysis shows that existing pretrained
models fail to capture the essential visual features necessary for this task.
However, a visual encoder trained from scratch consistently outperforms the
frozen pretrained models. Moreover, we discuss rotation representations and
associated loss functions that substantially improve policy learning. We
present a novel task scenario designed to evaluate the progress in visuomotor
policy learning, with a specific focus on improving the robustness of intricate
assembly tasks that require both geometrical and spatial reasoning. Videos,
additional experiments, dataset, and code are available at
https://bit.ly/geometric-peg-in-hole .
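The framework described above boils down to two concrete ingredients: a swappable vision encoder (a frozen pretrained model versus one trained from scratch) and a rotation output whose representation and loss function are chosen carefully. The following PyTorch sketch is a minimal illustration under our own assumptions, not the authors' released code (which is linked above): it pairs a generic encoder with a small head that predicts a translation plus the continuous 6D rotation representation of Zhou et al. (2019), supervised by a geodesic loss. The class and parameter names are hypothetical; the paper only states that the choice of rotation representation and loss substantially affects policy learning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def rotation_6d_to_matrix(d6: torch.Tensor) -> torch.Tensor:
    """Map the continuous 6D rotation representation (Zhou et al., 2019)
    to a 3x3 rotation matrix via Gram-Schmidt orthogonalization."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack((b1, b2, b3), dim=-2)


def geodesic_rotation_loss(R_pred: torch.Tensor, R_gt: torch.Tensor) -> torch.Tensor:
    """Mean geodesic distance on SO(3): arccos((trace(R_pred^T R_gt) - 1) / 2)."""
    trace = torch.einsum("...ij,...ij->...", R_pred, R_gt)
    cos = ((trace - 1.0) / 2.0).clamp(-1.0 + 1e-7, 1.0 - 1e-7)
    return torch.acos(cos).mean()


class VisuomotorPolicy(nn.Module):
    """Hypothetical policy: vision encoder (frozen pretrained or trained from
    scratch) followed by an MLP head predicting translation + 6D rotation."""

    def __init__(self, encoder: nn.Module, feat_dim: int, freeze_encoder: bool = False):
        super().__init__()
        self.encoder = encoder
        if freeze_encoder:  # frozen pretrained encoder vs. trained from scratch
            for p in self.encoder.parameters():
                p.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 3 + 6),  # 3D translation + 6D rotation
        )

    def forward(self, image: torch.Tensor):
        feat = self.encoder(image)  # assumed to return a (B, feat_dim) feature
        out = self.head(feat)
        t_pred = out[..., :3]
        R_pred = rotation_6d_to_matrix(out[..., 3:])
        return t_pred, R_pred


# Illustrative training objective: translation MSE plus rotation geodesic loss.
# loss = F.mse_loss(t_pred, t_gt) + geodesic_rotation_loss(R_pred, R_gt)
```

With freeze_encoder=True only the head receives gradients, which mirrors the frozen-pretrained versus trained-from-scratch comparison reported in the abstract.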
Related papers
- AugInsert: Learning Robust Visual-Force Policies via Data Augmentation for Object Assembly Tasks [7.631503105866245]
This paper primarily focuses on learning robust visual-force policies in the context of high-precision object assembly tasks.
We aim to learn contact-rich manipulation policies with multisensory inputs on limited expert data by expanding human demonstrations via online data augmentation.
arXiv Detail & Related papers (2024-10-19T04:19:52Z)
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- PEEKABOO: Hiding parts of an image for unsupervised object localization [7.161489957025654]
Localizing objects in an unsupervised manner poses significant challenges due to the absence of key visual information.
We propose a single-stage learning framework, dubbed PEEKABOO, for unsupervised object localization.
The key idea is to selectively hide parts of an image and leverage the remaining image information to infer the location of objects without explicit supervision.
arXiv Detail & Related papers (2024-07-24T20:35:20Z)
- Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944]
We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with gains of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z)
- Top-Down Visual Attention from Analysis by Synthesis [87.47527557366593]
We consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision.
We propose Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down modulated ViT model that variationally approximates AbS and achieves controllable top-down attention.
arXiv Detail & Related papers (2023-03-23T05:17:05Z)
- Information-Theoretic Odometry Learning [83.36195426897768]
We propose a unified information-theoretic framework for learning-motivated methods aimed at odometry estimation.
The proposed framework provides an elegant tool for performance evaluation and understanding in information-theoretic language.
arXiv Detail & Related papers (2022-03-11T02:37:35Z)
- Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z)
- Unadversarial Examples: Designing Objects for Robust Vision [100.4627585672469]
We develop a framework that exploits the sensitivity of modern machine learning algorithms to input perturbations in order to design "robust objects".
We demonstrate the efficacy of the framework on a wide variety of vision-based tasks ranging from standard benchmarks to (in-simulation) robotics.
arXiv Detail & Related papers (2020-12-22T18:26:07Z)
- Embodied Visual Active Learning for Semantic Segmentation [33.02424587900808]
We study the task of embodied visual active learning, where an agent is set to explore a 3D environment with the goal of acquiring visual scene understanding.
We develop a battery of agents, both learnt and pre-specified, with different levels of knowledge of the environment.
We extensively evaluate the proposed models using the Matterport3D simulator and show that a fully learnt method outperforms comparable pre-specified counterparts.
arXiv Detail & Related papers (2020-12-17T11:02:34Z)
- S3K: Self-Supervised Semantic Keypoints for Robotic Manipulation via Multi-View Consistency [11.357804868755155]
We advocate semantic 3D keypoints as a visual representation, and present a semi-supervised training objective.
Unlike local texture-based approaches, our model integrates contextual information from a large area.
We demonstrate that this ability to locate semantic keypoints enables high level scripting of human understandable behaviours.
arXiv Detail & Related papers (2020-09-30T14:44:54Z)