Related papers: A Brief Survey on Leveraging Large Scale Vision Models for Enhanced Robot Grasping

A Brief Survey on Leveraging Large Scale Vision Models for Enhanced Robot Grasping

URL: http://arxiv.org/abs/2406.11786v1
Date: Mon, 17 Jun 2024 17:39:30 GMT
Title: A Brief Survey on Leveraging Large Scale Vision Models for Enhanced Robot Grasping
Authors: Abhi Kamboj, Katherine Driggs-Campbell,
Abstract summary: Robotic grasping presents a difficult motor task in real-world scenarios. Recent advancements in computer vision have witnessed a growth of successful unsupervised training mechanisms. We investigate the potential benefits of large-scale visual pretraining in enhancing robot grasping performance.
Score: 4.7079226008262145
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Robotic grasping presents a difficult motor task in real-world scenarios, constituting a major hurdle to the deployment of capable robots across various industries. Notably, the scarcity of data makes grasping particularly challenging for learned models. Recent advancements in computer vision have witnessed a growth of successful unsupervised training mechanisms predicated on massive amounts of data sourced from the Internet, and now nearly all prominent models leverage pretrained backbone networks. Against this backdrop, we begin to investigate the potential benefits of large-scale visual pretraining in enhancing robot grasping performance. This preliminary literature review sheds light on critical challenges and delineates prospective directions for future research in visual pretraining for robotic manipulation.

Related papers

Is Single-View Mesh Reconstruction Ready for Robotics? [63.29645501232935]
This paper evaluates single-view mesh reconstruction models for creating digital twin environments in robot manipulation.<n>We establish benchmarking criteria for 3D reconstruction in robotics contexts.<n>Despite success on computer vision benchmarks, existing approaches fail to meet robotics-specific requirements.
arXiv Detail & Related papers (2025-05-23T14:35:56Z)
A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning [67.72413262980272]
Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear. We develop SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck. Our approach achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations.
arXiv Detail & Related papers (2025-03-10T06:18:31Z)
$π_0$: A Vision-Language-Action Flow Model for General Robot Control [77.32743739202543]
We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people, and its ability to acquire new skills via fine-tuning.
arXiv Detail & Related papers (2024-10-31T17:22:30Z)
SKT: Integrating State-Aware Keypoint Trajectories with Vision-Language Models for Robotic Garment Manipulation [82.61572106180705]
This paper presents a unified approach using vision-language models (VLMs) to improve keypoint prediction across various garment categories. We created a large-scale synthetic dataset using advanced simulation techniques, allowing scalable training without extensive real-world data. Experimental results indicate that the VLM-based method significantly enhances keypoint detection accuracy and task success rates.
arXiv Detail & Related papers (2024-09-26T17:26:16Z)
Semantically Controllable Augmentations for Generalizable Robot Learning [40.89398799604755]
Generalization to unseen real-world scenarios for robot manipulation requires exposure to diverse datasets during training. We propose a generative augmentation framework for semantically controllable augmentations and rapidly multiplying robot datasets.
arXiv Detail & Related papers (2024-09-02T05:25:34Z)
Towards Generalist Robot Learning from Internet Video: A Survey [56.621902345314645]
We present an overview of the emerging field of Learning from Videos (LfV) LfV aims to address the robotics data bottleneck by augmenting traditional robot data with large-scale internet video data. We provide a review of current methods for extracting knowledge from large-scale internet video, addressing key challenges in LfV, and boosting downstream robot and reinforcement learning via the use of video data.
arXiv Detail & Related papers (2024-04-30T15:57:41Z)
AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents [109.3804962220498]
AutoRT is a system to scale up the deployment of operational robots in completely unseen scenarios with minimal human supervision. We demonstrate AutoRT proposing instructions to over 20 robots across multiple buildings and collecting 77k real robot episodes via both teleoperation and autonomous robot policies. We experimentally show that such "in-the-wild" data collected by AutoRT is significantly more diverse, and that AutoRT's use of LLMs allows for instruction following data collection robots that can align to human preferences.
arXiv Detail & Related papers (2024-01-23T18:45:54Z)
Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods [14.780597545674157]
We investigate the effects of visual pre-training strategies on robot manipulation tasks from three fundamental perspectives. We propose a visual pre-training scheme for robot manipulation termed Vi-PRoM, which combines self-supervised learning and supervised learning.
arXiv Detail & Related papers (2023-08-07T14:24:52Z)
Scaling Robot Learning with Semantically Imagined Experience [21.361979238427722]
Recent advances in robot learning have shown promise in enabling robots to perform manipulation tasks. One of the key contributing factors to this progress is the scale of robot data used to train the models. We propose an alternative route and leverage text-to-image foundation models widely used in computer vision and natural language processing.
arXiv Detail & Related papers (2023-02-22T18:47:51Z)
RT-1: Robotics Transformer for Real-World Control at Scale [98.09428483862165]
We present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks.
arXiv Detail & Related papers (2022-12-13T18:55:15Z)
MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis [78.26022688167133]
We present a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis. The proposed dataset contains 100,000 images and 25 different object types. We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance.
arXiv Detail & Related papers (2021-12-29T17:23:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.