CRAVES: Controlling Robotic Arm with a Vision-based Economic System
- URL: http://arxiv.org/abs/1812.00725v3
- Date: Sat, 31 May 2025 16:10:58 GMT
- Title: CRAVES: Controlling Robotic Arm with a Vision-based Economic System
- Authors: Yiming Zuo, Weichao Qiu, Lingxi Xie, Fangwei Zhong, Yizhou Wang, Alan L. Yuille
- Abstract summary: Training a robotic arm to accomplish real-world tasks has been attracting increasing attention in both academia and industry. This work discusses the role of computer vision algorithms in this field. We present an alternative solution, which uses a 3D model to create a large amount of synthetic data.
- Score: 96.56564257199474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training a robotic arm to accomplish real-world tasks has been attracting increasing attention in both academia and industry. This work discusses the role of computer vision algorithms in this field. We focus on low-cost arms on which no sensors are equipped, so that all decisions are made upon visual recognition, e.g., real-time 3D pose estimation. This requires annotating a lot of training data, which is not only time-consuming but also laborious. In this paper, we present an alternative solution, which uses a 3D model to create a large amount of synthetic data, trains a vision model in this virtual domain, and applies it to real-world images after domain adaptation. To this end, we design a semi-supervised approach, which fully leverages the geometric constraints among keypoints, and apply an iterative algorithm for optimization. Without any annotations on real images, our algorithm generalizes well and produces satisfactory results on 3D pose estimation, which is evaluated on two real-world datasets. We also construct a vision-based control system for task accomplishment, for which we train a reinforcement learning agent in a virtual environment and apply it to the real world. Moreover, our approach, requiring merely a 3D model, has the potential to generalize to other types of multi-rigid-body dynamic systems. Website: https://qiuwch.github.io/craves.ai. Code: https://github.com/zuoym15/craves.ai
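The pipeline hinges on treating the arm as a known multi-rigid-body system whose keypoints are bound together by forward kinematics: any valid set of 2D keypoints must be the projection of a kinematically consistent configuration. Below is a minimal sketch of that idea, not the authors' implementation (see the linked repository for that); the toy two-link planar arm, link lengths, and camera intrinsics are hypothetical placeholders.

```python
"""Sketch: recover joint angles from 2D keypoint detections by iteratively
minimizing reprojection error under the arm's kinematic constraints."""
import numpy as np
from scipy.optimize import least_squares

K = np.array([[600.0, 0.0, 320.0],   # hypothetical pinhole intrinsics
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
LINK_LENGTHS = [0.3, 0.25]           # hypothetical link lengths (meters)

def forward_kinematics(joint_angles):
    """Toy planar 2-link chain at a fixed 1 m depth; a real system would
    use the arm's full 3D model. Returns base, elbow, end-effector (3 x 3)."""
    x = y = theta = 0.0
    pts = [(x, y)]
    for angle, length in zip(joint_angles, LINK_LENGTHS):
        theta += angle
        x += length * np.cos(theta)
        y += length * np.sin(theta)
        pts.append((x, y))
    return np.array([[px, py, 1.0] for px, py in pts])

def project(points_3d):
    """Pinhole projection of camera-frame 3D points to pixels."""
    uv = (K @ points_3d.T).T
    return uv[:, :2] / uv[:, 2:3]

def residuals(joint_angles, keypoints_2d):
    """The geometric constraint: predicted 2D keypoints must agree with a
    kinematically valid configuration of the rigid bodies."""
    return (project(forward_kinematics(joint_angles)) - keypoints_2d).ravel()

def estimate_pose(keypoints_2d, init_angles):
    """Iterative refinement of joint angles from 2D keypoint detections."""
    return least_squares(residuals, init_angles, args=(keypoints_2d,)).x

# Usage: recover joint angles from noisy detections of the 3 keypoints.
true_angles = np.array([0.4, -0.7])
noisy_uv = project(forward_kinematics(true_angles)) + np.random.randn(3, 2)
recovered = estimate_pose(noisy_uv, init_angles=np.zeros(2))
```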
Related papers
- Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation [14.013652439013692]
This paper presents a new paradigm, HERO, for object loco-manipulation with humanoid robots.
We achieve this by designing an accurate residual-aware EE tracking policy.
We use this accurate end-effector tracker to build a modular system for loco-manipulation.
arXiv Detail & Related papers (2026-02-18T18:55:02Z)
- PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation [48.807071017228964]
We introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared 3D space as 3D point flows.
With a real-time (0.1s) inference speed, PointWorld can be efficiently integrated in the model-predictive control (MPC) framework for manipulation.
We demonstrate that a single pre-trained checkpoint enables a real-world Franka robot to perform rigid-body pushing, deformable and articulated object manipulation.
arXiv Detail & Related papers (2026-01-07T10:29:12Z)
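To make the MPC integration concrete, here is a minimal sketch, not PointWorld's actual interface, of how a 3D world model that predicts per-point flows can sit inside a random-shooting MPC loop; `world_model`, its toy dynamics, and `goal_points` are hypothetical stand-ins.

```python
"""Sketch: sampling-based MPC over a learned point-flow world model."""
import numpy as np

def world_model(points, action):
    """Stand-in for the learned model: predicts per-point 3D flow for an
    action and returns the advanced point cloud (N x 3)."""
    flow = np.tanh(action).reshape(1, 3) * 0.01   # toy dynamics
    return points + flow

def cost(points, goal_points):
    """Task cost: mean distance between predicted and goal point clouds."""
    return np.mean(np.linalg.norm(points - goal_points, axis=1))

def mpc_step(points, goal_points, horizon=5, n_samples=256, rng=None):
    """Random-shooting MPC: sample action sequences, roll each through the
    world model, and execute the first action of the lowest-cost sequence."""
    rng = rng or np.random.default_rng()
    candidates = rng.uniform(-1.0, 1.0, size=(n_samples, horizon, 3))
    best_action, best_cost = None, np.inf
    for seq in candidates:
        rollout = points
        for action in seq:
            rollout = world_model(rollout, action)
        c = cost(rollout, goal_points)
        if c < best_cost:
            best_cost, best_action = c, seq[0]
    return best_action
```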
- Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots [55.43376513158555]
Camera Depth Models (CDMs) are a simple plugin for daily-use depth cameras.
We develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern.
Our experiments demonstrate, for the first time, that a policy trained on raw simulated depth generalizes seamlessly to real-world robots, without adding noise or real-world fine-tuning.
arXiv Detail & Related papers (2025-09-02T17:29:38Z)
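A minimal sketch, assumed rather than reproducing the paper's data engine, of the core idea: corrupt clean simulated depth with a modeled sensor-noise pattern to obtain paired (noisy, clean) training data for a depth-restoration network. The noise parameters below are illustrative only.

```python
"""Sketch: generate paired (noisy, clean) depth data from simulation."""
import numpy as np

def simulate_sensor_depth(clean_depth, rng=None):
    """Toy stereo-like noise model: depth-dependent Gaussian noise plus
    random dropout holes, both common artifacts of real depth cameras."""
    rng = rng or np.random.default_rng()
    noisy = clean_depth + rng.normal(0.0, 0.001, clean_depth.shape) * clean_depth**2
    holes = rng.random(clean_depth.shape) < 0.02   # 2% missing pixels
    noisy[holes] = 0.0
    return noisy

# Paired sample: a network learns noisy -> clean, then restores real depth.
clean = np.full((480, 640), 1.5)   # hypothetical flat surface at 1.5 m
noisy = simulate_sensor_depth(clean)
```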
- 4D Visual Pre-training for Robot Learning [71.22906081161324]
General visual representations learned from web-scale datasets for robotics have achieved great success in recent years.
However, these pre-trained representations are mostly built on 2D images, neglecting the inherent 3D nature of the world.
As an alternative, we seek a general visual pre-training framework that could improve all 3D representations.
Our framework, called FVP, is a novel 4D Visual Pre-training framework for real-world robot learning.
arXiv Detail & Related papers (2025-08-24T07:06:56Z)
- E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models [78.1674905950243]
We present the first comprehensive benchmark for 3D geometric foundation models (GFMs).
GFMs directly predict dense 3D representations in a single feed-forward pass, eliminating the need for slow or unavailable precomputed camera parameters.
We evaluate 16 state-of-the-art GFMs, revealing their strengths and limitations across tasks and domains.
All code, evaluation scripts, and processed data will be publicly released to accelerate research in 3D spatial intelligence.
arXiv Detail & Related papers (2025-06-02T17:53:09Z)
- 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks [19.026406684039006]
Recent work has demonstrated the capabilities of fine-tuning large Vision-Language Models to learn the mapping between RGB images, language instructions, and joint space control.
In this work, we explore methods to improve the scene context awareness of a popular recent Vision-Language-Action model.
Our proposed model, 3D-CAVLA, improves the success rate across various LIBERO task suites, achieving an average success rate of 98.1%.
arXiv Detail & Related papers (2025-05-09T05:32:40Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [111.16358607889609]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- NSLF-OL: Online Learning of Neural Surface Light Fields alongside Real-time Incremental 3D Reconstruction [0.76146285961466]
The paper proposes a novel Neural Surface Light Fields model that copes with a small range of training view directions while still producing good results in unseen directions.
Our model learns Neural Surface Light Fields (NSLF) online, alongside real-time 3D reconstruction, with a sequential data stream as the shared input.
In addition to online training, our model also provides real-time rendering after completing the data stream for visualization.
arXiv Detail & Related papers (2023-04-29T15:41:15Z)
- Markerless Camera-to-Robot Pose Estimation via Self-supervised Sim-to-Real Transfer [26.21320177775571]
We propose an end-to-end pose estimation framework that is capable of online camera-to-robot calibration and a self-supervised training method.
Our framework combines deep learning and geometric vision for solving the robot pose, and the pipeline is fully differentiable.
arXiv Detail & Related papers (2023-02-28T05:55:42Z)
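A minimal sketch, assuming PyTorch and a hypothetical 6-DoF pose parameterization, of what makes such a pipeline end-to-end differentiable: project robot keypoints under a candidate camera-to-robot pose and refine the pose by gradient descent on reprojection error. The keypoint detector is assumed to exist upstream.

```python
"""Sketch: differentiable camera-to-robot pose refinement."""
import torch

def reproject(pose_rt, points_robot, K):
    """Rotate/translate robot-frame 3D keypoints into the camera frame and
    project with intrinsics K; pose_rt = (axis-angle 3, translation 3)."""
    rotvec, t = pose_rt[:3], pose_rt[3:]
    theta = torch.linalg.norm(rotvec) + 1e-8
    k = rotvec / theta
    # Rodrigues' rotation formula, differentiable in pose_rt.
    K_mat = torch.stack([
        torch.stack([torch.zeros(()), -k[2], k[1]]),
        torch.stack([k[2], torch.zeros(()), -k[0]]),
        torch.stack([-k[1], k[0], torch.zeros(())]),
    ])
    R = torch.eye(3) + torch.sin(theta) * K_mat + (1 - torch.cos(theta)) * (K_mat @ K_mat)
    cam = points_robot @ R.T + t
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

def refine_pose(pose_init, points_robot, detected_uv, K, steps=200, lr=1e-2):
    """Self-supervised refinement: minimize reprojection error end to end."""
    pose = pose_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(
            reproject(pose, points_robot, K), detected_uv)
        loss.backward()
        opt.step()
    return pose.detach()
```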
- Visual Reinforcement Learning with Self-Supervised 3D Representations [15.991546692872841]
We present a unified framework for self-supervised learning of 3D representations for motor control.
Our method enjoys improved sample efficiency in simulated manipulation tasks compared to 2D representation learning methods.
arXiv Detail & Related papers (2022-10-13T17:59:55Z)
- Simple and Effective Synthesis of Indoor 3D Scenes [78.95697556834536]
We study the problem of synthesizing immersive 3D indoor scenes from one or more images.
Our aim is to generate high-resolution images and videos from novel viewpoints.
We propose an image-to-image GAN that maps directly from reprojections of incomplete point clouds to full high-resolution RGB-D images.
arXiv Detail & Related papers (2022-04-06T17:54:46Z)
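A minimal sketch, hypothetical rather than the paper's actual pipeline, of the reprojection step: splatting an incomplete colored point cloud into a novel view yields a sparse image with holes, which an image-to-image generator then completes; `generator` below is an identity placeholder for the trained GAN.

```python
"""Sketch: point-cloud reprojection producing the GAN's sparse input."""
import numpy as np

def reproject_point_cloud(points, colors, K, R, t, h=240, w=320):
    """Project colored 3D points into a target camera (K, R, t); pixels
    with no point remain zero, i.e., the holes the GAN must inpaint."""
    cam = points @ R.T + t
    front = cam[:, 2] > 1e-6                  # keep points in front of camera
    cam, colors = cam[front], colors[front]
    uv = cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = (0 <= u) & (u < w) & (0 <= v) & (v < h)
    image = np.zeros((h, w, 3))
    image[v[valid], u[valid]] = colors[valid]  # nearest-pixel splat, no z-buffer
    return image

def generator(sparse_rgb):
    """Stand-in for the image-to-image GAN that maps the sparse
    reprojection to a full high-resolution frame."""
    return sparse_rgb   # identity placeholder
```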
- 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.