Observe Then Act: Asynchronous Active Vision-Action Model for Robotic Manipulation
- URL: http://arxiv.org/abs/2409.14891v2
- Date: Tue, 1 Oct 2024 15:31:23 GMT
- Title: Observe Then Act: Asynchronous Active Vision-Action Model for Robotic Manipulation
- Authors: Guokang Wang, Hang Li, Shuyuan Zhang, Yanhong Liu, Huaping Liu
- Abstract summary: Our model serially connects a camera Next-Best-View (NBV) policy with a gripper Next-Best Pose (NBP) policy, and trains them in a sensor-motor coordination framework using few-shot reinforcement learning.
This approach allows the agent to adjust a third-person camera to actively observe the environment based on the task goal, and subsequently infer the appropriate manipulation actions.
The results demonstrate that our model consistently outperforms baseline algorithms, showcasing its effectiveness in handling visual constraints in manipulation tasks.
- Score: 13.736566979493613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In real-world scenarios, many robotic manipulation tasks are hindered by occlusions and limited fields of view, posing significant challenges for passive observation-based models that rely on fixed or wrist-mounted cameras. In this paper, we investigate the problem of robotic manipulation under limited visual observation and propose a task-driven asynchronous active vision-action model. Our model serially connects a camera Next-Best-View (NBV) policy with a gripper Next-Best Pose (NBP) policy, and trains them in a sensor-motor coordination framework using few-shot reinforcement learning. This approach allows the agent to adjust a third-person camera to actively observe the environment based on the task goal, and subsequently infer the appropriate manipulation actions. We trained and evaluated our model on 8 viewpoint-constrained tasks in RLBench. The results demonstrate that our model consistently outperforms baseline algorithms, showcasing its effectiveness in handling visual constraints in manipulation tasks.
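The serial "observe, then act" structure described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `ToyScene`, `Observation`, `observe_then_act`, and the two hand-written stub policies are hypothetical names, and the stubs stand in for the learned NBV and NBP policies, which the paper trains with reinforcement learning. The sketch shows only the control flow (choose a camera view first, re-observe, then choose a gripper pose), not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Observation:
    viewpoint: int          # index of the camera viewpoint that produced this observation
    target: Optional[int]   # target position if visible from this viewpoint, else None


class ToyScene:
    """Hypothetical scene: the target is visible from exactly one viewpoint."""

    def __init__(self, target: int, visible_from: int) -> None:
        self.target = target
        self.visible_from = visible_from

    def render(self, viewpoint: int) -> Observation:
        seen = self.target if viewpoint == self.visible_from else None
        return Observation(viewpoint, seen)


def observe_then_act(
    scene: ToyScene,
    initial_view: int,
    nbv_policy: Callable[[Observation], int],
    nbp_policy: Callable[[Observation], Optional[int]],
) -> Tuple[int, Optional[int]]:
    """Run one observe-then-act step: pick a view, re-observe, pick a pose."""
    first = scene.render(initial_view)   # initial, possibly occluded observation
    next_view = nbv_policy(first)        # stage 1: camera next-best-view decision
    second = scene.render(next_view)     # observe again from the chosen view
    pose = nbp_policy(second)            # stage 2: gripper next-best-pose decision
    return next_view, pose


NUM_VIEWS = 3  # assumed number of discrete camera viewpoints


def stub_nbv(obs: Observation) -> int:
    # Stand-in for a learned policy: stay if the target is visible, else try the next view.
    return obs.viewpoint if obs.target is not None else (obs.viewpoint + 1) % NUM_VIEWS


def stub_nbp(obs: Observation) -> Optional[int]:
    # Stand-in for a learned policy: reach for the target only if it is visible.
    return obs.target


if __name__ == "__main__":
    scene = ToyScene(target=7, visible_from=1)
    print(observe_then_act(scene, initial_view=0, nbv_policy=stub_nbv, nbp_policy=stub_nbp))
```

In this toy setting the gripper stage can only succeed if the camera stage first selects a view from which the target is visible, which is the dependency the serial NBV-then-NBP design is meant to exploit.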
 
      
Related papers
- Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [72.00655365269]
 We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
 arXiv  Detail & Related papers  (2025-06-02T17:57:06Z)
- Learning Coordinated Bimanual Manipulation Policies using State Diffusion and Inverse Dynamics Models [22.826115023573205]
 We infuse the predictive nature of human manipulation strategies into robot imitation learning.
We train a diffusion model to predict future states and compute robot actions that achieve the predicted states.
Our framework consistently outperforms state-of-the-art state-to-action mapping policies.
 arXiv  Detail & Related papers  (2025-03-30T01:25:35Z)
- R-AIF: Solving Sparse-Reward Robotic Tasks from Pixels with Active Inference and World Models [50.19174067263255]
 We introduce prior preference learning techniques and self-revision schedules to help the agent excel in sparse-reward, continuous action, goal-based robotic control POMDP environments.
We show that our agents offer improved performance over state-of-the-art models in terms of cumulative rewards, relative stability, and success rate.
 arXiv  Detail & Related papers  (2024-09-21T18:32:44Z)
- RoboKoop: Efficient Control Conditioned Representations from Visual Input in Robotics using Koopman Operator [14.77553682217217]
 We introduce a Contrastive Spectral Koopman Embedding network that allows us to learn efficient linearized visual representations from the agent's visual data in a high dimensional latent space.
Our method enhances stability and control in gradient dynamics over time, significantly outperforming existing approaches.
 arXiv  Detail & Related papers  (2024-09-04T22:14:59Z)
- Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks [0.0]
 In this work, we focus on unsupervised vision-language-action mapping in the area of robotic manipulation.
We propose a model-invariant training alternative that improves the models' performance in a simulator by up to 55%.
Our work thus also sheds light on the potential benefits and limitations of using the current multimodal VAEs for unsupervised learning of robotic motion trajectories.
 arXiv  Detail & Related papers  (2024-04-02T13:25:16Z)
- Predictive Experience Replay for Continual Visual Control and Forecasting [62.06183102362871]
 We present a new continual learning approach for visual dynamics modeling and explore its efficacy in visual control and forecasting.
We first propose the mixture world model that learns task-specific dynamics priors with a mixture of Gaussians, and then introduce a new training strategy to overcome catastrophic forgetting.
Our model remarkably outperforms the naive combinations of existing continual learning and visual RL algorithms on DeepMind Control and Meta-World benchmarks with continual visual control tasks.
 arXiv  Detail & Related papers  (2023-03-12T05:08:03Z)
- Active Exploration for Robotic Manipulation [40.39182660794481]
 This paper proposes a model-based active exploration approach that enables efficient learning in sparse-reward robotic manipulation tasks.
We evaluate our proposed algorithm in simulation and on a real robot, trained from scratch with our method.
 arXiv  Detail & Related papers  (2022-10-23T18:07:51Z)
- H-SAUR: Hypothesize, Simulate, Act, Update, and Repeat for Understanding Object Articulations from Interactions [62.510951695174604]
 "Hypothesize, Simulate, Act, Update, and Repeat" (H-SAUR) is a probabilistic generative framework that generates hypotheses about how objects articulate given input observations.
We show that the proposed model significantly outperforms the current state-of-the-art articulated object manipulation framework.
We further improve the test-time efficiency of H-SAUR by integrating a learned prior from learning-based vision models.
 arXiv  Detail & Related papers  (2022-10-22T18:39:33Z)
- Model-Based Visual Planning with Self-Supervised Functional Distances [104.83979811803466]
 We present a self-supervised method for model-based visual goal reaching.
Our approach learns entirely using offline, unlabeled data.
We find that this approach substantially outperforms both model-free and model-based prior methods.
 arXiv  Detail & Related papers  (2020-12-30T23:59:09Z)
- Goal-Aware Prediction: Learning to Model What Matters [105.43098326577434]
 One of the fundamental challenges in using a learned forward dynamics model is the mismatch between the objective of the learned model and that of the downstream planner or policy.
We propose to direct prediction towards task relevant information, enabling the model to be aware of the current task and encouraging it to only model relevant quantities of the state space.
We find that our method more effectively models the relevant parts of the scene conditioned on the goal, and as a result outperforms standard task-agnostic dynamics models and model-free reinforcement learning.
 arXiv  Detail & Related papers  (2020-07-14T16:42:59Z)
- Goal-Conditioned End-to-End Visuomotor Control for Versatile Skill Primitives [89.34229413345541]
 We propose a conditioning scheme which avoids pitfalls by learning the controller and its conditioning in an end-to-end manner.
Our model predicts complex action sequences based directly on a dynamic image representation of the robot motion.
We report significant improvements in task success over representative MPC and IL baselines.
 arXiv  Detail & Related papers  (2020-03-19T15:04:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     