Emergent Active Perception and Dexterity of Simulated Humanoids from Visual Reinforcement Learning
- URL: http://arxiv.org/abs/2505.12278v1
- Date: Sun, 18 May 2025 07:33:31 GMT
- Title: Emergent Active Perception and Dexterity of Simulated Humanoids from Visual Reinforcement Learning
- Authors: Zhengyi Luo, Chen Tessler, Toru Lin, Ye Yuan, Tairan He, Wenli Xiao, Yunrong Guo, Gal Chechik, Kris Kitani, Linxi Fan, Yuke Zhu
- Abstract summary: We introduce Perceptive Dexterous Control (PDC), a framework for vision-driven whole-body control with simulated humanoids. PDC operates solely on egocentric vision for task specification, enabling object search, target placement, and skill selection through visual cues. We show that training from scratch with reinforcement learning can produce emergent behaviors such as active search.
- Score: 69.71072181304066
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Human behavior is fundamentally shaped by visual perception -- our ability to interact with the world depends on actively gathering relevant information and adapting our movements accordingly. Behaviors like searching for objects, reaching, and hand-eye coordination naturally emerge from the structure of our sensory system. Inspired by these principles, we introduce Perceptive Dexterous Control (PDC), a framework for vision-driven dexterous whole-body control with simulated humanoids. PDC operates solely on egocentric vision for task specification, enabling object search, target placement, and skill selection through visual cues, without relying on privileged state information (e.g., 3D object positions and geometries). This perception-as-interface paradigm enables learning a single policy to perform multiple household tasks, including reaching, grasping, placing, and articulated object manipulation. We also show that training from scratch with reinforcement learning can produce emergent behaviors such as active search. These results demonstrate how vision-driven control and complex tasks induce human-like behaviors and can serve as the key ingredients in closing the perception-action loop for animation, robotics, and embodied AI.
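To make the perception-as-interface idea concrete, here is a minimal, hypothetical sketch (not the authors' implementation) of a policy that consumes only an egocentric RGB frame plus proprioception and outputs joint-space actions, with no privileged object state. The network sizes, the 84x84 image resolution, and the proprioception/action dimensions are illustrative assumptions.

```python
# Hypothetical sketch (not the PDC authors' code): an egocentric-vision-only
# humanoid control policy. Inputs are an egocentric RGB frame and proprioception;
# no privileged object positions or geometries are observed.
import torch
import torch.nn as nn

class EgocentricPolicy(nn.Module):
    def __init__(self, img_channels=3, proprio_dim=69, action_dim=53):
        super().__init__()
        # Small CNN encoder over the egocentric camera image.
        self.encoder = nn.Sequential(
            nn.Conv2d(img_channels, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():
            feat_dim = self.encoder(torch.zeros(1, img_channels, 84, 84)).shape[1]
        # Fuse visual features with proprioception; output joint-space actions.
        self.head = nn.Sequential(
            nn.Linear(feat_dim + proprio_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, egocentric_rgb, proprioception):
        # egocentric_rgb: (B, 3, H, W) in [0, 1]; proprioception: (B, proprio_dim)
        z = self.encoder(egocentric_rgb)
        return torch.tanh(self.head(torch.cat([z, proprioception], dim=-1)))

# Example rollout step with dummy tensors (dimensions are illustrative only).
policy = EgocentricPolicy()
action = policy(torch.rand(1, 3, 84, 84), torch.zeros(1, 69))
print(action.shape)  # torch.Size([1, 53])
```

In such a setup the reinforcement learning algorithm only ever sees pixels and proprioception, so behaviors like turning the head to search for a target object must emerge from the task reward rather than from hand-coded state access.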
Related papers
- Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms representative models in both objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z)
- Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection.
We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z)
- The Power of the Senses: Generalizable Manipulation from Vision and Touch through Masked Multimodal Learning [60.91637862768949]
We propose Masked Multimodal Learning (M3L) to fuse visual and tactile information in a reinforcement learning setting.
M3L learns a policy and visual-tactile representations based on masked autoencoding (a rough illustrative sketch of this masked-fusion idea appears after this list).
We evaluate M3L on three simulated environments with both visual and tactile observations.
arXiv Detail & Related papers (2023-11-02T01:33:00Z)
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- Visual-Tactile Multimodality for Following Deformable Linear Objects Using Reinforcement Learning [15.758583731036007]
We study the problem of using vision and tactile inputs together to complete the task of following deformable linear objects.
We create a reinforcement learning agent using different sensing modalities and investigate how its behaviour can be improved.
Our experiments show that the use of both vision and tactile inputs, together with proprioception, allows the agent to complete the task in up to 92% of cases.
arXiv Detail & Related papers (2022-03-31T21:59:08Z)
- Capturing the objects of vision with neural networks [0.0]
Human visual perception carves a scene at its physical joints, decomposing the world into objects.
Deep neural network (DNN) models of visual object recognition, by contrast, remain largely tethered to the sensory input.
We review related work in both fields and examine how these fields can help each other.
arXiv Detail & Related papers (2021-09-07T21:49:53Z)
- Learning Visually Guided Latent Actions for Assistive Teleoperation [9.75385535829762]
We develop assistive robots that condition their latent embeddings on visual inputs.
We show that incorporating object detectors pretrained on small amounts of cheap, easy-to-collect structured data enables i) accurately recognizing the current context and ii) generalizing control embeddings to new objects and tasks.
arXiv Detail & Related papers (2021-05-02T23:58:28Z)
- Generalization Through Hand-Eye Coordination: An Action Space for Learning Spatially-Invariant Visuomotor Control [67.23580984118479]
Imitation Learning (IL) is an effective framework to learn visuomotor skills from offline demonstration data.
Hand-eye Action Networks (HAN) can approximate human hand-eye coordination behaviors by learning from human teleoperated demonstrations.
arXiv Detail & Related papers (2021-02-28T01:49:13Z)
- What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations than visual-only ones.
Our experiments show that our "muscly-supervised" representation outperforms MoCo, a state-of-the-art visual-only method.
arXiv Detail & Related papers (2020-10-16T17:46:53Z)
- Active Perception and Representation for Robotic Manipulation [0.8315801422499861]
We present a framework that leverages the benefits of active perception to accomplish manipulation tasks.
Our agent uses viewpoint changes to localize objects, to learn state representations in a self-supervised manner, and to perform goal-directed actions.
Compared to vanilla deep Q-learning algorithms, our model is at least four times more sample-efficient.
arXiv Detail & Related papers (2020-03-15T01:43:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
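The Masked Multimodal Learning (M3L) entry above describes fusing visual and tactile observations via masked autoencoding. As a rough, hypothetical illustration (not the M3L authors' code), the sketch below masks a random subset of concatenated image-patch and tactile tokens and reconstructs them with a small transformer encoder, yielding a shared visual-tactile representation; token counts, embedding size, and the mask ratio are illustrative assumptions, and the policy-learning half of M3L is omitted.

```python
# Hypothetical sketch of masked autoencoding over concatenated visual and
# tactile tokens, as a rough illustration of multimodal fusion.
import torch
import torch.nn as nn

class MaskedMultimodalAE(nn.Module):
    def __init__(self, vis_tokens=16, tac_tokens=4, dim=128, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.n_tokens = vis_tokens + tac_tokens
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.randn(1, self.n_tokens, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(dim, dim)  # reconstruct token embeddings

    def forward(self, vis_tok, tac_tok):
        # vis_tok: (B, vis_tokens, dim) image-patch embeddings
        # tac_tok: (B, tac_tokens, dim) tactile-sensor embeddings
        x = torch.cat([vis_tok, tac_tok], dim=1) + self.pos
        mask = torch.rand(x.shape[:2], device=x.device) < self.mask_ratio
        x_masked = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        latent = self.encoder(x_masked)  # shared visual-tactile representation
        recon = self.decoder(latent)
        # Reconstruction loss only on the masked positions.
        loss = ((recon - x) ** 2).mean(-1)[mask].mean()
        return latent, loss

# Example with dummy embeddings (sizes are illustrative only).
model = MaskedMultimodalAE()
latent, loss = model(torch.randn(2, 16, 128), torch.randn(2, 4, 128))
print(latent.shape, loss.item())  # torch.Size([2, 20, 128])
```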