Learning Vision-Guided Quadrupedal Locomotion End-to-End with
Cross-Modal Transformers
- URL: http://arxiv.org/abs/2107.03996v1
- Date: Thu, 8 Jul 2021 17:41:55 GMT
- Title: Learning Vision-Guided Quadrupedal Locomotion End-to-End with
Cross-Modal Transformers
- Authors: Ruihan Yang, Minghao Zhang, Nicklas Hansen, Huazhe Xu, Xiaolong Wang
- Abstract summary: We propose to address quadrupedal locomotion tasks using Reinforcement Learning (RL).
We introduce LocoTransformer, an end-to-end RL method for quadrupedal locomotion.
- Score: 14.509254362627576
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose to address quadrupedal locomotion tasks using Reinforcement
Learning (RL) with a Transformer-based model that learns to combine
proprioceptive information and high-dimensional depth sensor inputs. While
learning-based locomotion has made great advances using RL, most methods still
rely on domain randomization for training blind agents that generalize to
challenging terrains. Our key insight is that proprioceptive states only offer
contact measurements for immediate reaction, whereas an agent equipped with
visual sensory observations can learn to proactively maneuver environments with
obstacles and uneven terrain by anticipating changes in the environment many
steps ahead. In this paper, we introduce LocoTransformer, an end-to-end RL
method for quadrupedal locomotion that leverages a Transformer-based model for
fusing proprioceptive states and visual observations. We evaluate our method in
challenging simulated environments with different obstacles and uneven terrain.
We show that our method obtains significant improvements over policies with
only proprioceptive state inputs, and that Transformer-based models further
improve generalization across environments. Our project page with videos is at
https://RchalYang.github.io/LocoTransformer .
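The abstract describes fusing proprioceptive states with depth-image observations through shared Transformer self-attention before predicting actions. Below is a minimal, hypothetical PyTorch sketch of such a cross-modal fusion policy. It is not the authors' released implementation, and all dimensions (93-D proprioceptive state, 64x64 depth image, 12-D action, two encoder layers) are illustrative assumptions rather than values taken from the paper.

```python
# Hedged sketch of a cross-modal Transformer policy in the spirit of LocoTransformer.
# Layer sizes, token layout, and the pooling/policy head are assumptions for illustration.
import torch
import torch.nn as nn


class CrossModalTransformer(nn.Module):
    def __init__(self, proprio_dim=93, embed_dim=128, num_layers=2, num_heads=4,
                 image_size=64, patch_size=16, action_dim=12):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Proprioceptive state -> a single token.
        self.proprio_proj = nn.Sequential(
            nn.Linear(proprio_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )
        # Depth image -> patch tokens via a strided convolution.
        self.patch_embed = nn.Conv2d(1, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        # Shared Transformer encoder fuses both modalities through self-attention.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Policy head (e.g. the mean of a Gaussian action distribution for an RL algorithm).
        self.policy_head = nn.Linear(embed_dim, action_dim)

    def forward(self, proprio, depth):
        # proprio: (B, proprio_dim), depth: (B, 1, H, W)
        proprio_token = self.proprio_proj(proprio).unsqueeze(1)            # (B, 1, D)
        patch_tokens = self.patch_embed(depth).flatten(2).transpose(1, 2)  # (B, N, D)
        tokens = torch.cat([proprio_token, patch_tokens], dim=1) + self.pos_embed
        fused = self.encoder(tokens)
        # Pool the fused tokens and predict the action.
        return self.policy_head(fused.mean(dim=1))


if __name__ == "__main__":
    model = CrossModalTransformer()
    action = model(torch.randn(8, 93), torch.randn(8, 1, 64, 64))
    print(action.shape)  # torch.Size([8, 12])
```

In this sketch the single proprioceptive token attends to every depth patch token (and vice versa), which is one straightforward way to let contact information and anticipated terrain geometry condition each other before the action is predicted.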
Related papers
- Offline Adaptation of Quadruped Locomotion using Diffusion Models [59.882275766745295]
We present a diffusion-based approach to quadrupedal locomotion that simultaneously addresses the limitations of learning and interpolating between multiple skills.
We show that these capabilities are compatible with a multi-skill policy and can be applied with little modification and minimal compute overhead.
We verify the validity of our approach with hardware experiments on the ANYmal quadruped platform.
arXiv Detail & Related papers (2024-11-13T18:12:15Z) - Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model.
Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model.
arXiv Detail & Related papers (2024-09-28T13:24:11Z) - Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z) - Policy Pre-training for End-to-end Autonomous Driving via
Self-supervised Geometric Modeling [96.31941517446859]
We propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for policy pre-training in visuomotor driving.
We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos.
In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input.
In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only.
arXiv Detail & Related papers (2023-01-03T08:52:49Z) - Learning to Walk by Steering: Perceptive Quadrupedal Locomotion in
Dynamic Environments [25.366480092589022]
A quadrupedal robot must exhibit robust and agile walking behaviors in response to environmental clutter and moving obstacles.
We present a hierarchical learning framework, named PRELUDE, which decomposes the problem of perceptive locomotion into high-level decision-making and low-level gait generation.
We demonstrate the effectiveness of our approach in simulation and with hardware experiments.
arXiv Detail & Related papers (2022-09-19T17:55:07Z) - An Adaptable Approach to Learn Realistic Legged Locomotion without
Examples [38.81854337592694]
This work proposes a generic approach for ensuring realism in locomotion by guiding the learning process with the spring-loaded inverted pendulum model as a reference.
We present experimental results showing that even in a model-free setup, the learned policies can generate realistic and energy-efficient locomotion gaits for a bipedal and a quadrupedal robot.
arXiv Detail & Related papers (2021-10-28T10:14:47Z) - Vision-Guided Quadrupedal Locomotion in the Wild with Multi-Modal Delay
Randomization [9.014518402531875]
We train the RL policy for end-to-end control in a physical simulator without any predefined controller or reference motion.
We demonstrate that the robot can smoothly maneuver at high speed and avoid obstacles, showing significant improvement over the baselines.
arXiv Detail & Related papers (2021-09-29T16:48:05Z) - Learning Perceptual Locomotion on Uneven Terrains using Sparse Visual
Observations [75.60524561611008]
This work aims to exploit the use of sparse visual observations to achieve perceptual locomotion over a range of commonly seen bumps, ramps, and stairs in human-centred environments.
We first formulate the selection of minimal visual input that can represent the uneven surfaces of interest, and propose a learning framework that integrates such exteroceptive and proprioceptive data.
We validate the learned policy in tasks that require omnidirectional walking over flat ground and forward locomotion over terrains with obstacles, showing a high success rate.
arXiv Detail & Related papers (2021-09-28T20:25:10Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - RLOC: Terrain-Aware Legged Locomotion using Reinforcement Learning and
Optimal Control [6.669503016190925]
We present a unified model-based and data-driven approach for quadrupedal planning and control.
We map sensory information and desired base velocity commands into footstep plans using a reinforcement learning policy.
We train and evaluate our framework on a complex quadrupedal system, ANYmal B, and demonstrate transferability to a larger and heavier robot, ANYmal C, without requiring retraining.
arXiv Detail & Related papers (2020-12-05T18:30:23Z)