FuguReport

RLDX-1 Technical Report

Authors Dongyoung Kim, Huiwon Jang, Myungkyu Koo, Suhyeok Jang, Taeyoung Kim, Beomjun Kim, Byungjun Yoon, Changsung Jang, Daewon Choi, Dongsu Han, Donguk Lee, Heeseung Kwon, Hojin Jeon, Jaehyun Kang, Jaekyoung Bae, Jihyuk Lee, Jimin Lee, John Won, Joonwoo Ahn, Junhyeong Park, Junyoung Sung, Kyungmin Lee, Minseong Han, Minsung Yoon, Sejune Joo, Seonil Son, Seungcheol Park, Seunggeun Cho, Seungjun Moon, Seungku Kim, Yonghoon Dong, Yongjin Cho, Youngchan Kim, Chang Hwan Kim, Dohyeon Kim, Hazel Lee, Heecheol Kim, Hensen Ahn, Hyungkyu Ryu, Hyunsoo Choi, Hyunsoo Shin, Jaeheon Jung, Jaewoo Kim, Jinwook Kim, Joochul Chang, Joonsoo Kim, Junghun Park, Jungwoo Park, Junho Cho, Junhyeok Park, Junwon Lee, Kangwook Lee, Kwanghoon Kim, Kyoungwhan Choe, Manoj Bhadu, Nayoung Oh, Sangjun Kim, Sangwoo Kim, Seunghoon Shim, Seunghyun Kim, Seungjun Lee, Seungyup Ka, Sungryol Yang, Wook Jung, Yashu Shukla, Yeonjae Lee, Yeonwoo Bae, Jinwoo Shin
Affiliations Korea Advanced Institute of Science and Technology / RLWRLD
Categories Method / Robot Policy / General dexterous manipulation policy, Application / Robotics Control / Dexterous robot manipulation, Evaluation / Model Evaluation / Performance comparison with vision-language action models
License CC BY 4.0

Abstract Overview

RLDX-1 is a general-purpose vision-language-action (VLA) policy for dexterous manipulation built on the Multi-Stream Action Transformer (MSAT). It targets three capabilities that prior VLAs lack (motion awareness, long-term memory, and physical sensing) by combining a temporally aware VLM with a motion module, an explicit memory module, and a physics stream for torque and tactile inputs. The broader framework includes synthetic data generation with motion-consistency filtering, a three-stage training pipeline (pre-training, embodiment-specific mid-training, and task-specific post-training), and inference optimizations for real-time deployment. The paper evaluates the model on simulation benchmarks and on real-world humanoid (OpenArm, ALLEX) and single-arm (Franka Research 3) platforms, comparing against recent VLA baselines including π0.5 and GR00T N1.6.

Novelty

The paper's main novelty is a unified VLA architecture (MSAT) that explicitly integrates motion awareness, long-term memory, and physical sensing within one action transformer through modality-specific streams coupled by joint self-attention, rather than treating these as isolated add-ons. It also combines this architecture with a synthetic data curation pipeline featuring motion-consistency filtering, a three-stage training procedure, and deployment-oriented inference optimization in a single end-to-end robotics framework.
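One way to picture modality-specific streams coupled by joint self-attention is the minimal pure-Python sketch below. It is an illustration under stated assumptions, not the MSAT implementation: the stream names follow this summary (cognition, action, physics), but the token counts, the width `d`, the per-stream query/key/value matrices, and the single-head layout without MLPs or normalization are hypothetical choices made for the example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Small random matrix, used here as stand-in tokens and weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def joint_self_attention(streams, weights):
    """Project each stream with its own weights, then attend over all tokens jointly.

    streams: dict of stream name -> token matrix (n_tokens x d)
    weights: dict of stream name -> {"q", "k", "v"} matrices (d x d), one set per stream
    """
    queries, keys, values, owners = [], [], [], []
    for name, tokens in streams.items():
        w = weights[name]  # modality-specific projections (assumed design)
        queries += matmul(tokens, w["q"])
        keys += matmul(tokens, w["k"])
        values += matmul(tokens, w["v"])
        owners += [name] * len(tokens)

    scale = 1.0 / math.sqrt(len(keys[0]))
    # Every token scores against the tokens of all streams: this is the coupling step.
    scores = [[scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys] for q in queries]
    attn = [softmax(row) for row in scores]
    mixed = matmul(attn, values)

    # Route the mixed tokens back to the stream they came from.
    out = {name: [] for name in streams}
    for owner, vec in zip(owners, mixed):
        out[owner].append(vec)
    return out, attn


rng = random.Random(0)
d = 4  # hypothetical token width
streams = {
    "cognition": rand_matrix(3, d, rng),  # e.g. vision-language tokens
    "action": rand_matrix(2, d, rng),     # e.g. action-chunk tokens
    "physics": rand_matrix(2, d, rng),    # e.g. torque/tactile tokens
}
weights = {
    name: {p: rand_matrix(d, d, rng) for p in ("q", "k", "v")} for name in streams
}
out, attn = joint_self_attention(streams, weights)
print({name: len(tokens) for name, tokens in out.items()})
```

In this sketch each stream keeps its own parameters and its own token count, and information crosses between streams only through the shared attention matrix, which is one plausible reading of "streams coupled by joint self-attention" as opposed to bolting separate modules onto a single-stream policy.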

Results

Across simulation benchmarks, RLDX-1 consistently outperforms reported VLA baselines, including on challenging settings such as GR-1 Tabletop (58.7% vs. 47.6% for GR00T N1.6) and RoboCasa365 (32.1% average vs. 26.9% for the next best). In real-world ALLEX humanoid experiments, RLDX-1 reaches 86.8% overall success while π0.5 and GR00T N1.6 achieve around 40%, and it achieves 91.7% on the Object-in-Box Selection task requiring long-term memory. Inference latency is reduced from 71.2 ms to 43.7 ms (1.63× speedup) for the all-modality model through static graph conversion and kernel optimization.
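The latency figures above can be checked with a few lines of arithmetic. The sketch below uses only the two reported latencies; the derived quantities (milliseconds saved, relative reduction, and the upper bound on forward passes per second) are computed here for illustration and are not numbers quoted from the report. The function name `speedup` is a hypothetical helper.

```python
def speedup(baseline_ms, optimized_ms):
    """Speedup factor of the optimized latency relative to the baseline."""
    return baseline_ms / optimized_ms


baseline_ms = 71.2   # reported latency before optimization
optimized_ms = 43.7  # reported latency after optimization

factor = speedup(baseline_ms, optimized_ms)
saved_ms = baseline_ms - optimized_ms
reduction_pct = 100.0 * saved_ms / baseline_ms

# Upper bound on forward passes per second if inference were the only cost (assumption).
rate_before = 1000.0 / baseline_ms
rate_after = 1000.0 / optimized_ms

print(f"speedup: {factor:.2f}x")
print(f"saved: {saved_ms:.1f} ms ({reduction_pct:.1f}% lower latency)")
print(f"max forward passes per second: {rate_before:.1f} -> {rate_after:.1f}")
```

The ratio 71.2 / 43.7 rounds to 1.63, which agrees with the reported speedup.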

Key Points

  1. RLDX-1 centers on MSAT, which processes heterogeneous modalities through modality-specific streams (cognition, action, physics) coupled via joint self-attention, supporting action generation from visual, language, proprioceptive, memory, and physical-sensing inputs.
  2. The training recipe combines large-scale public robot data, in-house humanoid and FR3 data, and filtered synthetic robot videos, with pre-training, embodiment-specific mid-training, and task-specific post-training including optional reinforcement learning via a text-based VLM critic.
  3. Empirical results in both simulation and real-world benchmarks show that RLDX-1 is especially strong relative to recent VLAs on tasks requiring motion awareness, long-term memory, or physical feedback, while also outperforming baselines on standard versatile-intelligence benchmarks.
