RLDX-1 Technical Report
Abstract Overview
RLDX-1 is a general-purpose vision-language-action (VLA) policy for dexterous manipulation built on the Multi-Stream Action Transformer (MSAT). It addresses capabilities that prior VLAs lack—motion awareness, long-term memory, and physical sensing—by combining a temporally aware VLM with a motion module, an explicit memory module, and a physics stream for torque and tactile inputs. The broader framework includes synthetic data generation with motion-consistency filtering, a three-stage training pipeline (pre-training, embodiment-specific mid-training, and task-specific post-training), and inference optimizations for real-time deployment. The paper evaluates the model across simulation benchmarks and real-world humanoid (OpenArm, ALLEX) and single-arm (Franka Research 3) platforms, comparing against recent VLA baselines including π0.5 and GR00T N1.6.
Novelty
The paper's main novelty is a unified VLA architecture (MSAT) that explicitly integrates motion awareness, long-term memory, and physical sensing within one action transformer through modality-specific streams coupled by joint self-attention, rather than treating these as isolated add-ons. It also combines this architecture with a synthetic data curation pipeline featuring motion-consistency filtering, a three-stage training procedure, and deployment-oriented inference optimization in a single end-to-end robotics framework.
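To make "modality-specific streams coupled by joint self-attention" concrete, the following is a minimal pure-Python sketch of one common way to realize that idea. It is not the MSAT implementation: the summary does not give layer details, so the per-stream query/key/value projections, the single attention head, the token counts, and all names (`make_stream_params`, `joint_self_attention`) are assumptions chosen for illustration. Each stream keeps its own projection weights, while a single attention pass runs over the tokens of all streams together.

```python
import math
import random


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_stream_params(rng, dim):
    """Hypothetical layout: every stream owns separate q/k/v projections."""
    scale = 1.0 / math.sqrt(dim)
    return {
        name: [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(dim)]
        for name in ("q", "k", "v")
    }


def joint_self_attention(streams, params):
    """Attend over the tokens of all streams at once.

    streams: dict stream name -> list of token vectors
    params:  dict stream name -> that stream's q/k/v projection matrices
    Returns a dict with the same keys and token counts as `streams`.
    """
    queries, keys, values, owners = [], [], [], []
    for name, tokens in streams.items():
        p = params[name]  # modality-specific weights
        for tok in tokens:
            queries.append(matvec(p["q"], tok))
            keys.append(matvec(p["k"], tok))
            values.append(matvec(p["v"], tok))
            owners.append(name)

    dim = len(queries[0])
    out = {name: [] for name in streams}
    for q, owner in zip(queries, owners):
        # Scores run over tokens from every stream: this is the coupling step.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        out[owner].append(mixed)  # route the result back to its own stream
    return out


rng = random.Random(0)
DIM = 4
streams = {
    "cognition": [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(3)],
    "action": [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(2)],
    "physics": [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(1)],
}
params = {name: make_stream_params(rng, DIM) for name in streams}
out = joint_self_attention(streams, params)
```

In this sketch, changing a physics token changes the action stream's output even though the two streams share no weights. That cross-stream influence is what distinguishes joint attention from running isolated per-modality modules.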
Results
Across simulation benchmarks, RLDX-1 consistently outperforms the reported VLA baselines, including on challenging settings such as GR-1 Tabletop (58.7% vs. 47.6% for GR00T N1.6) and RoboCasa365 (32.1% average vs. 26.9% for the next-best baseline). In real-world ALLEX humanoid experiments, RLDX-1 reaches 86.8% overall success, while π0.5 and GR00T N1.6 each achieve around 40%; it also achieves 91.7% on the Object-in-Box Selection task, which requires long-term memory. For the all-modality model, static graph conversion and kernel optimization reduce inference latency from 71.2 ms to 43.7 ms (a 1.63× speedup).
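The headline figures above can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this summary. The "maximum policy rate" is an illustrative upper bound that assumes one sequential inference call per policy step with no other overhead, which is an assumption rather than something the report states.

```python
# Latency figures quoted for the all-modality model (milliseconds).
baseline_ms = 71.2
optimized_ms = 43.7

speedup = baseline_ms / optimized_ms          # ratio of old to new latency
saved_ms = baseline_ms - optimized_ms         # absolute time saved per call
relative_reduction = saved_ms / baseline_ms   # fraction of latency removed

# Illustrative upper bound on policy calls per second (assumption: one
# blocking inference per step and nothing else on the critical path).
max_rate_before_hz = 1000.0 / baseline_ms
max_rate_after_hz = 1000.0 / optimized_ms

# Absolute gaps on the simulation benchmarks, in percentage points.
gr1_gap_pts = 58.7 - 47.6        # GR-1 Tabletop, RLDX-1 vs. GR00T N1.6
robocasa_gap_pts = 32.1 - 26.9   # RoboCasa365 average vs. next best

print(f"speedup: {speedup:.2f}x, saved: {saved_ms:.1f} ms "
      f"({relative_reduction:.1%} lower latency)")
print(f"max policy rate: {max_rate_before_hz:.1f} Hz -> {max_rate_after_hz:.1f} Hz")
print(f"GR-1 Tabletop gap: {gr1_gap_pts:.1f} pts, "
      f"RoboCasa365 gap: {robocasa_gap_pts:.1f} pts")
```

The benchmark gaps are absolute differences in success rate (percentage points), not relative improvements.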
Key Points
- RLDX-1 centers on MSAT, which processes heterogeneous modalities through modality-specific streams (cognition, action, physics) coupled via joint self-attention, supporting action generation from visual, language, proprioceptive, memory, and physical-sensing inputs.
- The training recipe combines large-scale public robot data, in-house humanoid and Franka Research 3 (FR3) data, and filtered synthetic robot videos. Training proceeds through pre-training, embodiment-specific mid-training, and task-specific post-training, with optional reinforcement learning via a text-based VLM critic.
- Empirical results on both simulation and real-world benchmarks show that RLDX-1's advantage over recent VLAs is largest on tasks requiring motion awareness, long-term memory, or physical feedback, and that it also outperforms baselines on standard versatile-intelligence benchmarks.
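The three-stage recipe in the second key point can be written down as a small schedule structure. This is a hypothetical sketch, not the authors' training code. The `Stage` fields, the `build_schedule` helper, and the convention that each stage starts from the previous stage's checkpoint are assumptions; the summary states only the stage order, each stage's specialization, and that reinforcement learning is optional in post-training.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    name: str
    scope: str                    # what the stage specializes the policy to
    init_from: Optional[str]      # stage whose checkpoint seeds this one (assumed)
    use_rl_critic: bool = False   # optional RL with a text-based VLM critic


def build_schedule(use_rl_critic: bool = False) -> List[Stage]:
    """Return the stages in the order given in the summary."""
    return [
        # Scope label for pre-training is an assumption; the summary names
        # only the stage itself.
        Stage("pre-training", "broad", None),
        Stage("mid-training", "embodiment-specific", "pre-training"),
        Stage("post-training", "task-specific", "mid-training", use_rl_critic),
    ]


schedule = build_schedule(use_rl_critic=True)
for stage in schedule:
    print(stage.name, "<-", stage.init_from, "| RL critic:", stage.use_rl_critic)
```

Keeping the reinforcement-learning option as a flag on the post-training stage, rather than as a fourth stage, mirrors the summary's wording that it is an optional part of post-training.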