Fugu-MT Paper Translation (Abstract): Towards Human-Like Manipulation through RL-Augmented Teleoperation and Mixture-of-Dexterous-Experts VLA

Paper overview: Towards Human-Like Manipulation through RL-Augmented Teleoperation and Mixture-of-Dexterous-Experts VLA

arxiv url: http://arxiv.org/abs/2603.08122v1
Date: Mon, 09 Mar 2026 09:02:30 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:42.092524
Title: Towards Human-Like Manipulation through RL-Augmented Teleoperation and Mixture-of-Dexterous-Experts VLA
Authors: Tutian Tang, Xingyu Ji, Wanli Xing, Ce Hao, Wenqiang Xu, Lin Shao, Cewu Lu, Qiaojun Yu, Jiangmiao Pang, Kaifeng Zhang
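Abstract summary: This paper introduces IMCopilot, a shared-autonomy assistant that simplifies teleoperation data collection. We propose MoDE-VLA, an architecture that seamlessly integrates heterogeneous force and tactile modalities into a pretrained VLA backbone. We validate the approach on four tasks of escalating complexity, demonstrating a twofold success-rate improvement over the baseline on dexterous contact-rich tasks.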
Reference score (independently computed attention score): 62.16042475700567
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While Vision-Language-Action (VLA) models have demonstrated remarkable success in robotic manipulation, their application has largely been confined to low-degree-of-freedom end-effectors performing simple, vision-guided pick-and-place tasks. Extending these models to human-like, bimanual dexterous manipulation, specifically contact-rich in-hand operations, introduces critical challenges in high-fidelity data acquisition, multi-skill learning, and multimodal sensory fusion. In this paper, we propose an integrated framework to address these bottlenecks, built upon two components. First, we introduce IMCopilot (In-hand Manipulation Copilot), a suite of reinforcement learning-trained atomic skills that plays a dual role: it acts as a shared-autonomy assistant to simplify teleoperation data collection, and it serves as a callable low-level execution primitive for the VLA. Second, we present MoDE-VLA (Mixture-of-Dexterous-Experts VLA), an architecture that seamlessly integrates heterogeneous force and tactile modalities into a pretrained VLA backbone. By utilizing a residual injection mechanism, MoDE-VLA enables contact-aware refinement without degrading the model's pretrained knowledge. We validate our approach on four tasks of escalating complexity, demonstrating a twofold success-rate improvement over the baseline in dexterous contact-rich tasks.

Related papers