Fugu-MT Paper Translation (Abstract): ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation

Paper overview: ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation

arxiv url: http://arxiv.org/abs/2601.08325v1
Date: Tue, 13 Jan 2026 08:29:07 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:40.763167
Title: ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation
Title (reference translation): ActiveVLA: Active Perception Injection into Vision-Language-Action Models for Precise 3D Robot Manipulation
Authors: Zhenyang Liu, Yongchong Gu, Yikai Wang, Xiangyang Xue, Yanwei Fu
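Abstract summary: ActiveVLA is a vision-language-action framework that enables robots to perform fine-grained manipulation with high precision. We show that ActiveVLA achieves high-precision 3D manipulation and outperforms state-of-the-art baselines on three simulation benchmarks.
Reference score (independently computed attention score): 52.94334113271359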
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in robot manipulation have leveraged pre-trained vision-language models (VLMs) and explored integrating 3D spatial signals into these models for effective action prediction, giving rise to the promising vision-language-action (VLA) paradigm. However, most existing approaches overlook the importance of active perception: they typically rely on static, wrist-mounted cameras that provide an end-effector-centric viewpoint. As a result, these models are unable to adaptively select optimal viewpoints or resolutions during task execution, which significantly limits their performance in long-horizon tasks and fine-grained manipulation scenarios. To address these limitations, we propose ActiveVLA, a novel vision-language-action framework that empowers robots with active perception capabilities for high-precision, fine-grained manipulation. ActiveVLA adopts a coarse-to-fine paradigm, dividing the process into two stages: (1) Critical region localization. ActiveVLA projects 3D inputs onto multi-view 2D projections, identifies critical 3D regions, and supports dynamic spatial awareness. (2) Active perception optimization. Drawing on the localized critical regions, ActiveVLA uses an active view selection strategy to choose optimal viewpoints. These viewpoints aim to maximize amodal relevance and diversity while minimizing occlusions. Additionally, ActiveVLA applies a 3D zoom-in to improve resolution in key areas. Together, these steps enable finer-grained active perception for precise manipulation. Extensive experiments demonstrate that ActiveVLA achieves precise 3D manipulation and outperforms state-of-the-art baselines on three simulation benchmarks. Moreover, ActiveVLA transfers seamlessly to real-world scenarios, enabling robots to learn high-precision tasks in complex environments.
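Abstract (reference translation): Recent progress in robot manipulation has leveraged pre-trained vision-language models (VLMs) and integrated 3D spatial signals into these models for effective action prediction, producing the promising vision-language-action (VLA) paradigm. However, most existing approaches neglect the importance of active perception: they typically rely on static, wrist-mounted cameras that provide an end-effector-centric viewpoint. As a result, these models cannot adaptively select optimal viewpoints or resolutions during task execution, which significantly limits their performance in long-horizon tasks and fine-grained manipulation scenarios. To address these limitations, we propose ActiveVLA, a vision-language-action framework that enables robots to perform high-precision, fine-grained manipulation. ActiveVLA adopts a coarse-to-fine paradigm and divides the process into two stages. (1) Critical region localization. ActiveVLA projects 3D inputs onto multi-view 2D projections, identifies critical 3D regions, and supports dynamic spatial awareness. (2) Active perception optimization. Based on the localized critical regions, ActiveVLA uses an active view selection strategy to choose optimal viewpoints. These viewpoints aim to maximize amodal relevance and diversity while minimizing occlusions. In addition, ActiveVLA applies a 3D zoom-in to improve resolution in key areas. Together, these steps enable finer-grained active perception for precise manipulation. Extensive experiments show that ActiveVLA achieves precise 3D manipulation and outperforms state-of-the-art baselines on three simulation benchmarks. Moreover, ActiveVLA transfers seamlessly to real-world scenarios, enabling robots to learn high-precision tasks in complex environments.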
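The abstract does not say how candidate viewpoints are scored, so the following is only an illustrative sketch of the general idea of choosing unoccluded, mutually diverse views around a critical region. It is not the authors' selection strategy. All names (`sphere_candidates`, `is_visible`, `select_views`) are hypothetical, occluders are assumed to be simple spheres, the critical region is reduced to a single target point, and amodal relevance is not modeled at all.

```python
import math


def sphere_candidates(n, radius, center):
    """Return n roughly evenly spaced camera positions on a sphere (Fibonacci lattice)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        theta = golden_angle * i
        points.append((center[0] + radius * r * math.cos(theta),
                       center[1] + radius * r * math.sin(theta),
                       center[2] + radius * z))
    return points


def segment_point_distance(a, b, p):
    """Shortest distance from point p to the line segment a-b."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    denom = sum(x * x for x in ab)
    t = 0.0 if denom == 0.0 else sum(ab[i] * ap[i] for i in range(3)) / denom
    t = max(0.0, min(1.0, t))
    closest = [a[i] + t * ab[i] for i in range(3)]
    return math.dist(closest, p)


def is_visible(camera, target, occluders):
    """Assumed occlusion test: the camera-target ray must miss every occluder sphere."""
    return all(segment_point_distance(camera, target, c) > r for c, r in occluders)


def view_angle(u, v, target):
    """Angle between two viewing directions; cameras must not coincide with the target."""
    du = [u[i] - target[i] for i in range(3)]
    dv = [v[i] - target[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(du, dv))
    norm = math.hypot(*du) * math.hypot(*dv)
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def select_views(candidates, target, occluders, k):
    """Drop occluded candidates, then greedily pick up to k views that are far apart in angle."""
    visible = [c for c in candidates if is_visible(c, target, occluders)]
    selected = []
    while visible and len(selected) < k:
        def spread(c):
            # Smallest angle to any view already chosen (0.0 for the first pick).
            return min((view_angle(c, s, target) for s in selected), default=0.0)
        best = max(visible, key=spread)
        selected.append(best)
        visible.remove(best)
    return selected


# Usage: one spherical occluder sits directly above the target point.
target = (0.0, 0.0, 0.0)
occluders = [((0.0, 0.0, 1.0), 0.5)]
views = select_views(sphere_candidates(32, 2.0, target), target, occluders, k=3)
```

Treating occlusion as a hard filter and diversity as a greedy farthest-view criterion keeps the sketch short; a learned or weighted scoring function could replace either step.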

Related papers