Fugu-MT Paper Translation (Abstract): Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model

Paper abstract: Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model

arxiv url: http://arxiv.org/abs/2605.14950v1
Date: Thu, 14 May 2026 15:21:36 GMT
Status: translation complete
Last updated in system: 2026-05-15 21:45:34.909242
Title: Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model
Title (reference translation): Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model
Authors: Tao Lin, Yuxin Du, Jiting Liu, Nuobei Zhu, Yunhe Li, Yuqian Fu, Yinxinyu Chen, Hongyi Cai, Zewei Ye, Bing Cheng, Kai Ye, Yiran Mao, Yilei Zhong, MingKang Dong, Junchi Yan, Gen Li, Bo Zhao
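Abstract summary: Vision-Language-Action models promise to unify perception, language grounding, and action generation. Current VLA models rely heavily on 2D visual representations that lack depth information and detailed spatial relationships. Evo-Depth is a lightweight depth-enhanced VLA framework that strengthens spatially grounded manipulation.
Reference score (independently computed attention score): 43.14057937517956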
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios requiring precise spatial understanding, as current VLA models primarily rely on 2D visual representations that lack depth information and detailed spatial relationships. While recent approaches incorporate explicit 3D inputs such as depth maps or point clouds to address this issue, they often increase system complexity, require additional sensors, and remain vulnerable to sensing noise and reconstruction errors. Another line of work explores implicit 3D-aware spatial modeling directly from RGB observations without extra sensors, but it often relies on large geometry foundation models, resulting in higher training and deployment costs. To address these challenges, we propose Evo-Depth, a lightweight depth-enhanced VLA framework that enhances spatially grounded manipulation without relying on additional sensing hardware or compromising deployment efficiency. Evo-Depth employs a lightweight Implicit Depth Encoding Module to extract compact depth features from multi-view RGB images. These features are incorporated into vision-language representations through a Spatial Enhancement Module via depth-aware modulation, enabling efficient spatial-semantic enhancement. A Progressive Alignment Training strategy is further introduced to align the resulting depth-enhanced representations with downstream action learning. With only 0.9B parameters, Evo-Depth achieves superior performance across four simulation benchmarks. In real-world experiments, Evo-Depth attains the highest average success rate while also exhibiting the smallest model size, lowest GPU memory usage, and highest inference frequency among compared methods.
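Abstract (reference translation): Vision-Language-Action models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, because current VLA models rely mainly on 2D visual representations that lack depth information and detailed spatial relationships, they often struggle in scenarios that require precise spatial understanding. Recent approaches address this by incorporating explicit 3D inputs such as depth maps or point clouds, but these increase system complexity, require additional sensors, and remain vulnerable to sensing noise and reconstruction errors. Another line of work explores implicit 3D-aware spatial modeling directly from RGB observations without extra sensors, but it often depends on large geometry foundation models, which raises training and deployment costs. To address these challenges, the authors propose Evo-Depth, a lightweight depth-enhanced VLA framework that strengthens spatially grounded manipulation without relying on additional sensing hardware or compromising deployment efficiency. Evo-Depth uses a lightweight Implicit Depth Encoding Module to extract compact depth features from multi-view RGB images. These features are incorporated into vision-language representations by a Spatial Enhancement Module through depth-aware modulation, enabling efficient spatial-semantic enhancement. A Progressive Alignment Training strategy is further introduced to align the resulting depth-enhanced representations with downstream action learning. With only 0.9B parameters, Evo-Depth achieves superior performance across four simulation benchmarks. In real-world experiments, Evo-Depth attains the highest average success rate while also showing the smallest model size, the lowest GPU memory usage, and the highest inference frequency among the compared methods.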
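
The abstract says depth features are merged into vision-language representations "via depth-aware modulation" but does not give the formula. The sketch below illustrates one common way such modulation can be done, a FiLM-style per-channel scale and shift. This form is an assumption, not the paper's confirmed design, and the names depth_aware_modulation, matvec, w_gamma, and w_beta are hypothetical.

```python
def matvec(vec, mat):
    """Multiply a row vector (length d) by a d x n matrix; returns length n."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]


def depth_aware_modulation(vl_tokens, depth_feat, w_gamma, w_beta):
    """FiLM-style modulation (assumed form, not taken from the paper).

    vl_tokens:  list of token feature vectors, each of length dim
    depth_feat: compact depth feature vector of length depth_dim
    w_gamma, w_beta: hypothetical depth_dim x dim projection matrices
    """
    gamma = matvec(depth_feat, w_gamma)  # per-channel scale offset
    beta = matvec(depth_feat, w_beta)    # per-channel shift
    # Residual scaling: all-zero projections leave the tokens unchanged.
    return [[x * (1.0 + g) + b for x, g, b in zip(tok, gamma, beta)]
            for tok in vl_tokens]


# Toy inputs: two tokens with two channels, and a two-dimensional depth feature.
tokens = [[1.0, 2.0], [3.0, 4.0]]
depth = [1.0, 0.5]
w_gamma = [[0.5, 0.0], [0.0, 1.0]]
w_beta = [[1.0, 0.0], [0.0, 0.0]]

modulated = depth_aware_modulation(tokens, depth, w_gamma, w_beta)
print(modulated)
```

In this sketch the depth feature never replaces the vision-language tokens; it only rescales and shifts their channels, so the token count and feature width stay the same.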

Related papers