Fugu-MT Paper Translation (Abstract): 4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration

Paper summary: 4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration

arxiv url: http://arxiv.org/abs/2506.22242v1
Date: Fri, 27 Jun 2025 14:09:29 GMT
Status: Translation complete
System update date: 2025-06-30 21:12:23.227863
Title: 4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration
Title (reference translation): 4D-VLA: Spatiotemporal Vision-Language-Action Prediction and Cross-Scene Calibration
Authors: Jiahui Zhang, Yurui Chen, Yueming Xu, Ze Huang, Yanpeng Zhou, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, Li Zhang
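Abstract summary: Existing methods typically model the dataset's action distribution using simple observations as inputs. We propose 4D-VLA, a novel approach that effectively integrates 4D information into the input to mitigate these sources of chaos. Our model consistently outperforms existing methods, demonstrating stronger spatial understanding and adaptability.
Reference score (independently computed attention metric): 31.111439909825627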
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Leveraging diverse robotic data for pretraining remains a critical challenge. Existing methods typically model the dataset's action distribution using simple observations as inputs. However, these inputs are often incomplete, resulting in a dispersed conditional action distribution, an issue we refer to as coordinate system chaos and state chaos. This inconsistency significantly hampers pretraining efficiency. To address this, we propose 4D-VLA, a novel approach that effectively integrates 4D information into the input to mitigate these sources of chaos. Our model introduces depth and temporal information into visual features with sequential RGB-D inputs, aligning the coordinate systems of the robot and the scene. This alignment endows the model with strong spatiotemporal reasoning capabilities while minimizing training overhead. Additionally, we introduce memory bank sampling, a frame sampling strategy designed to extract informative frames from historical images, further improving effectiveness and efficiency. Experimental results demonstrate that our pretraining method and architectural components substantially enhance model performance. In both simulated and real-world experiments, our model achieves a significant increase in success rate over OpenVLA. To further assess spatial perception and generalization to novel views, we introduce MV-Bench, a multi-view simulation benchmark. Our model consistently outperforms existing methods, demonstrating stronger spatial understanding and adaptability.
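Abstract (reference translation): Leveraging diverse robotic data for pretraining remains an important challenge. Existing methods typically model the dataset's action distribution using simple observations as inputs. However, these inputs are often incomplete, resulting in a dispersed conditional action distribution. This inconsistency significantly hampers pretraining efficiency. To address this, we propose 4D-VLA, which effectively integrates 4D information into the input to mitigate these sources of chaos. Our model introduces depth and temporal information into visual features with sequential RGB-D inputs, aligning the coordinate systems of the robot and the scene. This alignment gives the model strong spatiotemporal reasoning capabilities while minimizing training overhead. In addition, we introduce memory bank sampling, a frame sampling strategy that extracts informative frames from historical images. Experimental results confirm that the pretraining method and architectural components substantially improve model performance. In both simulated and real-world experiments, our model significantly improves the success rate over OpenVLA. To further assess spatial perception and generalization to novel views, we introduce MV-Bench, a multi-view simulation benchmark. Our model consistently outperforms existing methods, demonstrating stronger spatial understanding and adaptability.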

Related papers