Fugu-MT Paper Translation (Abstract): XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

Paper overview: XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

arxiv url: http://arxiv.org/abs/2604.18484v1
Date: Mon, 20 Apr 2026 16:37:16 GMT
Status: Translation complete
System update date: 2026-04-21 21:52:52.998265
Title: XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
Title (reference translation): XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
Authors: Kangan Qian, ChuChu Xie, Yang Zhong, Jingrui Pang, Siwen Jiao, Sicong Jiang, Zilin Huang, Yunlong Wang, Kun Jiang, Mengmeng Yang, Hao Ye, Guanghao Zhang, Hangjun Ye, Guang Chen, Long Chen, Diange Yang
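Abstract summary: Cloud pipelines rely on generic vision-language models (VLMs) that lack geometric reasoning and domain semantics. We propose XEmbodied, a cloud-side foundation model that endows VLMs with intrinsic 3D geometric awareness. XEmbodied preserves general capabilities while demonstrating robust performance across 18 public benchmarks.
Reference score (independently computed attention score): 26.90783926543698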
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language-Action (VLA) models drive next-generation autonomous systems, but training them requires scalable, high-quality annotations from complex environments. Current cloud pipelines rely on generic vision-language models (VLMs) that lack geometric reasoning and domain semantics due to their 2D image-text pretraining. To address this mismatch, we propose XEmbodied, a cloud-side foundation model that endows VLMs with intrinsic 3D geometric awareness and interaction with physical cues (e.g., occupancy grids, 3D boxes). Instead of treating geometry as auxiliary input, XEmbodied integrates geometric representations via a structured 3D Adapter and distills physical signals into context tokens using an Efficient Image-Embodied Adapter. Through progressive domain curriculum and reinforcement learning post-training, XEmbodied preserves general capabilities while demonstrating robust performance across 18 public benchmarks. It significantly improves spatial reasoning, traffic semantics, embodied affordance, and out-of-distribution generalization for large-scale scenario mining and embodied VQA.
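Abstract (reference translation): Vision-Language-Action (VLA) models drive next-generation autonomous systems, but training them requires scalable, high-quality annotations from complex environments. Current cloud pipelines rely on generic vision-language models (VLMs) that lack geometric reasoning and domain semantics. To address this mismatch, we propose XEmbodied, a cloud-side foundation model that endows VLMs with intrinsic geometric awareness and physical cues (e.g., occupancy grids, 3D boxes). Instead of treating geometry as auxiliary input, XEmbodied integrates geometric representations via a structured 3D Adapter and distills physical signals into context tokens using an Efficient Image-Embodied Adapter. Through a progressive domain curriculum and reinforcement learning post-training, XEmbodied preserves general capabilities while demonstrating robust performance across 18 public benchmarks. It significantly improves spatial reasoning, traffic semantics, embodied affordance, and out-of-distribution generalization for large-scale scenario mining and VQA.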

Related papers list