Fugu-MT Paper Translation (Summary): VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training

Paper summary: VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training

arxiv url: http://arxiv.org/abs/2606.04708v1
Date: Wed, 03 Jun 2026 10:38:45 GMT
Status: Translation complete
System update date: 2026-06-04 20:44:18.687426
Title: VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training
Title (reference translation): VISTA: Visual and Physical Adaptation of UMI Data for VLA Training
Authors: Siyuan Yang, Linzheng Guo, Ouyang Lu, Zhaxizhuoma, Daoran Zhang, Xinmiao Wang, Ting Xiao, Fangzheng Yan, Zhijun Chen, Yan Ding, Chao Yu, Chenjia Bai, Xuelong Li
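Abstract summary: Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation. VISTA is a framework that bridges this dual gap through three synergistic components. We release the physical-validation pipeline, UMI-VQA, validated trajectory data, and the pre-trained model for the community.
Reference score (independently computed attention metric): 52.05483137072975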
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation, yet leveraging UMI data to train large-scale Vision-Language-Action (VLA) models remains fundamentally challenging. We identify two critical mismatches: wrist-mounted fisheye views, with severe radial distortion and local gripper-centric perspectives, are out-of-distribution for pretrained VLMs; and human-collected trajectories frequently violate kinematic limits, incur collisions, or exceed controller bandwidth, teaching VLA policies physically infeasible actions. To address these challenges, we present VISTA, a framework that bridges this dual gap through three synergistic components. (i) UMI-VQA, the first large-scale VQA dataset tailored to wrist-mounted fisheye observations, aligns VLM representations to the distorted visual regime via auxiliary vision-language supervision. (ii) A systematic physical-validation pipeline performs a data-completeness pre-check and scores each valid trajectory for trajectory continuity, self-collision risk, and execution fidelity before it enters training. (iii) A two-stage co-training recipe jointly learns vision-language grounding on UMI-VQA and action prediction on validated trajectories. Our experiments show that incorporating UMI-VQA consistently improves downstream policy performance, and that physical-validation scores are strongly predictive of deployment success. On diverse simulation and real-world manipulation tasks, VISTA significantly outperforms strong baselines including $\pi_{0.5}$, LingBot-VLA, and Wall-X. We release the physical-validation pipeline, UMI-VQA, validated trajectory data, and the pre-trained model for the community.
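Abstract (reference translation): The Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation, but using UMI data to train large-scale Vision-Language-Action (VLA) models is fundamentally difficult. We identify two critical mismatches. First, wrist-mounted fisheye views, which have severe radial distortion and a local gripper-centric perspective, are out-of-distribution for pretrained VLMs. Second, human-collected trajectories frequently violate kinematic limits, cause collisions, or exceed controller bandwidth, which teaches VLA policies physically infeasible actions. To address these challenges, we propose VISTA, a framework that bridges this dual gap through three synergistic components. (i) UMI-VQA, the first large-scale VQA dataset designed for wrist-mounted fisheye observations, aligns VLM representations to the distorted visual regime through auxiliary vision-language supervision. (ii) A systematic physical-validation pipeline performs a data-completeness pre-check and scores each valid trajectory for trajectory continuity, self-collision risk, and execution fidelity before it enters training. (iii) A two-stage co-training recipe jointly learns vision-language grounding on UMI-VQA and action prediction on validated trajectories. Experiments show that incorporating UMI-VQA consistently improves downstream policy performance and that physical-validation scores strongly predict deployment success. On diverse simulation and real-world manipulation tasks, VISTA significantly outperforms strong baselines including $\pi_{0.5}$, LingBot-VLA, and Wall-X. We release the physical-validation pipeline, UMI-VQA, validated trajectory data, and the pre-trained model for the community.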
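
The abstract describes a validation step that first checks data completeness and then scores each valid trajectory before training. The sketch below illustrates that general idea for a single criterion, trajectory continuity. It is not the authors' pipeline: the function names, the position-only trajectory format, the speed-based continuity measure, and the `max_speed` threshold are all assumptions made for illustration. Self-collision risk and execution fidelity are omitted because they would require a robot model.

```python
import math
from typing import List, Optional, Sequence

# Hypothetical trajectory format: one (x, y, z) end-effector position in
# meters per sample, with a matching timestamp in seconds.
Position = Sequence[float]


def is_complete(positions: List[Position], timestamps: List[float]) -> bool:
    """Assumed completeness pre-check: matching lengths, at least two
    samples, no NaN coordinates, strictly increasing timestamps."""
    if len(positions) != len(timestamps) or len(positions) < 2:
        return False
    if any(math.isnan(v) for p in positions for v in p):
        return False
    return all(t1 > t0 for t0, t1 in zip(timestamps, timestamps[1:]))


def continuity_score(
    positions: List[Position], timestamps: List[float], max_speed: float = 1.0
) -> float:
    """Assumed continuity measure: the fraction of consecutive steps whose
    average speed stays at or below max_speed (m/s). 1.0 means no jumps."""
    within_limit = 0
    steps = len(positions) - 1
    for i in range(steps):
        dt = timestamps[i + 1] - timestamps[i]
        speed = math.dist(positions[i], positions[i + 1]) / dt
        if speed <= max_speed:
            within_limit += 1
    return within_limit / steps


def validate(
    positions: List[Position], timestamps: List[float], max_speed: float = 1.0
) -> Optional[float]:
    """Return None for incomplete data, otherwise the continuity score."""
    if not is_complete(positions, timestamps):
        return None
    return continuity_score(positions, timestamps, max_speed)


smooth = validate([(0, 0, 0), (0.01, 0, 0), (0.02, 0, 0)], [0.0, 0.1, 0.2])
jumpy = validate([(0, 0, 0), (0.01, 0, 0), (1.0, 0, 0)], [0.0, 0.1, 0.2])
broken = validate([(0, 0, 0), (0.01, 0, 0)], [0.0])
print(smooth, jumpy, broken)
```

A score like this could be used to filter or weight trajectories before training; how the paper combines its three scores is not stated in the abstract.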

Related papers