Fugu-MT Paper Translation (Summary): VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

Paper summary: VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

arxiv url: http://arxiv.org/abs/2603.23481v1
Date: Tue, 24 Mar 2026 17:45:06 GMT
Status: Translation complete
System update date: 2026-03-25 19:53:37.623009
Title: VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs
Title (reference translation): VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs
Authors: Haoran Yuan, Weigang Yi, Zhenyu Zhang, Wendi Chen, Yuchen Mo, Jiashi Yin, Xinzhuo Li, Xiangyu Zeng, Chuan Wen, Cewu Lu, Katherine Driggs-Campbell, Ismini Lourentzou
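Abstract summary: Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence. This paper introduces the Video-Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a grounding signal. VTAM augments a pretrained video transformer with tactile streams via lightweight modality-transfer finetuning.
Reference score (independently computed attention score): 47.982092015932444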
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via a lightweight modality transfer finetuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross-modal attention, preventing visual latent dominance in the action model. VTAM demonstrates superior performance in contact-rich manipulation, maintaining a robust success rate of 90 percent on average. In challenging scenarios such as potato chip pick-and-place requiring high-fidelity force awareness, VTAM outperforms the pi 0.5 baseline by 80 percent. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, providing a scalable approach to physically grounded embodied foundation models.
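Abstract (reference translation): Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Such models perform strongly on long-horizon tasks through visual reasoning, but they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via lightweight modality-transfer finetuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross-modal attention, preventing visual latent dominance in the action model. VTAM demonstrates superior performance in contact-rich manipulation, maintaining a robust average success rate of 90 percent. In challenging scenarios such as potato chip pick-and-place, which requires high-fidelity force awareness, VTAM outperforms the pi 0.5 baseline by 80 percent. This work shows that integrating tactile feedback is essential for correcting visual estimation errors in world action models.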

Related papers