Fugu-MT Paper Translation (Abstract): TacVLA: Contact-Aware Tactile Fusion for Robust Vision-Language-Action Manipulation

Paper overview: TacVLA: Contact-Aware Tactile Fusion for Robust Vision-Language-Action Manipulation

arxiv url: http://arxiv.org/abs/2603.12665v1
Date: Fri, 13 Mar 2026 05:20:41 GMT
Status: Translation complete
Last updated in system: 2026-03-21 18:33:56.746792
Title: TacVLA: Contact-Aware Tactile Fusion for Robust Vision-Language-Action Manipulation
Title (reference translation): TacVLA: Contact-Aware Tactile Fusion for Robust Vision-Language-Action Manipulation
Authors: Kaidi Zhang, Heng Zhang, Zhengtong Xu, Zhiyuan Zhang, Md Rakibul Islam Prince, Xiang Li, Xiaojing Han, Yuhao Zhou, Arash Ajoudani, Yu She
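Abstract summary: Vision-Language-Action (VLA) models have demonstrated significant advantages in robotic manipulation. This paper proposes TacVLA, a fine-tuned VLA model that incorporates tactile modalities into the transformer-based policy. It introduces a contact-aware gating mechanism that selectively activates tactile tokens only when contact is detected.
Reference score (independently computed attention score): 27.000763540977506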
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models have demonstrated significant advantages in robotic manipulation. However, their reliance on vision and language often leads to suboptimal performance in tasks involving visual occlusion, fine-grained manipulation, and physical contact. To address these challenges, we propose TacVLA, a fine-tuned VLA model that incorporates tactile modalities into the transformer-based policy to enhance fine-grained manipulation capabilities. Specifically, we introduce a contact-aware gating mechanism that selectively activates tactile tokens only when contact is detected, enabling adaptive multimodal fusion while avoiding irrelevant tactile interference. The fused visual, language, and tactile tokens are jointly processed within the transformer architecture to strengthen cross-modal grounding during contact-rich interaction. Extensive experiments on constraint-locked disassembly, in-box picking, and robustness evaluations demonstrate that our model outperforms baselines, improving the success rate by an average of 20% in disassembly and 60% in in-box picking, with a 2.1x improvement in scenarios with visual occlusion. Videos are available at https://sites.google.com/view/tacvla and code will be released.
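Abstract (reference translation): Vision-Language-Action (VLA) models have shown significant advantages in robotic manipulation. However, their reliance on vision and language often leads to suboptimal performance on tasks involving visual occlusion, fine-grained manipulation, and physical contact. To address these challenges, we propose TacVLA, a fine-tuned VLA model that incorporates tactile modalities into the transformer-based policy. Specifically, we introduce a contact-aware gating mechanism that selectively activates tactile tokens only when contact is detected, achieving adaptive multimodal fusion while avoiding irrelevant tactile interference. The fused visual, language, and tactile tokens are processed jointly within the transformer architecture, strengthening cross-modal grounding during contact-rich interaction. Extensive experiments on constraint-locked disassembly, in-box picking, and robustness evaluations show that our model outperforms the baselines, improving the success rate by an average of 20% in disassembly and 60% in in-box picking, with a 2.1x improvement in scenarios with visual occlusion. Videos are available at https://sites.google.com/view/tacvla.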
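
The abstract describes contact-aware gating only at a high level, so the following is a minimal sketch of one possible reading, not the authors' implementation. It assumes that contact is detected by thresholding raw tactile readings and that "activating" tactile tokens means marking them as attendable in a fixed-length fused sequence. The names detect_contact and gate_and_fuse, the threshold value, and the mask convention (True means the token is attended to) are all hypothetical.

```python
from typing import List, Sequence, Tuple

Token = List[float]  # one embedding vector per token (illustrative only)


def detect_contact(tactile_reading: Sequence[float], threshold: float = 0.1) -> bool:
    """Hypothetical contact detector: report contact when any sensor
    reading exceeds the threshold in magnitude."""
    return any(abs(value) > threshold for value in tactile_reading)


def gate_and_fuse(
    visual: Sequence[Token],
    language: Sequence[Token],
    tactile: Sequence[Token],
    in_contact: bool,
) -> Tuple[List[Token], List[bool]]:
    """Concatenate the three token groups and build an attention mask.

    Visual and language tokens are always attendable. Tactile tokens are
    attendable only while contact is detected, so they cannot influence
    the policy during free-space motion.
    """
    tokens = list(visual) + list(language) + list(tactile)
    mask = [True] * (len(visual) + len(language)) + [in_contact] * len(tactile)
    return tokens, mask


# Usage: no contact, so the single tactile token is masked out.
visual_tokens = [[0.1, 0.2], [0.3, 0.4]]
language_tokens = [[0.5, 0.6]]
tactile_tokens = [[0.7, 0.8]]
tokens, mask = gate_and_fuse(
    visual_tokens, language_tokens, tactile_tokens, detect_contact([0.0, 0.02])
)
print(mask)  # [True, True, True, False]
```

Keeping the sequence length fixed and gating through a mask is a design choice made here for simplicity; dropping the tactile tokens entirely when there is no contact would be an equally plausible interpretation of the abstract.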

Related papers