Fugu-MT Paper Translation (Abstract): Tactile Modality Fusion for Vision-Language-Action Models

Paper summary: Tactile Modality Fusion for Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2603.14604v1
Date: Sun, 15 Mar 2026 20:57:51 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.914002
Title: Tactile Modality Fusion for Vision-Language-Action Models
Title (reference translation): Tactile Modality Fusion for Vision-Language-Action Models
Authors: Charlotte Morissette, Amin Abyaneh, Wei-Di Chang, Anas Houssaini, David Meger, Hsiu-Chin Lin, Jonathan Tremblay, Gregory Dudek
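Abstract summary: This paper proposes TacFiLM, a lightweight modality-fusion approach that integrates visual-tactile signals into vision-language-action (VLA) models. Results on insertion tasks show consistent improvements in success rate, direct insertion performance, completion time, and force stability across both in-distribution and out-of-distribution tasks.
Reference score (independently computed attention score): 22.788833830429766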
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose TacFiLM, a lightweight modality-fusion approach that integrates visual-tactile signals into vision-language-action (VLA) models. While recent advances in VLA models have introduced robot policies that are both generalizable and semantically grounded, these models mainly rely on vision-based perception. Vision alone, however, cannot capture the complex interaction dynamics that occur during contact-rich manipulation, including contact forces, surface friction, compliance, and shear. While recent attempts to integrate tactile signals into VLA models often increase complexity through token concatenation or large-scale pretraining, the heavy computational demands of behavioural models necessitate more lightweight fusion strategies. To address these challenges, TacFiLM outlines a post-training finetuning approach that conditions intermediate visual features on pretrained tactile representations using feature-wise linear modulation (FiLM). Experimental results on insertion tasks demonstrate consistent improvements in success rate, direct insertion performance, completion time, and force stability across both in-distribution and out-of-distribution tasks. Together, these results support our method as an effective approach to integrating tactile signals into VLA models, improving contact-rich manipulation behaviours.
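Abstract (reference translation): This paper proposes TacFiLM, a lightweight modality-fusion method that integrates visual-tactile signals into vision-language-action (VLA) models. Recent advances in VLA models have produced robot policies that are generalizable and semantically grounded, but these models rely mainly on vision-based perception. Vision alone cannot capture the complex interaction dynamics that arise during contact-rich manipulation, such as contact forces, surface friction, compliance, and shear. Recent attempts to integrate tactile signals into VLA models tend to increase complexity through token concatenation or large-scale pretraining, whereas the heavy computational demands of behavioural models call for more lightweight fusion strategies. To address these challenges, TacFiLM outlines a post-training finetuning approach that conditions intermediate visual features on pretrained tactile representations using feature-wise linear modulation (FiLM). Experimental results on insertion tasks show consistent improvements in success rate, direct insertion performance, completion time, and force stability across both in-distribution and out-of-distribution tasks. These results support the method as an effective way to integrate tactile signals into VLA models and to improve contact-rich manipulation behaviours.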
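
The FiLM conditioning named in the abstract can be illustrated with a minimal, dependency-free sketch. The abstract does not say which layers are modulated, which tactile encoder is used, or how the modulation is initialised, so the function names, shapes, and the identity initialisation below are all assumptions for illustration, not the authors' implementation. The sketch only shows the generic FiLM operation: a tactile embedding z is mapped to per-channel scales gamma(z) and shifts beta(z), which are applied to every visual token as gamma(z) * x + beta(z).

```python
import random


def linear(x, weight, bias):
    """Affine map for a single vector: returns weight @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def film(visual_tokens, tactile_embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Generic feature-wise linear modulation (FiLM).

    visual_tokens: list of tokens, each a list of C channel values.
    tactile_embedding: list of D values; assumed here to come from a
        pretrained tactile encoder (hypothetical, not shown).
    Returns gamma(z) * x + beta(z), applied channel-wise to each token.
    """
    gamma = linear(tactile_embedding, w_gamma, b_gamma)  # C per-channel scales
    beta = linear(tactile_embedding, w_beta, b_beta)     # C per-channel shifts
    return [[g * x + b for g, x, b in zip(gamma, token, beta)]
            for token in visual_tokens]


# Assumed identity initialisation (gamma = 1, beta = 0), so the visual
# features pass through unchanged before any finetuning has happened.
C, D = 4, 3  # illustrative channel and embedding sizes
w_gamma = [[0.0] * D for _ in range(C)]
b_gamma = [1.0] * C
w_beta = [[0.0] * D for _ in range(C)]
b_beta = [0.0] * C

rng = random.Random(0)
visual_tokens = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(2)]
tactile_embedding = [rng.uniform(-1, 1) for _ in range(D)]

modulated = film(visual_tokens, tactile_embedding,
                 w_gamma, b_gamma, w_beta, b_beta)
```

With the assumed identity initialisation, `modulated` equals `visual_tokens`; once the two linear maps are trained, the tactile embedding rescales and shifts each visual channel. Compared with concatenating tactile tokens to the input sequence, this kind of modulation leaves the token count unchanged, which is one plausible reading of the "lightweight" claim in the abstract.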

Related papers