Fugu-MT paper translation (abstract): Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms

Paper overview: Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms

arxiv url: http://arxiv.org/abs/2605.17336v1
Date: Sun, 17 May 2026 09:09:30 GMT
Status: translation complete
System update date: 2026-05-19 17:57:47.906391
Title: Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms
Title (reference translation): Tactile-Based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms
Authors: Zhixiang Cao, Di Tian, Runwei Guan, Yanzhou Mu, Xiaolou Sun, Shaofeng Liang, Daizong Liu, Tao Huang, Yutao Yue, Henghui Ding, Bin Fang, Alex Zhou, Qing-Long Han, Hui Xiong
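Abstract summary: This paper proposes a hierarchical taxonomy that organizes the field along two primary dimensions: multimodal datasets and multimodal methods. On the data side, it categorizes resources into Tactile-Vision, Tactile-Language, Tactile-Vision-Language, and Tactile-Vision-Other datasets. On the method side, it structures prior work into three core pillars: (1) Multimodal Perception and Recognition; (2) Cross-Modal Generation, covering bidirectional translation between tactile, vision, and text; and (3) Multimodal Interaction, covering feedback control and language-guided manipulation.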
Reference score (independently computed attention score): 70.51538670020267
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Tactile sensing is a fundamental modality for embodied intelligence, offering unique and direct feedback on contact geometry, material properties, and interaction dynamics that remote sensors cannot replace. However, unimodal tactile perception is inherently limited by its sparse spatial coverage and lack of global semantic context. With the recent explosion in deep learning and large language models, integrating tactile sensing with vision and language has become essential to bridge physical interaction with semantic reasoning, leading to the emergence of Multimodal Tactile Fusion. Despite rapid progress, existing research remains fragmented across disparate datasets, sensing modalities, and tasks, and lacks a unified theoretical framework. To address this gap, this paper provides a comprehensive survey of multimodal tactile fusion research up to the first quarter of 2026. We propose a hierarchical taxonomy that organizes the field into two primary dimensions: multimodal datasets and multimodal methods. On the data side, we categorize resources into Tactile-Vision datasets, Tactile-Language datasets, Tactile-Vision-Language datasets, and Tactile-Vision-Other datasets. On the method side, we structure prior work into three core pillars: (1) Multimodal Perception and Recognition, which focuses on object understanding and grasp prediction; (2) Cross-Modal Generation, which focuses on bidirectional translation between tactile, vision, and text; and (3) Multimodal Interaction, which emphasizes feedback control and language-guided manipulation. Furthermore, we summarize representative tactile sensing hardware, review commonly used evaluation metrics and benchmark settings, and discuss current challenges and promising future directions.
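Abstract (reference translation): Tactile sensing is a fundamental modality for embodied intelligence, providing unique, direct feedback on contact geometry, material properties, and interaction dynamics that remote sensors cannot replace. However, unimodal tactile perception is inherently limited by its sparse spatial coverage and its lack of global semantic context. With the recent surge in deep learning and large language models, integrating touch with vision and language has become essential for bridging physical interaction and semantic reasoning, giving rise to multimodal tactile fusion. Despite rapid progress, existing research is fragmented across different datasets, sensing modalities, and tasks, and lacks a unified theoretical framework. To address this gap, the paper comprehensively surveys multimodal tactile fusion research up to the first quarter of 2026. It proposes a hierarchical taxonomy that organizes the field into two primary dimensions: multimodal datasets and multimodal methods. On the data side, it categorizes resources into Tactile-Vision, Tactile-Language, Tactile-Vision-Language, and Tactile-Vision-Other datasets. On the method side, it structures prior work into three pillars: (1) multimodal perception and recognition, focused on object understanding and grasp prediction; (2) cross-modal generation, focused on bidirectional translation between tactile, vision, and text; and (3) multimodal interaction, emphasizing feedback control and language-guided manipulation. It also summarizes representative tactile sensing hardware, reviews commonly used evaluation metrics and benchmark settings, and discusses current challenges and promising future directions.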

Related papers