Fugu-MT Paper Translation (Summary): VitaTouch: Property-Aware Vision-Tactile-Language Model for Robotic Quality Inspection in Manufacturing

Paper summary: VitaTouch: Property-Aware Vision-Tactile-Language Model for Robotic Quality Inspection in Manufacturing

arxiv url: http://arxiv.org/abs/2604.03322v1
Date: Thu, 02 Apr 2026 09:24:14 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:18.510391
Title: VitaTouch: Property-Aware Vision-Tactile-Language Model for Robotic Quality Inspection in Manufacturing
Authors: Junyi Zong, Qingxuan Jia, Meixian Shi, Tong Li, Jiayuan Li, Zihang Lv, Gang Chen, Fang Deng
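Abstract summary: VitaTouch is a vision-tactile-language model for material-property inference and natural-language attribute description. The authors construct VitaSet, a multimodal dataset with 186 objects, 52k images, and 5.1k human-verified instruction-answer pairs. VitaTouch achieves the best performance on HCT and the overall TVL benchmark, while remaining competitive on SSVTP.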
Reference score (independently computed attention score): 15.446632940347122
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Quality inspection in smart manufacturing requires identifying intrinsic material and surface properties beyond visible geometry, yet vision-only methods remain vulnerable to occlusion and reflection. We propose VitaTouch, a property-aware vision-tactile-language model for material-property inference and natural-language attribute description. VitaTouch uses modality-specific encoders and a dual Q-Former to extract language-relevant visual and tactile features, which are compressed into prefix tokens for a large language model. We align each modality with text and explicitly couple vision and touch through contrastive learning. We also construct VitaSet, a multimodal dataset with 186 objects, 52k images, and 5.1k human-verified instruction-answer pairs. VitaTouch achieves the best performance on HCT and the overall TVL benchmark, while remaining competitive on SSVTP. On VitaSet, it reaches 88.89% hardness accuracy, 75.13% roughness accuracy, and 54.81% descriptor recall; the material-description task further achieves a peak semantic similarity of 0.9009. With LoRA-based fine-tuning, VitaTouch attains 100.0%, 96.0%, and 92.0% accuracy for 2-, 3-, and 5-category defect recognition, respectively, and delivers 94.0% closed-loop recognition accuracy and 94.0% end-to-end sorting success in 100 laboratory robotic trials. More details are available at the project page: https://vitatouch.github.io/
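The abstract states that vision and touch are explicitly coupled through contrastive learning but does not give the loss. As a minimal sketch only, assuming a symmetric InfoNCE-style objective over paired vision and tactile embeddings (a common choice, not confirmed as the authors' formulation), the coupling could look like the following. The function names, the temperature of 0.07, and the embedding shapes are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(vision_emb, touch_emb, temperature=0.07):
    """Hypothetical vision-touch contrastive loss (assumed form, not from the paper).

    Row i of vision_emb and row i of touch_emb are treated as a positive pair;
    every other row in the batch acts as a negative.
    """
    v = l2_normalize(np.asarray(vision_emb, dtype=float))
    t = l2_normalize(np.asarray(touch_emb, dtype=float))
    logits = v @ t.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(v))                  # positives lie on the diagonal
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # vision -> touch
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # touch -> vision
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vision = rng.normal(size=(8, 32))
    touch = vision + 0.1 * rng.normal(size=(8, 32))  # noisy but aligned pairs
    print("aligned pairs :", symmetric_info_nce(vision, touch))
    print("shuffled pairs:", symmetric_info_nce(vision, touch[::-1]))
```

Under this assumed objective, correctly paired embeddings should yield a lower loss than mismatched ones, which is the property that pulls the two modalities into a shared space.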


Related papers