Fugu-MT Paper Translation (Abstract): TouchAnything: A Dataset and Framework for Bimanual Tactile Estimation from Egocentric Video

Paper summary: TouchAnything: A Dataset and Framework for Bimanual Tactile Estimation from Egocentric Video

arxiv url: http://arxiv.org/abs/2605.13083v1
Date: Wed, 13 May 2026 06:54:36 GMT
Status: Translation complete
Last updated in system: 2026-05-14 23:30:27.861265
Title: TouchAnything: A Dataset and Framework for Bimanual Tactile Estimation from Egocentric Video
Title (reference translation): TouchAnything: A Dataset and Framework for Bimanual Tactile Estimation from Egocentric Video
Authors: Jianyi Zhou, Ziteng Gao, Feiyang Hong, Zirui Liu, Guannan Zhang, Weisheng Dai, Ruichen Zhen, Chuqiao Lyu, Haotian Wu, Yinian Mao, Xushi Wang, Yuxiang Jiang, Wenbo Ding, Shuo Yang
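Abstract summary: EgoTouch is a large-scale egocentric dataset with dense tactile supervision for bimanual hand-object interaction. TouchAnything is a vision-to-touch prediction framework that uses the egocentric view as the primary input. The dataset, code, and benchmark will be publicly released.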
Reference score (independently computed attention score): 20.373348802426143
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Egocentric human video data, which captures rich human-environment interactions and can be collected at scale, has become a key driver of embodied intelligence research. However, existing egocentric datasets typically lack tactile sensing, a critical modality that provides direct cues about contact, force, and pressure in human-object interaction. Without such signals, models struggle to learn physically grounded representations of real-world interaction dynamics. While tactile sensors provide these cues, deploying high-quality tactile hardware at scale remains expensive and cumbersome. This raises a central question: can tactile feedback be inferred directly from visual observations, enabling scalable tactile supervision for egocentric video data and supporting physically grounded embodied learning? To enable research in this direction, we introduce EgoTouch, a large-scale multi-view egocentric dataset with dense tactile supervision for bimanual hand-object interaction. EgoTouch comprises 208 manipulation tasks spanning 1,891 episodes in diverse indoor and outdoor environments, with synchronized multi-view RGB (head-mounted egocentric and dual wrist-mounted cameras), bimanual 3D hand pose, and continuous pressure maps from wearable tactile sensors. Building on EgoTouch, we introduce TouchAnything, a baseline multi-view vision-to-touch prediction framework that uses the egocentric view as the primary input and flexibly leverages available wrist-mounted views at inference time. Experiments show that incorporating wrist-mounted views generally improves tactile prediction over egocentric-only input, achieving up to 5.0% relative improvement in Contact IoU and 6.1% relative improvement in Volumetric IoU. We will publicly release the dataset, code, and benchmark.
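Abstract (reference translation): Egocentric human video data captures rich human-environment interactions, can be collected at scale, and has become a key driver of embodied intelligence research. However, existing egocentric datasets typically lack tactile sensing, a critical modality that provides direct cues about contact, force, and pressure in human-object interaction. Without such signals, models struggle to learn physically grounded representations of real-world interaction dynamics. Tactile sensors provide these cues, but deploying high-quality tactile hardware at scale remains expensive and cumbersome. This raises a central question: can tactile feedback be inferred directly from visual observations, enabling scalable tactile supervision for egocentric video data and supporting physically grounded embodied learning? To enable research in this direction, the authors introduce EgoTouch, a large-scale multi-view egocentric dataset with dense tactile supervision for bimanual hand-object interaction. EgoTouch comprises 208 manipulation tasks spanning 1,891 episodes in diverse indoor and outdoor environments, with synchronized multi-view RGB (head-mounted egocentric and dual wrist-mounted cameras), bimanual 3D hand pose, and continuous pressure maps from wearable tactile sensors. Building on EgoTouch, TouchAnything is a baseline multi-view vision-to-touch prediction framework that uses the egocentric view as the primary input and flexibly leverages available wrist-mounted views at inference time. Experiments show that incorporating wrist-mounted views generally improves tactile prediction over egocentric-only input, achieving up to 5.0% relative improvement in Contact IoU and 6.1% relative improvement in Volumetric IoU. The dataset, code, and benchmark will be publicly released.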
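The abstract reports relative improvements in Contact IoU and Volumetric IoU but does not define either metric. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: Contact IoU is taken as the intersection-over-union of binary contact masks obtained by thresholding the pressure maps, and Volumetric IoU as the sum of element-wise minima divided by the sum of element-wise maxima of non-negative pressure maps. The function names, the `threshold` parameter, and the handling of empty maps are all hypothetical.

```python
import numpy as np


def contact_iou(pred, target, threshold=0.0):
    """IoU of binary contact masks (assumed definition: pressure > threshold)."""
    pred_mask = np.asarray(pred) > threshold
    target_mask = np.asarray(target) > threshold
    union = np.logical_or(pred_mask, target_mask).sum()
    if union == 0:
        # Assumption: no contact in either map counts as perfect agreement.
        return 1.0
    return float(np.logical_and(pred_mask, target_mask).sum() / union)


def volumetric_iou(pred, target):
    """Soft IoU over pressure magnitudes (assumed definition: sum(min) / sum(max))."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    denom = np.maximum(pred, target).sum()
    if denom == 0:
        # Assumption: two all-zero maps count as perfect agreement.
        return 1.0
    return float(np.minimum(pred, target).sum() / denom)


def relative_improvement(baseline, improved):
    """Relative change of a score with respect to a non-zero baseline."""
    return (improved - baseline) / baseline


# Toy 2x2 pressure maps (arbitrary units), for illustration only.
target = np.array([[0.0, 2.0], [4.0, 0.0]])
pred = np.array([[0.0, 1.0], [4.0, 3.0]])

c_iou = contact_iou(pred, target)      # 2 shared contact cells / 3 cells in union
v_iou = volumetric_iou(pred, target)   # (0+1+4+0) / (0+2+4+3)
gain = relative_improvement(0.60, 0.63)  # a 0.60 -> 0.63 change is a 5% relative gain
```

Under these assumptions, a "5.0% relative improvement" means the score grew by 5% of its baseline value (for example from 0.60 to 0.63), not by five absolute IoU points.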

Related papers