Fugu-MT Paper Translation (Abstract): FSAG: Enhancing Human-to-Dexterous-Hand Finger-Specific Affordance Grounding via Diffusion Models

Paper overview: FSAG: Enhancing Human-to-Dexterous-Hand Finger-Specific Affordance Grounding via Diffusion Models

arxiv url: http://arxiv.org/abs/2601.08246v2
Date: Thu, 12 Mar 2026 07:24:53 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:40.760897
Title: FSAG: Enhancing Human-to-Dexterous-Hand Finger-Specific Affordance Grounding via Diffusion Models
Title (reference translation): FSAG: Enhancing human-to-dexterous-hand finger-specific affordance grounding with diffusion models
Authors: Yifan Han, Yichuan Peng, Pengfei Yi, Junyan Li, Hanqing Wang, Gaojing Zhang, Qi Peng Liu, Wenzhao Lian
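Abstract summary: Dexterous grasp synthesis must satisfy both functional intent and physical feasibility, yet existing pipelines often decouple semantic grounding from refinement. This work proposes a data-efficient framework that bypasses robot grasp data collection by exploiting object-centric semantic priors in pretrained generative diffusion models. The results highlight a path toward scalable, hardware-agnostic dexterous manipulation driven by human demonstrations and pretrained generative models.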
Reference score (independently computed attention score): 11.581489292735418
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Dexterous grasp synthesis must jointly satisfy functional intent and physical feasibility, yet existing pipelines often decouple semantic grounding from refinement, yielding unstable or non-functional contacts under object and pose variations. This challenge is exacerbated by the high dimensionality and kinematic diversity of multi-fingered hands, which makes many methods rely on large, hardware-specific grasp datasets collected in simulation or through costly real-world trials. We propose a data-efficient framework that bypasses robot grasp data collection by exploiting object-centric semantic priors in pretrained generative diffusion models. Temporally aligned and fine-grained grasp affordances are extracted from raw human video demonstrations and fused with 3D scene geometry from depth images to infer semantically grounded contact targets. We further incorporate these affordance regions into the grasp refinement objective, explicitly guiding each fingertip toward its predicted region during optimization. The resulting system produces stable, human-intuitive multi-contact grasps across common objects and tools, while exhibiting strong generalization to previously unseen object instances within a category, pose variations, and multiple hand embodiments. This work (i) introduces a semantic affordance extraction pipeline leveraging vision-language generative priors for dexterous grasping, (ii) demonstrates cross-hand generalization without constructing hardware-specific grasp datasets, and (iii) establishes that a single depth modality suffices for high-performance grasp synthesis when coupled with foundation-model semantics. Our results highlight a path toward scalable, hardware-agnostic dexterous manipulation driven by human demonstrations and pretrained generative models.
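Abstract (reference translation): Dexterous grasp synthesis must jointly satisfy functional intent and physical feasibility, but existing pipelines often separate semantic grounding from refinement, which produces unstable or non-functional contacts when the object or its pose varies. The problem is made worse by the high dimensionality and kinematic diversity of multi-fingered hands, so many methods depend on large, hardware-specific grasp datasets collected in simulation or through costly real-world trials. This work proposes a data-efficient framework that avoids robot grasp data collection by exploiting object-centric semantic priors in pretrained generative diffusion models. Temporally aligned, fine-grained grasp affordances are extracted from raw human video demonstrations and fused with 3D scene geometry from depth images to infer semantically grounded contact targets. These affordance regions are also incorporated into the grasp refinement objective, explicitly guiding each fingertip toward its predicted region during optimization. The resulting system generates stable, human-intuitive multi-contact grasps across common objects and tools, and generalizes strongly to unseen object instances within a category, to pose variations, and to multiple hand embodiments. The work (i) introduces a semantic affordance extraction pipeline that leverages vision-language generative priors for dexterous grasping, (ii) demonstrates cross-hand generalization without building hardware-specific grasp datasets, and (iii) shows that a single depth modality is sufficient for high-performance grasp synthesis when combined with foundation-model semantics. The results highlight a path toward scalable, hardware-agnostic dexterous manipulation driven by human demonstrations and pretrained generative models.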
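
The abstract says the affordance regions enter the grasp refinement objective so that each fingertip is pulled toward its predicted region. The abstract does not give the actual cost, so the following is only a minimal sketch of one plausible term, under the assumption that each region is a set of 3D points and that the penalty is the squared distance from a fingertip to the nearest point of its own region. The names `squared_distance` and `fingertip_affordance_cost` are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fingertip_affordance_cost(
    fingertips: Sequence[Point],
    regions: Sequence[Sequence[Point]],
) -> float:
    """Illustrative (assumed) affordance term for grasp refinement.

    fingertips[i] is the current 3D position of finger i.
    regions[i] is the set of 3D points predicted as the contact
    region for finger i. The cost sums, over fingers, the squared
    distance from each fingertip to the nearest point of its region,
    so it is zero exactly when every fingertip lies on a region point.
    """
    if len(fingertips) != len(regions):
        raise ValueError("need exactly one affordance region per fingertip")
    return sum(
        min(squared_distance(tip, p) for p in region)
        for tip, region in zip(fingertips, regions)
    )


# Toy example with two fingers (all values are made up).
tips = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
regions = [
    [(0.0, 0.0, 1.0), (0.0, 2.0, 0.0)],  # nearest point is 1.0 away
    [(1.0, 0.0, 0.0)],                   # fingertip already on the region
]
cost = fingertip_affordance_cost(tips, regions)
```

In a full optimizer a term like this would presumably be weighted against stability and collision terms; those are outside what the abstract specifies and are not sketched here.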

Related papers