Fugu-MT paper translation (abstract): AFUN: Towards an Affordance Foundation Model for Functionality Understanding

Paper abstract: AFUN: Towards an Affordance Foundation Model for Functionality Understanding

arxiv url: http://arxiv.org/abs/2606.02551v1
Date: Mon, 01 Jun 2026 17:50:16 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:32.552576
Title: AFUN: Towards an Affordance Foundation Model for Functionality Understanding
Title (reference translation): AFUN: Towards an Affordance Foundation Model for Functionality Understanding
Authors: Zhaoning Wang, Yi Zhong, Jiawei Fu, Henrik I. Christensen, Jun Gao
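Abstract summary: We present AFUN as a step towards an affordance foundation model for functionality understanding. We build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema. AFUN outperforms all baselines by a large margin across 8 test sets from 4 benchmarks.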
Reference score (independently computed attention score): 12.890216832485647
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open and unstructured real-world environments. Yet, building an affordance foundation model that not only understands where and how the interaction should happen, but also generalizes across diverse environments, objects, and tasks, remains a long-standing research challenge. Existing methods typically address only part of this challenge, either localizing task-relevant regions without specifying executable motion, or predicting motion but with limited scalability. In this paper, we present AFUN, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels. We evaluate AFUN from three aspects: for affordance segmentation, AFUN outperforms all baselines by a large margin across 8 test sets from 4 benchmarks, improving mean gIoU/cIoU by +23.9/+26.3; for contact-point prediction, it predicts substantially more accurate points, with a 12.7–61.3% hit-rate gain over the best baseline; and for 3D motion, it achieves the best performance on all three test sets. AFUN can be deployed for real-world robot manipulation without finetuning for robot embodiment or using task-specific heuristics, demonstrating the ability to adapt to open-world affordance tasks. Project page: https://www.zhaoningwang.com/AFUN
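Abstract (reference translation): Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open, unstructured real-world environments. However, building an affordance foundation model that not only understands where and how an interaction should happen but also generalizes across diverse environments, objects, and tasks remains a long-standing research challenge. Existing methods either localize task-relevant regions without specifying executable motion, or predict motion with limited scalability. In this paper, we present AFUN as a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels. For affordance segmentation, AFUN improves mean gIoU/cIoU by +23.9/+26.3. AFUN can be deployed for real-world robot manipulation without finetuning for robot embodiment or using task-specific heuristics. Project page: https://www.zhaoningwang.com/AFUN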
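
Note on the segmentation metrics: the abstract reports gIoU and cIoU without defining them. The sketch below assumes the convention common in language-conditioned segmentation benchmarks, where gIoU is the mean of per-image IoUs and cIoU is the cumulative intersection divided by the cumulative union over the whole test set. The function name `giou_ciou` and the handling of empty masks are illustrative assumptions, not taken from the paper.

```python
def giou_ciou(preds, gts):
    """Return (gIoU, cIoU) for paired lists of binary masks (nested lists of 0/1).

    Assumed definitions (not confirmed by the abstract):
      gIoU = mean over images of intersection / union
      cIoU = sum of intersections / sum of unions
    """
    per_image_iou = []
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        # Flatten each 2D mask into a list of booleans.
        p = [bool(v) for row in pred for v in row]
        g = [bool(v) for row in gt for v in row]
        inter = sum(1 for a, b in zip(p, g) if a and b)
        union = sum(1 for a, b in zip(p, g) if a or b)
        # Assumption: empty prediction on an empty target counts as IoU = 1.
        per_image_iou.append(inter / union if union > 0 else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image_iou) / len(per_image_iou)
    ciou = total_inter / total_union if total_union > 0 else 1.0
    return giou, ciou


# Two toy 2x2 masks: the first overlaps on 1 of 2 pixels, the second matches exactly.
preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
gts = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]
giou, ciou = giou_ciou(preds, gts)
```

Under these assumptions, gIoU weights every image equally, while cIoU is dominated by images with large masks, which is why the two numbers are usually reported together.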


Related papers