Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale
- URL: http://arxiv.org/abs/2506.12009v1
- Date: Fri, 13 Jun 2025 17:57:18 GMT
- Title: Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale
- Authors: Junha Lee, Eunha Park, Chunghyun Park, Dahyun Kang, Minsu Cho,
- Abstract summary: We develop vision-language models that leverage pretrained part-aware vision backbones and a text-conditional heatmap decoder. Our models achieve promising performance on the existing 2D and 3D benchmarks and, notably, exhibit effectiveness in open-vocabulary cross-domain generalization.
- Score: 41.693908591580175
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Affordance grounding, the task of localizing object regions based on natural language descriptions of interactions, is a critical challenge for enabling intelligent agents to understand and interact with their environments. However, this task remains challenging due to the need for fine-grained part-level localization, the ambiguity arising from multiple valid interaction regions, and the scarcity of large-scale datasets. In this work, we introduce Affogato, a large-scale benchmark comprising 150K instances, annotated with open-vocabulary text descriptions and corresponding 3D affordance heatmaps across a diverse set of objects and interactions. Building on this benchmark, we develop simple yet effective vision-language models that leverage pretrained part-aware vision backbones and a text-conditional heatmap decoder. Our models trained with the Affogato dataset achieve promising performance on the existing 2D and 3D benchmarks and, notably, exhibit effectiveness in open-vocabulary cross-domain generalization. The Affogato dataset is publicly available at https://huggingface.co/datasets/project-affogato/affogato
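The abstract describes the model only at a high level; the following is a minimal, illustrative sketch (not the authors' released implementation) of how a text-conditional heatmap decoder could sit on top of a frozen part-aware vision backbone: patch tokens are conditioned on a pooled text embedding via single-token cross-attention with a residual connection, then scored per patch. The class name, feature dimensions, and layer choices are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class TextConditionalHeatmapDecoder(nn.Module):
    """Illustrative text-conditional heatmap decoder (all dims are assumptions)."""

    def __init__(self, vis_dim=768, txt_dim=512, hidden=256, num_heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden)    # project backbone patch tokens
        self.txt_proj = nn.Linear(txt_dim, hidden)    # project pooled text embedding
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        self.score = nn.Sequential(nn.LayerNorm(hidden), nn.Linear(hidden, 1))

    def forward(self, patch_feats, text_feat):
        # patch_feats: (B, N, vis_dim) patch tokens from a frozen part-aware backbone
        # text_feat:   (B, txt_dim) pooled embedding of the interaction description
        q = self.vis_proj(patch_feats)               # (B, N, hidden)
        kv = self.txt_proj(text_feat).unsqueeze(1)   # (B, 1, hidden) text as key/value
        attended, _ = self.cross_attn(q, kv, kv)     # inject text condition into patches
        fused = q + attended                         # residual keeps patch-specific detail
        logits = self.score(fused).squeeze(-1)       # (B, N) per-patch affordance logits
        return logits.sigmoid()                      # heatmap over patch locations

# usage with dummy features (196 patches from a ViT-sized backbone)
decoder = TextConditionalHeatmapDecoder()
heatmap = decoder(torch.randn(2, 196, 768), torch.randn(2, 512))
print(heatmap.shape)  # torch.Size([2, 196])
```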
Related papers
- SDMatte: Grafting Diffusion Models for Interactive Matting [16.575733536011658]
We propose a diffusion-driven interactive matting model, SDMatte, with three key contributions. First, we exploit the powerful priors of diffusion models and transform the text-driven interaction capability into visual prompt-driven interaction capability. Second, we integrate coordinate embeddings of visual prompts and opacity embeddings of target objects into U-Net, enhancing SDMatte's sensitivity to spatial position information. Third, we propose a masked self-attention mechanism that enables the model to focus on areas specified by visual prompts, leading to better performance.
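The masked self-attention mechanism is only named in this summary; below is a minimal, hedged sketch of one common way to restrict attention to regions indicated by a visual prompt. The function name `prompt_masked_self_attention`, the additive-bias formulation, and all shapes are assumptions for illustration, not SDMatte's actual design.

```python
import torch
import torch.nn.functional as F

def prompt_masked_self_attention(x, prompt_mask, num_heads=8):
    # x:           (B, N, C) token features
    # prompt_mask: (B, N) bool, True where the visual prompt covers a token
    B, N, C = x.shape
    d = C // num_heads
    qkv = x.reshape(B, N, num_heads, d).transpose(1, 2)  # (B, H, N, d); q = k = v for brevity
    attn = (qkv @ qkv.transpose(-2, -1)) / d ** 0.5      # (B, H, N, N) attention logits
    # suppress attention toward tokens outside the prompted region
    bias = torch.zeros(B, 1, 1, N, device=x.device)
    bias = bias.masked_fill(~prompt_mask[:, None, None, :], float("-inf"))
    attn = F.softmax(attn + bias, dim=-1)
    return (attn @ qkv).transpose(1, 2).reshape(B, N, C)

# usage with dummy tensors (assumes at least one prompted token per sample)
x = torch.randn(2, 64, 256)
mask = torch.zeros(2, 64, dtype=torch.bool)
mask[:, :8] = True
out = prompt_masked_self_attention(x, mask)
print(out.shape)  # torch.Size([2, 64, 256])
```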
arXiv Detail & Related papers (2025-08-01T09:00:48Z)
- IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction. We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images. We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
arXiv Detail & Related papers (2025-04-09T12:36:48Z)
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments consistently demonstrates our method's superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the first and largest multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan. The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions, as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
- Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited. To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy. In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
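The feature-aligned pre-training and distillation strategy is only named here; as a generic illustration (not the WS3D++ implementation), 2D-to-3D distillation is often written as a cosine-alignment loss between 3D point features and paired 2D vision-language features. The pairing assumption and the function `feature_alignment_loss` are illustrative stand-ins, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def feature_alignment_loss(point_feats, vlm_feats):
    # point_feats: (M, C) features of 3D points that project onto labeled pixels
    # vlm_feats:   (M, C) 2D vision-language features sampled at those pixels
    p = F.normalize(point_feats, dim=-1)
    t = F.normalize(vlm_feats, dim=-1)
    return (1.0 - (p * t).sum(dim=-1)).mean()  # mean (1 - cosine similarity)

# usage with dummy paired features
loss = feature_alignment_loss(torch.randn(1024, 512), torch.randn(1024, 512))
print(float(loss))
```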
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS).
We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z)
- InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions [23.296139146133573]
We present a large-scale dataset, InViG, for interactive visual grounding under language ambiguity.
Our dataset comprises over 520K images accompanied by open-ended goal-oriented disambiguation dialogues.
To the best of our knowledge, the InViG dataset is the first large-scale dataset for resolving open-ended interactive visual grounding.
arXiv Detail & Related papers (2023-10-18T17:57:05Z)
- Tri-level Joint Natural Language Understanding for Multi-turn Conversational Datasets [5.3361357265365035]
We present a novel tri-level joint natural language understanding approach that adds a domain level and explicitly exchanges semantic information between all levels.
We evaluate our model on two multi-turn datasets for which we are the first to conduct joint slot-filling and intent detection.
arXiv Detail & Related papers (2023-05-28T13:59:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.