HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding
- URL: http://arxiv.org/abs/2603.02329v1
- Date: Mon, 02 Mar 2026 19:00:37 GMT
- Title: HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding
- Authors: Lei Yao, Yong Chen, Yuejiao Su, Yi Wang, Moyun Liu, Lap-Pui Chau
- Abstract summary: Humans commonly identify 3D object affordance through observed interactions in images or videos, and once formed, such knowledge generalizes to novel objects. Inspired by this principle, we advocate a novel framework that leverages emerging multimodal large language models (MLLMs) for interaction intention-driven 3D affordance grounding, namely HAMMER.
- Score: 20.107744857311292
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans commonly identify 3D object affordance through observed interactions in images or videos, and once formed, such knowledge generalizes to novel objects. Inspired by this principle, we advocate a novel framework that leverages emerging multimodal large language models (MLLMs) for interaction intention-driven 3D affordance grounding, namely HAMMER. Instead of generating explicit object attribute descriptions or relying on off-the-shelf 2D segmenters, we aggregate the interaction intention depicted in the image into a contact-aware embedding and guide the model to infer textual affordance labels, ensuring it thoroughly excavates object semantics and contextual cues. We further devise a hierarchical cross-modal integration mechanism to fully exploit the complementary information from the MLLM for 3D representation refinement, and introduce a multi-granular geometry lifting module that infuses spatial characteristics into the extracted intention embedding, thus facilitating accurate 3D affordance localization. Extensive experiments on public datasets and our newly constructed corrupted benchmark demonstrate the superiority and robustness of HAMMER compared to existing approaches. All code and weights are publicly available.
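The abstract's cross-modal integration, at its core, lets an image-derived intention embedding attend over per-point 3D features so that points consistent with the observed interaction are weighted up. The paper's actual mechanism is hierarchical and MLLM-driven; the following is only a minimal, generic sketch of that attention idea in pure Python, with all names (`cross_attend`, the toy vectors) invented for illustration:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attend(query, keys, values):
    """Scaled dot-product attention: one query vector attends over a
    set of key/value vectors and returns the attended value."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim_v = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(dim_v)]

# Toy example: a 2-dim "intention" embedding attends over the features
# of two 3D points; the point whose feature aligns with the intention
# receives the larger attention weight.
intention = [1.0, 0.0]
point_feats = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attend(intention, point_feats, point_feats)
```

In a real model this would run per attention head over learned projections of MLLM tokens and point features; the sketch only shows the weighting principle.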
Related papers
- REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting [16.896443736904356]
Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions. We introduce REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation. Our framework seamlessly supports a range of 3D interaction tasks, including object removal, replacement, and style transfer.
arXiv Detail & Related papers (2025-10-18T08:53:08Z) - IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction. We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images. We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
arXiv Detail & Related papers (2025-04-09T12:36:48Z) - MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation [91.94869042117621]
Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. Recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation. We introduce MLLM-For3D, a framework that transfers knowledge from 2D MLLMs to 3D scene understanding.
arXiv Detail & Related papers (2025-03-23T16:40:20Z) - Hierarchical Cross-Modal Alignment for Open-Vocabulary 3D Object Detection [45.68105299990119]
Open-vocabulary 3D object detection (OV-3DOD) aims at localizing and classifying novel objects beyond closed sets. We propose a hierarchical framework, named HCMA, to simultaneously learn local object and global scene information for OV-3DOD.
arXiv Detail & Related papers (2025-03-10T17:55:22Z) - GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency [50.11520458252128]
Existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data. We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models. GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data.
arXiv Detail & Related papers (2024-12-12T17:59:03Z) - 3D Spatial Understanding in MLLMs: Disambiguation and Evaluation [13.614206918726314]
We propose techniques to enhance the model's ability to localize and disambiguate target objects. Our approach achieves state-of-the-art performance on conventional metrics that evaluate sentence similarity.
arXiv Detail & Related papers (2024-12-09T16:04:32Z) - Focus on Neighbors and Know the Whole: Towards Consistent Dense Multiview Text-to-Image Generator for 3D Creation [64.07560335451723]
CoSER is a novel consistent dense Multiview Text-to-Image Generator for Text-to-3D.
It achieves both efficiency and quality by meticulously learning neighbor-view coherence.
It aggregates information along motion paths explicitly defined by physical principles to refine details.
arXiv Detail & Related papers (2024-08-23T15:16:01Z) - MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the first and largest multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan. The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions, as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z) - When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models [130.40123493752816]
This survey provides a comprehensive overview of the methodologies enabling large language models to process, understand, and generate 3D data. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue.
arXiv Detail & Related papers (2024-05-16T16:59:58Z) - TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding [15.617150859765024]
We exploit Transformer for its natural suitability on permutation-invariant 3D point clouds data.
We propose a TransRefer3D network to extract entity-and-relation aware multimodal context.
Our proposed model significantly outperforms existing approaches by up to 10.6%.
arXiv Detail & Related papers (2021-08-05T05:47:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.