VAGNet: Grounding 3D Affordance from Human-Object Interactions in Videos
- URL: http://arxiv.org/abs/2602.20608v1
- Date: Tue, 24 Feb 2026 07:00:38 GMT
- Title: VAGNet: Grounding 3D Affordance from Human-Object Interactions in Videos
- Authors: Aihua Mao, Kaihang Huang, Yong-Jin Liu, Chee Seng Chan, Ying He
- Abstract summary: 3D object affordance grounding aims to identify regions on 3D objects that support human-object interaction (HOI). Most existing approaches rely on static visual or textual cues, neglecting that affordances are inherently defined by dynamic actions. We introduce video-guided 3D affordance grounding, which leverages dynamic interaction sequences to provide functional supervision.
- Score: 31.566690411188244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D object affordance grounding aims to identify regions on 3D objects that support human-object interaction (HOI), a capability essential to embodied visual reasoning. However, most existing approaches rely on static visual or textual cues, neglecting that affordances are inherently defined by dynamic actions. As a result, they often struggle to localize the true contact regions involved in real interactions. We take a different perspective: humans learn how to use objects by observing and imitating actions, not just by examining shapes. Motivated by this intuition, we introduce video-guided 3D affordance grounding, which leverages dynamic interaction sequences to provide functional supervision. To achieve this, we propose VAGNet, a framework that aligns video-derived interaction cues with 3D structure to resolve ambiguities that static cues cannot address. To support this new setting, we introduce PVAD, the first affordance dataset pairing HOI videos with 3D objects, providing functional supervision unavailable in prior work. Extensive experiments on PVAD show that VAGNet achieves state-of-the-art performance, significantly outperforming baselines driven by static cues. The code and dataset will be made publicly available.
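The abstract describes the core mechanism only at a high level (aligning video-derived interaction cues with 3D structure). As a rough illustration of what such an alignment could look like, below is a minimal PyTorch sketch in which object point features cross-attend to video frame features before a per-point affordance score is predicted. Every name, dimension, and design choice here is an assumption made for illustration, not VAGNet's actual architecture.

```python
# Minimal sketch of video-guided 3D affordance grounding as described in
# the abstract. Everything here (module names, dimensions, cross-attention
# fusion) is an illustrative assumption, not VAGNet's actual architecture.
import torch
import torch.nn as nn


class VideoGuidedAffordanceHead(nn.Module):
    """Hypothetical head: object points cross-attend to video frames,
    so interaction dynamics can disambiguate the contact region."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 1)
        )

    def forward(self, point_feats, video_feats):
        # point_feats: (B, N, dim) backbone features for N object points
        # video_feats: (B, T, dim) features for T sampled HOI video frames
        fused, _ = self.cross_attn(point_feats, video_feats, video_feats)
        fused = self.norm(point_feats + fused)      # residual fusion
        return self.score(fused).squeeze(-1)        # (B, N) affordance logits


if __name__ == "__main__":
    head = VideoGuidedAffordanceHead()
    pts = torch.randn(2, 2048, 256)   # e.g., 2048 points per object
    vid = torch.randn(2, 16, 256)     # e.g., 16 frames per interaction clip
    print(head(pts, vid).shape)       # torch.Size([2, 2048])
```

The residual cross-attention pattern is one standard way to let one modality (here, video dynamics) modulate another (per-point geometry); the paper's actual fusion may be quite different.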
Related papers
- VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model [29.52176445302312]
3D affordance grounding aims to highlight the actionable regions on 3D objects, which is crucial for robotic manipulation. We propose VideoAfford, which activates multimodal large language models with additional affordance segmentation capabilities. Our model significantly outperforms well-established methods and exhibits strong open-world generalization with affordance reasoning abilities.
arXiv Detail & Related papers (2026-02-10T10:36:57Z)
- REACT3D: Recovering Articulations for Interactive Physical 3D Scenes [96.27769519526426]
REACT3D is a framework that converts static 3D scenes into simulation-ready interactive replicas with consistent geometry. We achieve state-of-the-art performance on detection/segmentation and articulation metrics across diverse indoor scenes.
arXiv Detail & Related papers (2025-10-13T12:37:59Z)
- LSVG: Language-Guided Scene Graphs with 2D-Assisted Multi-Modal Encoding for 3D Visual Grounding [15.944945244005952]
3D visual grounding aims to localize the unique target described by natural language in 3D scenes. We propose a novel 3D visual grounding framework that constructs language-guided scene graphs with referred object discrimination.
arXiv Detail & Related papers (2025-05-07T02:02:15Z)
- GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding [53.42728468191711]
Open-Vocabulary 3D object affordance grounding aims to anticipate "action possibilities" regions on 3D objects with arbitrary instructions. We propose GREAT (GeometRy-intEntion collAboraTive inference) for Open-Vocabulary 3D Object Affordance Grounding.
arXiv Detail & Related papers (2024-11-29T11:23:15Z)
- Grounding 3D Scene Affordance From Egocentric Interactions [52.5827242925951]
Grounding 3D scene affordance aims to locate interactive regions in 3D environments.
We introduce a novel task: grounding 3D scene affordance from egocentric interactions.
arXiv Detail & Related papers (2024-09-29T10:46:19Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural-language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- ROAM: Robust and Object-Aware Motion Generation Using Neural Pose Descriptors [73.26004792375556]
This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object.
We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object (a rough sketch of this idea appears after the list below).
We demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects.
arXiv Detail & Related papers (2023-08-24T17:59:51Z)
- Grounding 3D Object Affordance from 2D Interactions in Images [128.6316708679246]
Grounding 3D object affordance seeks to locate objects' "action possibilities" regions in 3D space.
Humans possess the ability to perceive object affordances in the physical world through demonstration images or videos.
We devise an Interaction-driven 3D Affordance Grounding Network (IAG), which aligns the region features of objects from different sources.
arXiv Detail & Related papers (2023-03-18T15:37:35Z)
- Hindsight for Foresight: Unsupervised Structured Dynamics Models from Physical Interaction [24.72947291987545]
A key challenge for an agent learning to interact with the world is reasoning about the physical properties of objects.
We propose a novel approach for modeling the dynamics of a robot's interactions directly from unlabeled 3D point clouds and images.
arXiv Detail & Related papers (2020-08-02T11:04:49Z)
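As referenced in the ROAM entry above, here is a rough sketch of the SE(3)-equivariant descriptor-field idea. One common construction (assumed here; ROAM's actual method may differ) canonicalizes query points into the object's local frame before encoding, so the resulting descriptors do not change when the object is rotated or translated.

```python
# Rough sketch of an SE(3)-equivariant descriptor field, referenced from
# the ROAM entry above. The canonicalize-then-encode construction is an
# assumption for illustration, not ROAM's actual implementation.
import torch
import torch.nn as nn


class DescriptorField(nn.Module):
    """Maps 3D query points around an object to feature descriptors."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, dim),
        )

    def forward(self, query_world, obj_rot, obj_trans):
        # query_world: (B, M, 3) query points in world coordinates
        # obj_rot: (B, 3, 3) object rotation; obj_trans: (B, 3) translation
        # Express queries in the object frame: local = R^T (q - t)
        local = torch.einsum(
            "bij,bmj->bmi",
            obj_rot.transpose(1, 2),
            query_world - obj_trans[:, None],
        )
        # Descriptors depend only on object-relative position, hence they
        # transform consistently when the object pose changes.
        return self.mlp(local)  # (B, M, dim)
```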