SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model
- URL: http://arxiv.org/abs/2412.01550v1
- Date: Mon, 02 Dec 2024 14:37:57 GMT
- Title: SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model
- Authors: Chunlin Yu, Hanqing Wang, Ye Shi, Haoyang Luo, Sibei Yang, Jingyi Yu, Jingya Wang
- Abstract summary: 3D affordance segmentation aims to link human instructions to touchable regions of 3D objects for embodied manipulations.
Existing efforts typically adhere to single-object, single-affordance paradigms.
We introduce the Sequential 3D Affordance Reasoning task, which extends the traditional paradigm by reasoning over complex user intentions.
- Score: 46.773958062464004
- Abstract: 3D affordance segmentation aims to link human instructions to touchable regions of 3D objects for embodied manipulation. Existing efforts typically adhere to single-object, single-affordance paradigms, in which each affordance type or explicit instruction strictly corresponds to a specific affordance region; such paradigms are unable to handle long-horizon tasks and cannot actively reason about complex user intentions that often imply sequential affordances. In this paper, we introduce the Sequential 3D Affordance Reasoning task, which extends the traditional paradigm by reasoning over complex user intentions and then decomposing them into a series of segmentation maps. To this end, we construct the first instruction-based affordance segmentation benchmark that includes reasoning over both single and sequential affordances, comprising 180K instruction-point cloud pairs. Based on the benchmark, we propose our model, SeqAfford, which unlocks the 3D multi-modal large language model with additional affordance segmentation abilities, ensuring reasoning with world knowledge and fine-grained affordance grounding in a cohesive framework. We further introduce a multi-granular language-point integration module to enable 3D dense prediction. Extensive experimental evaluations show that our model excels over well-established methods and exhibits open-world generalization with sequential reasoning abilities.
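The sequential pipeline the abstract describes can be pictured as two stages: an instruction is first decomposed into an ordered list of affordance steps, and each step is then grounded as a per-point mask over the object point cloud. The sketch below illustrates that flow only; the decomposer and scorer are hypothetical stand-ins for the paper's multimodal LLM and segmentation head, not SeqAfford's implementation.

```python
# Minimal sketch of sequential affordance reasoning: instruction -> ordered
# affordance steps -> one per-point mask per step. All rules here are toy
# stand-ins for the learned model described in the paper.
import numpy as np

def decompose_instruction(instruction: str) -> list[str]:
    """Stand-in for the MLLM: map a user intention to ordered affordance steps."""
    # A real system would prompt the language model; one case is hard-coded here.
    if "pour" in instruction.lower():
        return ["grasp", "pour"]
    return ["grasp"]

def ground_affordance(points: np.ndarray, affordance: str) -> np.ndarray:
    """Stand-in for the segmentation head: score each point for one affordance.

    points: (N, 3) xyz coordinates; returns a boolean mask of shape (N,).
    """
    heights = points[:, 2]
    if affordance == "grasp":            # e.g. a handle region: mid-height points
        return (heights > 0.3) & (heights < 0.7)
    if affordance == "pour":             # e.g. a spout or opening: topmost points
        return heights >= 0.7
    return np.zeros(len(points), dtype=bool)

def sequential_affordance(points: np.ndarray, instruction: str) -> list[tuple[str, np.ndarray]]:
    steps = decompose_instruction(instruction)
    return [(step, ground_affordance(points, step)) for step in steps]

cloud = np.random.rand(2048, 3)          # toy stand-in for an object point cloud
for step, mask in sequential_affordance(cloud, "Pour me some water"):
    print(f"{step}: {mask.sum()} points selected")
```

In SeqAfford itself, both stages live inside a single model, so the decomposition can draw on world knowledge rather than hand-coded rules as above.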
Related papers
- CL3DOR: Contrastive Learning for 3D Large Multimodal Models via Odds Ratio on High-Resolution Point Clouds [1.9643285694999641]
We propose Contrastive Learning for 3D large multimodal models via Odds ratio on high-Resolution point clouds.
CL3DOR achieves state-of-the-art performance in 3D scene understanding and reasoning benchmarks.
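As a rough illustration of what an odds-ratio objective can look like (the precise CL3DOR formulation is in the paper; this sketch merely assumes it contrasts the model's likelihood of a preferred response against a hard negative):

```python
# Hedged sketch of an odds-ratio contrastive loss: push the odds of the
# positive response above the odds of the negative one. Illustrative only.
import torch
import torch.nn.functional as F

def log_odds(logp: torch.Tensor) -> torch.Tensor:
    # odds(p) = p / (1 - p), computed stably in log space:
    # log-odds = log p - log(1 - exp(log p))
    return logp - torch.log1p(-torch.exp(logp).clamp(max=1 - 1e-6))

def odds_ratio_loss(logp_pos: torch.Tensor, logp_neg: torch.Tensor) -> torch.Tensor:
    """Encourage higher odds for the positive response than for the negative one."""
    return -F.logsigmoid(log_odds(logp_pos) - log_odds(logp_neg)).mean()

# toy usage: per-sequence mean log-probabilities from a language model head
logp_pos = torch.tensor([-0.5, -0.8])
logp_neg = torch.tensor([-1.2, -1.0])
print(odds_ratio_loss(logp_pos, logp_neg))
```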
arXiv Detail & Related papers (2025-01-07T15:42:32Z)
- Point-PRC: A Prompt Learning Based Regulation Framework for Generalizable Point Cloud Analysis [39.85109385954641]
Recent works demonstrate that the performance of 3D point cloud recognition can be boosted remarkably by parameter-efficient prompt tuning.
We present a comprehensive regulation framework that allows the learnable prompts to actively interact with the well-learned general knowledge in large 3D models.
Surprisingly, our method not only achieves consistently stronger generalization but also enhances task-specific 3D recognition performance across various 3D domain generalization (3DDG) benchmarks by a clear margin.
arXiv Detail & Related papers (2024-10-27T10:35:47Z)
- 3D-GRES: Generalized 3D Referring Expression Segmentation [77.10044505645064]
3D Referring Expression Segmentation (3D-RES) is dedicated to segmenting a specific instance within a 3D space based on a natural language description.
Generalized 3D Referring Expression (3D-GRES) extends the capability to segment any number of instances based on natural language instructions.
arXiv Detail & Related papers (2024-07-30T08:59:05Z)
- Reasoning3D -- Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models [20.277479473218513]
We introduce a new task: zero-shot 3D reasoning for part searching and localization in objects.
We design a simple baseline method, Reasoning3D, with the capability to understand and execute complex commands.
We show that Reasoning3D can effectively localize and highlight parts of 3D objects based on implicit textual queries.
arXiv Detail & Related papers (2024-05-29T17:56:07Z)
- Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model [108.35777542298224]
Reason3D processes point cloud data and text prompts to produce textual responses and segmentation masks.
We propose a hierarchical mask decoder that employs a coarse-to-fine approach to segment objects within expansive scenes.
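A coarse-to-fine mask decoder of the kind this entry describes can be sketched as: score coarse regions for relevance, then predict a fine per-point mask only inside the top-scoring regions. The region granularity and scoring functions below are assumptions, not Reason3D's actual architecture.

```python
# Hedged sketch of coarse-to-fine masking: a coarse pass selects candidate
# regions, a fine pass refines per-point predictions within them.
import numpy as np

def coarse_to_fine_mask(points, region_ids, coarse_scores, fine_score_fn, k=2):
    """points: (N, 3) xyz; region_ids: (N,) region index per point;
    coarse_scores: (R,) relevance per region; fine_score_fn: (M, 3) -> (M,) scores."""
    top_regions = np.argsort(coarse_scores)[-k:]              # keep the k most relevant regions
    candidate = np.isin(region_ids, top_regions)              # coarse mask over points
    mask = np.zeros(len(points), dtype=bool)
    mask[candidate] = fine_score_fn(points[candidate]) > 0.5  # refine only within candidates
    return mask

# toy usage with random stand-ins for learned region scores and the fine head
pts = np.random.rand(1000, 3)
regions = np.random.randint(0, 8, size=1000)
coarse = np.random.rand(8)
fine = lambda p: np.random.rand(len(p))
print(coarse_to_fine_mask(pts, regions, coarse, fine).sum())
```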
arXiv Detail & Related papers (2024-05-27T17:59:41Z)
- Segment Any 3D Object with Language [58.471327490684295]
We introduce Segment any 3D Object with LanguagE (SOLE), a semantic- and geometric-aware visual-language learning framework with strong generalizability.
Specifically, we propose a multimodal fusion network to incorporate multimodal semantics in both backbone and decoder.
Our SOLE outperforms previous methods by a large margin on ScanNetv2, ScanNet200, and Replica benchmarks.
arXiv Detail & Related papers (2024-04-02T17:59:10Z)
- Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
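As a loose sketch of feature-aligned distillation in this spirit, a 3D student can be trained so its per-point features align with features from a pre-trained 2D vision-language teacher; the single-level cosine loss below is a simplifying assumption, whereas the paper describes a hierarchical, multi-level strategy.

```python
# Hedged sketch of feature alignment for distillation: pull student point
# features toward paired teacher features via a cosine objective.
import torch
import torch.nn.functional as F

def alignment_loss(student_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
    """student_feats, teacher_feats: (N, D) paired point/pixel features."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    return (1 - (s * t).sum(dim=-1)).mean()   # 1 - cosine similarity, averaged

# toy usage with random features standing in for network outputs
print(alignment_loss(torch.randn(128, 512), torch.randn(128, 512)))
```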
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- 3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation [33.20461146674787]
In 3D Referring Expression Segmentation (3D-RES), earlier approaches adopt a two-stage paradigm, extracting segmentation proposals and then matching them with referring expressions.
We introduce an innovative end-to-end Superpoint-Text Matching Network (3D-STMN) that is enriched by dependency-driven insights.
Our model not only sets new performance standards, registering an mIoU gain of 11.7 points, but also achieves a striking improvement in inference speed, surpassing traditional methods by 95.7 times.
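At a high level, proposal-free superpoint-text matching can be sketched as scoring each superpoint embedding directly against the referring-expression embedding; the similarity form and temperature below are assumptions, and the dependency-driven components of 3D-STMN are not modeled.

```python
# Hedged sketch: one-shot relevance scores between superpoints and a
# referring expression, with no intermediate proposal stage.
import torch
import torch.nn.functional as F

def superpoint_text_scores(sp_feats: torch.Tensor, text_feat: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """sp_feats: (S, D) superpoint embeddings; text_feat: (D,) expression embedding.
    Returns per-superpoint relevance in [0, 1]; thresholding yields the mask."""
    sp = F.normalize(sp_feats, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    return torch.sigmoid(sp @ t / temperature)   # temperature-scaled cosine similarity

scores = superpoint_text_scores(torch.randn(64, 256), torch.randn(256))
mask = scores > 0.5   # superpoints assigned to the referred object
print(mask.sum().item())
```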
arXiv Detail & Related papers (2023-08-31T11:00:03Z)
- Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding [58.924180772480504]
3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query.
We propose to leverage weakly supervised annotations to learn the 3D visual grounding model.
We design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner.
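One way to picture coarse-to-fine semantic matching is as a two-stage filter: coarse (e.g., category-level) similarity prunes the proposals, then fine (e.g., attribute-level) similarity ranks the survivors. Both similarity vectors below are hypothetical inputs standing in for the paper's learned matching model.

```python
# Hedged sketch of coarse-to-fine proposal selection for visual grounding.
import torch

def match_proposal(coarse_sim: torch.Tensor, fine_sim: torch.Tensor, keep: int = 5) -> int:
    """coarse_sim, fine_sim: (P,) similarities of each proposal to the sentence query."""
    survivors = torch.topk(coarse_sim, k=min(keep, coarse_sim.numel())).indices
    best = survivors[torch.argmax(fine_sim[survivors])]     # fine re-ranking of survivors
    return int(best)

# toy usage with random similarities standing in for the matching model
print(match_proposal(torch.rand(20), torch.rand(20)))
```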
arXiv Detail & Related papers (2023-07-18T13:49:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.