FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement
- URL: http://arxiv.org/abs/2503.04919v1
- Date: Thu, 06 Mar 2025 19:34:15 GMT
- Title: FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement
- Authors: Ian Huang, Yanan Bao, Karen Truong, Howard Zhou, Cordelia Schmid, Leonidas Guibas, Alireza Fathi
- Abstract summary: Multimodal Large Language Models (MLLMs) excel at semantic tasks, but their application to 3D scene generation is hindered by their limited grounding on 3D geometry. We introduce a novel framework, FirePlace, that applies existing MLLMs in (1) 3D geometric reasoning and the extraction of relevant geometric details from the 3D scene, (2) constructing and solving geometric constraints on the extracted low-level geometry, and (3) pruning for final placements that conform to common sense.
- Score: 42.2054752179292
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scene generation with 3D assets presents a complex challenge, requiring both high-level semantic understanding and low-level geometric reasoning. While Multimodal Large Language Models (MLLMs) excel at semantic tasks, their application to 3D scene generation is hindered by their limited grounding on 3D geometry. In this paper, we investigate how to best work with MLLMs in an object placement task. Towards this goal, we introduce a novel framework, FirePlace, that applies existing MLLMs in (1) 3D geometric reasoning and the extraction of relevant geometric details from the 3D scene, (2) constructing and solving geometric constraints on the extracted low-level geometry, and (3) pruning for final placements that conform to common sense. By combining geometric reasoning with the real-world understanding of MLLMs, our method can propose object placements that satisfy both geometric constraints and high-level semantic common-sense considerations. Our experiments show that these capabilities allow our method to place objects more effectively in complex scenes with intricate geometry, surpassing the quality of prior work.
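A minimal sketch of the constrain-then-prune placement loop that the abstract describes is given below. The helper names (Surface, ObjectSpec, fits_on_surface, mllm_commonsense_score) and the numbers are hypothetical stand-ins for illustration, not FirePlace's actual interface; in particular, mllm_commonsense_score is only a dummy heuristic where the real pipeline would query an MLLM for common-sense pruning.

```python
# Hypothetical sketch of a constrain-then-prune object placement loop.
# Names and numbers are illustrative, not FirePlace's actual interface.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Surface:
    center: Tuple[float, float, float]  # (x, y, z) of the surface center
    size: Tuple[float, float]           # (width, depth) of the horizontal extent
    height: float                       # z-coordinate of the supporting plane


@dataclass
class ObjectSpec:
    footprint: Tuple[float, float]      # (width, depth) the object needs on the surface
    clearance: float                    # free height required above the surface


def fits_on_surface(obj: ObjectSpec, surf: Surface, ceiling_z: float) -> bool:
    """Hard geometric constraint: the footprint fits and there is enough headroom."""
    fits_xy = obj.footprint[0] <= surf.size[0] and obj.footprint[1] <= surf.size[1]
    fits_z = surf.height + obj.clearance <= ceiling_z
    return fits_xy and fits_z


def mllm_commonsense_score(obj_name: str, surf: Surface) -> float:
    """Dummy stand-in for an MLLM call that rates how sensible a placement looks.
    Here it simply prefers lower surfaces, purely to keep the sketch runnable."""
    return 1.0 / (1.0 + surf.height)


def propose_placements(obj_name: str, obj: ObjectSpec, surfaces: List[Surface],
                       ceiling_z: float, top_k: int = 3) -> List[Surface]:
    # 1) keep only surfaces that satisfy the hard geometric constraints
    feasible = [s for s in surfaces if fits_on_surface(obj, s, ceiling_z)]
    # 2) rank by a common-sense score and prune to the best few candidates
    feasible.sort(key=lambda s: mllm_commonsense_score(obj_name, s), reverse=True)
    return feasible[:top_k]


if __name__ == "__main__":
    surfaces = [Surface((0.0, 0.0, 0.75), (1.2, 0.6), 0.75),   # desk
                Surface((2.0, 0.0, 0.45), (0.5, 0.5), 0.45)]   # side table
    lamp = ObjectSpec(footprint=(0.2, 0.2), clearance=0.5)
    print(propose_placements("desk lamp", lamp, surfaces, ceiling_z=2.4))
```

In this toy run, both surfaces pass the footprint and clearance checks, and the dummy score then orders the surviving candidates; swapping in an actual MLLM judgment at that step is the kind of common-sense pruning the paper argues for.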
Related papers
- PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning [82.55361351483005]
We present PointCoT, a novel framework that empowers MLLMs with explicit Chain-of-Thought (CoT) reasoning for 3D data. By leveraging a dual-stream multi-modal architecture, our method synergizes semantic appearance with geometric truth.
arXiv Detail & Related papers (2026-02-27T11:47:45Z)
- Language and Geometry Grounded Sparse Voxel Representations for Holistic Scene Understanding [22.218083641125137]
Existing 3D scene understanding methods mostly distill language features from 2D foundation models into 3D feature fields. We propose a novel approach that leverages language and geometry grounded sparse voxel representations to comprehensively model appearance, semantics, and geometry. Our approach achieves superior overall performance compared with state-of-the-art methods in holistic scene understanding and reconstruction.
arXiv Detail & Related papers (2026-02-17T17:10:13Z)
- SeqAffordSplat: Scene-level Sequential Affordance Reasoning on 3D Gaussian Splatting [85.87902260102652]
We introduce the novel task of Sequential 3D Gaussian Affordance Reasoning. We then propose SeqSplatNet, an end-to-end framework that directly maps an instruction to a sequence of 3D affordance masks. Our method sets a new state-of-the-art on our challenging benchmark, effectively advancing affordance reasoning from single-step interactions to complex, sequential tasks at the scene level.
arXiv Detail & Related papers (2025-07-31T17:56:55Z)
- Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations [112.29763628638112]
Object-X is a versatile multi-modal 3D representation framework. It encodes rich object embeddings and decodes them back into geometric and visual reconstructions. It supports a range of downstream tasks, including scene alignment, single-image 3D object reconstruction, and localization.
arXiv Detail & Related papers (2025-06-05T09:14:42Z)
- MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation [87.30919771444117]
Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning.
Recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation.
We introduce MLLM-For3D, a framework that transfers knowledge from 2D MLLMs to 3D scene understanding.
arXiv Detail & Related papers (2025-03-23T16:40:20Z)
- GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding [53.42728468191711]
Open-Vocabulary 3D object affordance grounding aims to anticipate "action possibilities" regions on 3D objects with arbitrary instructions. We propose GREAT (GeometRy-intEntion collAboraTive inference) for Open-Vocabulary 3D Object Affordance Grounding.
arXiv Detail & Related papers (2024-11-29T11:23:15Z)
- Diagram Formalization Enhanced Multi-Modal Geometry Problem Solver [11.69164802295844]
We introduce a new framework that integrates visual features, geometric formal language, and natural language representations.
We propose a novel synthetic data approach and create a large-scale geometric dataset, SynthGeo228K, annotated with both formal and natural language captions.
Our framework improves MLLMs' ability to process geometric diagrams and extends their application to open-ended tasks on the formalgeo7k dataset.
arXiv Detail & Related papers (2024-09-06T12:11:06Z)
- Geometry-guided Feature Learning and Fusion for Indoor Scene Reconstruction [14.225228781008209]
This paper proposes a novel geometry integration mechanism for 3D scene reconstruction.
Our approach incorporates 3D geometry at three levels, i.e. feature learning, feature fusion, and network supervision.
arXiv Detail & Related papers (2024-08-28T08:02:47Z)
- LLMI3D: MLLM-based 3D Perception from a Single 2D Image [77.13869413871028]
Multimodal large language models (MLLMs) excel in general capacity but underperform in 3D tasks. In this paper, we propose solutions for weak 3D local spatial object perception, poor text-based geometric numerical output, and inability to handle camera focal variations. We employ parameter-efficient fine-tuning for a pre-trained MLLM and develop LLMI3D, a powerful 3D perception MLLM.
arXiv Detail & Related papers (2024-08-14T10:00:16Z)
- NeurOCS: Neural NOCS Supervision for Monocular 3D Object Localization [80.3424839706698]
We present NeurOCS, a framework that uses instance masks and 3D boxes as input to learn 3D object shapes by means of differentiable rendering.
Our approach rests on insights in learning a category-level shape prior directly from real driving scenes.
We make critical design choices to learn object coordinates more effectively from an object-centric view.
arXiv Detail & Related papers (2023-05-28T16:18:41Z)
- Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection [70.71934539556916]
We learn geometry-guided depth estimation with projective modeling to advance monocular 3D object detection.
Specifically, a principled geometry formula with projective modeling of 2D and 3D depth predictions in the monocular 3D object detection network is devised.
Our method remarkably improves the detection performance of the state-of-the-art monocular-based method by 2.80% on the moderate test setting, without using extra data.
arXiv Detail & Related papers (2021-07-29T12:30:39Z)
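To make "projective modeling of 2D and 3D depth predictions" in the last entry above more concrete, the toy computation below applies the standard pinhole-camera relation depth = focal_length × physical_height / pixel_height. This is only the textbook relation that such geometry-guided depth methods build on, not the exact formula devised in that paper, and the numbers are invented for illustration.

```python
# Toy pinhole-camera depth estimate: z = f * H / h
# (textbook projective relation; not the paper's exact formulation)

def projective_depth(focal_px: float, object_height_m: float, bbox_height_px: float) -> float:
    """Depth in metres of an object of known physical height spanning bbox_height_px pixels."""
    return focal_px * object_height_m / bbox_height_px


if __name__ == "__main__":
    # A roughly 1.5 m tall object spanning 120 pixels under a 720-pixel focal length
    # sits about 9 m from the camera.
    print(projective_depth(focal_px=720.0, object_height_m=1.5, bbox_height_px=120.0))
```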
This list is automatically generated from the titles and abstracts of the papers on this site.