Related papers: CityLoc: 6DoF Pose Distributional Localization for Text Descriptions in Large-Scale Scenes with Gaussian Representation

URL: http://arxiv.org/abs/2501.08982v2
Date: Mon, 03 Feb 2025 10:49:47 GMT
Title: CityLoc: 6DoF Pose Distributional Localization for Text Descriptions in Large-Scale Scenes with Gaussian Representation
Authors: Qi Ma, Runyi Yang, Bin Ren, Nicu Sebe, Ender Konukoglu, Luc Van Gool, Danda Pani Paudel
Abstract summary: We introduce a method to generate distributions of camera poses conditioned on textual descriptions. Our approach employs a diffusion-based architecture to refine noisy 6DoF camera poses towards plausible locations. We validate our method's superiority by comparing it against standard distribution estimation methods across five large-scale datasets.
Score: 99.23408146027462
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Localizing textual descriptions within large-scale 3D scenes presents inherent ambiguities, such as identifying all traffic lights in a city. Addressing this, we introduce a method to generate distributions of camera poses conditioned on textual descriptions, facilitating robust reasoning for broadly defined concepts. Our approach employs a diffusion-based architecture to refine noisy 6DoF camera poses towards plausible locations, with conditional signals derived from pre-trained text encoders. Integration with the pretrained Vision-Language Model, CLIP, establishes a strong linkage between text descriptions and pose distributions. Enhancement of localization accuracy is achieved by rendering candidate poses using 3D Gaussian splatting, which corrects misaligned samples through visual reasoning. We validate our method's superiority by comparing it against standard distribution estimation methods across five large-scale datasets, demonstrating consistent outperformance. Code, datasets and more information will be publicly available at our project page.

Related papers

A Guide to Structureless Visual Localization [63.41481414949785]
Methods that estimate the camera pose of a query image in a known scene are core components of many applications, including self-driving cars and augmented / mixed reality systems. State-of-the-art visual localization algorithms are structure-based, i.e., they store a 3D model of the scene and use 2D-3D correspondences between the query image and 3D points in the model for camera pose estimation. This paper is dedicated to providing, to the best of our knowledge, the first comprehensive discussion and comparison of structureless methods.
arXiv Detail & Related papers (2025-04-24T15:08:36Z)
EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding [63.99937807085461]
3D occupancy prediction provides a comprehensive description of the surrounding scenes. Most existing methods focus on offline perception from one or a few views. We formulate an embodied 3D occupancy prediction task to target this practical scenario and propose a Gaussian-based EmbodiedOcc framework to accomplish it.
arXiv Detail & Related papers (2024-12-05T17:57:09Z)
Language Driven Occupancy Prediction [11.208411421996052]
We introduce LOcc, an effective and generalizable framework for open-vocabulary occupancy prediction. Our pipeline presents a feasible way to dig into the valuable semantic information of images. LOcc effectively uses the generated language ground truth to guide the learning of 3D language volume.
arXiv Detail & Related papers (2024-11-25T03:47:10Z)
LoGS: Visual Localization via Gaussian Splatting with Fewer Training Images [7.363332481155945]
This paper presents a vision-based localization pipeline utilizing the 3D Gaussian Splatting (GS) technique as scene representation. During the mapping phase, structure-from-motion (SfM) is applied first, followed by the generation of a GS map. High-precision pose is achieved through an analysis-by-synthesis manner on the map.
arXiv Detail & Related papers (2024-10-15T11:17:18Z)
Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models [57.37244894146089]
We propose Diff2Scene, which leverages frozen representations from text-image generative models, along with salient-aware and geometric-aware masks, for open-vocabulary 3D semantic segmentation and visual grounding tasks. We show that it outperforms competitive baselines and achieves significant improvements over state-of-the-art methods.
arXiv Detail & Related papers (2024-07-18T16:20:56Z)
Coherent and Multi-modality Image Inpainting via Latent Space Optimization [61.99406669027195]
PILOT (inPainting vIa Latent OpTimization) is an optimization approach grounded on a novel semantic centralization and background preservation loss. Our method searches latent spaces capable of generating inpainted regions that exhibit high fidelity to user-provided prompts while maintaining coherence with the background.
arXiv Detail & Related papers (2024-07-10T19:58:04Z)
Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model [65.58911408026748]
We propose Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts. We first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline. We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation.
arXiv Detail & Related papers (2024-04-28T04:05:10Z)
WorDepth: Variational Language Prior for Monocular Depth Estimation [47.614203035800735]
We investigate whether two inherently ambiguous modalities can be used in conjunction to produce metric-scaled reconstructions. We focus on monocular depth estimation, the problem of predicting a dense depth map from a single image. Our approach is trained alternatingly between the text and image branches.
arXiv Detail & Related papers (2024-04-04T17:54:33Z)
3DGS-ReLoc: 3D Gaussian Splatting for Map Representation and Visual ReLocalization [13.868258945395326]
This paper presents a novel system designed for 3D mapping and visual relocalization using 3D Gaussian Splatting. Our proposed method uses LiDAR and camera data to create accurate and visually plausible representations of the environment.
arXiv Detail & Related papers (2024-03-17T23:06:12Z)
Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering [118.53208190209517]
We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning. Our method can effectively learn the prompts to improve the matches between the input text and the generated images.
arXiv Detail & Related papers (2024-01-12T03:46:29Z)
GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs [74.98581417902201]
We propose a novel framework to generate compositional 3D scenes from scene graphs. By exploiting node and edge information in scene graphs, our method makes better use of the pretrained text-to-image diffusion model. We conduct both qualitative and quantitative experiments to validate the effectiveness of GraphDreamer.
arXiv Detail & Related papers (2023-11-30T18:59:58Z)
Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation. It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
Compositional 3D Scene Generation using Locally Conditioned Diffusion [49.5784841881488]
We introduce locally conditioned diffusion as an approach to compositional scene diffusion. We demonstrate a score distillation sampling-based text-to-3D synthesis pipeline that enables compositional 3D scene generation at a higher fidelity than relevant baselines.
arXiv Detail & Related papers (2023-03-21T22:37:16Z)
SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control. In addition to a global text prompt that describes the entire scene, the user provides a segmentation map. We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
Learning and Matching Multi-View Descriptors for Registration of Point Clouds [48.25586496457587]
We first propose a multi-view local descriptor, which is learned from the images of multiple views, for the description of 3D keypoints. Then, we develop a robust matching approach that aims to reject outlier matches based on efficient inference. We demonstrate that our approaches improve registration on public scanning and multi-view stereo datasets.
arXiv Detail & Related papers (2018-07-16T01:58:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.