HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring Unconstrained Photo Collections
- URL: http://arxiv.org/abs/2404.16845v2
- Date: Sun, 4 Aug 2024 18:51:59 GMT
- Title: HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring Unconstrained Photo Collections
- Authors: Chen Dudai, Morris Alper, Hana Bezalel, Rana Hanocka, Itai Lang, Hadar Averbuch-Elor,
- Abstract summary: We present a localization system that connects neural representations of scenes depicting large-scale landmarks with text describing a semantic region within the scene.
Our approach is built upon the premise that images physically grounded in space can provide a powerful supervision signal for localizing new concepts.
Our results show that HaLo-NeRF can accurately localize a variety of semantic concepts related to architectural landmarks.
- Score: 19.05215193265488
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Internet image collections containing photos captured by crowds of photographers show promise for enabling digital exploration of large-scale tourist landmarks. However, prior works focus primarily on geometric reconstruction and visualization, neglecting the key role of language in providing a semantic interface for navigation and fine-grained understanding. In constrained 3D domains, recent methods have leveraged vision-and-language models as a strong prior of 2D visual semantics. While these models display an excellent understanding of broad visual semantics, they struggle with unconstrained photo collections depicting such tourist landmarks, as they lack expert knowledge of the architectural domain. In this work, we present a localization system that connects neural representations of scenes depicting large-scale landmarks with text describing a semantic region within the scene, by harnessing the power of SOTA vision-and-language models with adaptations for understanding landmark scene semantics. To bolster such models with fine-grained knowledge, we leverage large-scale Internet data containing images of similar landmarks along with weakly-related textual information. Our approach is built upon the premise that images physically grounded in space can provide a powerful supervision signal for localizing new concepts, whose semantics may be unlocked from Internet textual metadata with large language models. We use correspondences between views of scenes to bootstrap spatial understanding of these semantics, providing guidance for 3D-compatible segmentation that ultimately lifts to a volumetric scene representation. Our results show that HaLo-NeRF can accurately localize a variety of semantic concepts related to architectural landmarks, surpassing the results of other 3D models as well as strong 2D segmentation baselines. Our project page is at https://tau-vailab.github.io/HaLo-NeRF/.
Related papers
- Multiview Scene Graph [7.460438046915524]
A proper scene representation is central to the pursuit of spatial intelligence.
We propose to build Multiview Scene Graphs (MSG) from unposed images.
MSG represents a scene topologically with interconnected place and object nodes.
arXiv Detail & Related papers (2024-10-15T02:04:05Z) - AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the problem with more semantics.
We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z) - Self-supervised Learning of Neural Implicit Feature Fields for Camera Pose Refinement [32.335953514942474]
This paper proposes to jointly learn the scene representation along with a 3D dense feature field and a 2D feature extractor.
We learn the underlying geometry of the scene with an implicit field through volumetric rendering and design our feature field to leverage intermediate geometric information encoded in the implicit field.
Visual localization is then achieved by aligning the image-based features and the rendered volumetric features.
arXiv Detail & Related papers (2024-06-12T17:51:53Z) - Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning [119.99066522299309]
KYN is a novel method for single-view scene reconstruction that reasons about semantic and spatial context to predict each point's density.
We show that KYN improves 3D shape recovery compared to predicting density for each 3D point in isolation.
We achieve state-of-the-art results in scene and object reconstruction on KITTI-360, and show improved zero-shot generalization compared to prior work.
arXiv Detail & Related papers (2024-04-04T17:59:59Z) - ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and
Planning [125.90002884194838]
ConceptGraphs is an open-vocabulary graph-structured representation for 3D scenes.
It is built by leveraging 2D foundation models and fusing their output to 3D by multi-view association.
We demonstrate the utility of this representation through a number of downstream planning tasks.
arXiv Detail & Related papers (2023-09-28T17:53:38Z) - Lowis3D: Language-Driven Open-World Instance-Level 3D Scene
Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z) - PLA: Language-Driven Open-Vocabulary 3D Scene Understanding [57.47315482494805]
Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space.
Recent breakthrough of 2D open-vocabulary perception is driven by Internet-scale paired image-text data with rich vocabulary concepts.
We propose to distill knowledge encoded in pre-trained vision-language (VL) foundation models through captioning multi-view images from 3D.
arXiv Detail & Related papers (2022-11-29T15:52:22Z) - Towers of Babel: Combining Images, Language, and 3D Geometry for
Learning Multimodal Vision [50.07532560364523]
We present a new, large-scale dataset of landmark photo collections that contains descriptive text in the form of captions and hierarchical category names.
WikiScenes forms a new testbed for multimodal reasoning involving images, text, and 3D geometry.
arXiv Detail & Related papers (2021-08-12T17:16:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.