OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding
- URL: http://arxiv.org/abs/2406.08009v1
- Date: Wed, 12 Jun 2024 08:59:33 GMT
- Title: OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding
- Authors: Yinan Deng, Jiahui Wang, Jingyu Zhao, Jianyu Dou, Yi Yang, Yufeng Yue,
- Abstract summary: We introduce Open, an innovative approach to build open-vocabulary object-level Neural Fields with fine-grained understanding.
In essence, Open establishes a robust framework for efficient and watertight scene modeling and comprehension at the object-level.
The results on multiple datasets demonstrate that Open achieves superior performance in zero-shot semantic and retrieval tasks.
- Score: 21.64446104872021
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, there has been a surge of interest in open-vocabulary 3D scene reconstruction facilitated by visual language models (VLMs), which showcase remarkable capabilities in open-set retrieval. However, existing methods face some limitations: they either focus on learning point-wise features, resulting in blurry semantic understanding, or solely tackle object-level reconstruction, thereby overlooking the intricate details of the object's interior. To address these challenges, we introduce OpenObj, an innovative approach to build open-vocabulary object-level Neural Radiance Fields (NeRF) with fine-grained understanding. In essence, OpenObj establishes a robust framework for efficient and watertight scene modeling and comprehension at the object-level. Moreover, we incorporate part-level features into the neural fields, enabling a nuanced representation of object interiors. This approach captures object-level instances while maintaining a fine-grained understanding. The results on multiple datasets demonstrate that OpenObj achieves superior performance in zero-shot semantic segmentation and retrieval tasks. Additionally, OpenObj supports real-world robotics tasks at multiple scales, including global movement and local manipulation.
Related papers
- Interacted Object Grounding in Spatio-Temporal Human-Object Interactions [70.8859442754261]
We introduce a new open-world benchmark: Grounding Interacted Objects (GIO)
An object grounding task is proposed expecting vision systems to discover interacted objects.
We propose a 4D question-answering framework (4D-QA) to discover interacted objects from diverse videos.
arXiv Detail & Related papers (2024-12-27T09:08:46Z) - Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation [47.047267066525265]
We introduce a novel approach that incorporates object-level contextual knowledge within images.
Our proposed approach achieves state-of-the-art performance with strong generalizability across diverse datasets.
arXiv Detail & Related papers (2024-11-26T06:34:48Z) - OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding [43.69535335079362]
Open-vocabulary 3D scene understanding (OV-3D) aims to localize and classify novel objects beyond the closed object classes.
Existing approaches and benchmarks primarily focus on the open vocabulary problem within the context of object classes.
We introduce a more challenging task called Generalized Open-Vocabulary 3D Scene Understanding (GOV-3D) to explore the open vocabulary problem beyond object classes.
arXiv Detail & Related papers (2024-08-20T17:31:48Z) - LOSS-SLAM: Lightweight Open-Set Semantic Simultaneous Localization and Mapping [9.289001828243512]
We show that a system of identifying, localizing, and encoding objects is tightly coupled with probabilistic graphical models for performing open-set semantic simultaneous localization and mapping (SLAM)
Results are presented demonstrating that the proposed lightweight object encoding can be used to perform more accurate object-based SLAM than existing open-set methods.
arXiv Detail & Related papers (2024-04-05T19:42:55Z) - OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning [95.6696714640357]
We propose a new task 'open-world video instance segmentation and captioning'
It requires to detect, segment, track and describe with rich captions never before seen objects.
We develop an object abstractor and an object-to-text abstractor.
arXiv Detail & Related papers (2024-04-04T17:59:58Z) - Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS)
We construct a large-scale complex scene dataset (textbfOVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - Learning Open-World Object Proposals without Learning to Classify [110.30191531975804]
We propose a classification-free Object Localization Network (OLN) which estimates the objectness of each region purely by how well the location and shape of a region overlaps with any ground-truth object.
This simple strategy learns generalizable objectness and outperforms existing proposals on cross-category generalization.
arXiv Detail & Related papers (2021-08-15T14:36:02Z) - Object-to-Scene: Learning to Transfer Object Knowledge to Indoor Scene
Recognition [19.503027767462605]
We propose an Object-to-Scene (OTS) method, which extracts object features and learns object relations to recognize indoor scenes.
OTS outperforms the state-of-the-art methods by more than 2% on indoor scene recognition without using any additional streams.
arXiv Detail & Related papers (2021-08-01T08:37:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.