Related papers: ImageNet3D: Towards General-Purpose Object-Level 3D Understanding

URL: http://arxiv.org/abs/2406.09613v1
Date: Thu, 13 Jun 2024 22:44:26 GMT
Title: ImageNet3D: Towards General-Purpose Object-Level 3D Understanding
Authors: Wufei Ma, Guanning Zeng, Guofeng Zhang, Qihao Liu, Letian Zhang, Adam Kortylewski, Yaoyao Liu, Alan Yuille
Abstract summary: We present ImageNet3D, a large dataset for general-purpose object-level 3D understanding. ImageNet3D augments 200 categories from the ImageNet dataset with 2D bounding box, 3D pose, 3D location annotations, and image captions interleaved with 3D information. We consider two new tasks, probing of object-level 3D awareness and open vocabulary pose estimation, besides standard classification and pose estimation.
Score: 20.837297477080945
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A vision model with general-purpose object-level 3D understanding should be capable of inferring both 2D (e.g., class name and bounding box) and 3D information (e.g., 3D location and 3D viewpoint) for arbitrary rigid objects in natural images. This is a challenging task, as it involves inferring 3D information from 2D signals and, most importantly, generalizing to rigid objects from unseen categories. However, existing datasets with object-level 3D annotations are often limited by the number of categories or the quality of annotations. Models developed on these datasets become specialists for certain categories or domains, and fail to generalize. In this work, we present ImageNet3D, a large dataset for general-purpose object-level 3D understanding. ImageNet3D augments 200 categories from the ImageNet dataset with 2D bounding box, 3D pose, 3D location annotations, and image captions interleaved with 3D information. With the new annotations available in ImageNet3D, we could (i) analyze the object-level 3D awareness of visual foundation models, (ii) study and develop general-purpose models that infer both 2D and 3D information for arbitrary rigid objects in natural images, and (iii) integrate unified 3D models with large language models for 3D-related reasoning. We consider two new tasks, probing of object-level 3D awareness and open vocabulary pose estimation, besides standard classification and pose estimation. Experimental results on ImageNet3D demonstrate the potential of our dataset in building vision models with stronger general-purpose object-level 3D understanding.

Related papers

N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models [45.008146973701855]
N3D-VLM is a novel unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning. Unlike conventional end-to-end models that directly predict answers from RGB/RGB-D inputs, our approach equips the model with native 3D object perception capabilities.
arXiv Detail & Related papers (2025-12-18T14:03:44Z)
Detect Anything 3D in the Wild [34.293450721860616]
We introduce DetAny3D, a promptable 3D detection foundation model capable of detecting any novel object under arbitrary camera configurations. To effectively transfer 2D knowledge to 3D, DetAny3D incorporates two core modules: the 2D Aggregator and the 3D Interpreter with Zero-Embedding Mapping. Experimental results validate the strong generalization of our DetAny3D, which achieves state-of-the-art performance on unseen categories and novel camera configurations.
arXiv Detail & Related papers (2025-04-10T17:59:22Z)
Unifying 2D and 3D Vision-Language Understanding [85.84054120018625]
We introduce UniVLG, a unified architecture for 2D and 3D vision-language learning. UniVLG bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems.
arXiv Detail & Related papers (2025-03-13T17:56:22Z)
SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding [10.81711535075112]
3D Visual Grounding aims to locate objects in 3D scenes based on textual descriptions. We introduce SeeGround, a zero-shot 3DVG framework leveraging 2D Vision-Language Models (VLMs) trained on large-scale 2D data. We propose two modules: the Perspective Adaptation Module, which dynamically selects viewpoints for query-relevant image rendering, and the Fusion Alignment Module, which integrates 2D images with 3D spatial descriptions.
arXiv Detail & Related papers (2024-12-05T17:58:43Z)
General Geometry-aware Weakly Supervised 3D Object Detection [62.26729317523975]
A unified framework is developed for learning 3D object detectors from RGB images and associated 2D boxes. Experiments on KITTI and SUN-RGBD datasets demonstrate that our method yields surprisingly high-quality 3D bounding boxes with only 2D annotation.
arXiv Detail & Related papers (2024-07-18T17:52:08Z)
Uni3D: Exploring Unified 3D Representation at Scale [66.26710717073372]
We present Uni3D, a 3D foundation model to explore the unified 3D representation at scale. Uni3D uses a 2D ViT end-to-end pretrained to align the 3D point cloud features with the image-text aligned features. We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild.
arXiv Detail & Related papers (2023-10-10T16:49:21Z)
Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection [24.871590175483096]
Point cloud-based open-vocabulary 3D object detection aims to detect 3D categories that do not have ground-truth annotations in the training set. Previous approaches leverage large-scale richly-annotated image datasets as a bridge between 3D and category semantics. We propose Object2Scene, the first approach that leverages large-scale large-vocabulary 3D object datasets to augment existing 3D scene datasets for open-vocabulary 3D object detection.
arXiv Detail & Related papers (2023-09-18T03:31:53Z)
3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z)
OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation [107.71752592196138]
We propose OmniObject3D, a large vocabulary 3D object dataset with massive high-quality real-scanned 3D objects. It comprises 6,000 scanned objects in 190 daily categories, sharing common classes with popular 2D datasets. Each 3D object is captured with both 2D and 3D sensors, providing textured meshes, point clouds, multiview rendered images, and multiple real-captured videos.
arXiv Detail & Related papers (2023-01-18T18:14:18Z)
3D-TOGO: Towards Text-Guided Cross-Category 3D Object Generation [107.46972849241168]
The 3D-TOGO model generates 3D objects in the form of neural radiance fields with good texture. Experiments on the largest 3D object dataset (i.e., ABO) are conducted to verify that 3D-TOGO can better generate high-quality 3D objects.
arXiv Detail & Related papers (2022-12-02T11:31:49Z)
Understanding Pixel-level 2D Image Semantics with 3D Keypoint Knowledge Engine [56.09471066808409]
We propose a new method for predicting image-corresponding semantics in the 3D domain and then projecting them back onto 2D images to achieve pixel-level understanding. We build a large-scale keypoint knowledge engine called KeypointNet, which contains 103,450 keypoints and 8,234 3D models from 16 object categories.
arXiv Detail & Related papers (2021-11-21T13:25:20Z)
3D Object Recognition By Corresponding and Quantizing Neural 3D Scene Representations [29.61554189447989]
We propose a system that learns to detect objects and infer their 3D poses in RGB-D images. Many existing systems can identify objects and infer 3D poses, but they heavily rely on human labels and 3D annotations.
arXiv Detail & Related papers (2020-10-30T13:56:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.