Do large language vision models understand 3D shapes?
- URL: http://arxiv.org/abs/2412.10908v3
- Date: Sun, 19 Jan 2025 23:02:47 GMT
- Title: Do large language vision models understand 3D shapes?
- Authors: Sagi Eppel
- Abstract summary: Large vision language models (LVLMs) are the leading AI approach for achieving a general visual understanding of the world. This work tests whether LVLMs truly understand 3D shapes by testing the models' ability to identify and match objects of the exact same 3D shape.
- Score: 0.6993026261767287
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large vision language models (LVLMs) are the leading AI approach for achieving a general visual understanding of the world. Models such as GPT, Claude, Gemini, and LLama can use images to understand and analyze complex visual scenes. 3D objects and shapes are the basic building blocks of the world, and recognizing them is a fundamental part of human perception. The goal of this work is to test whether LVLMs truly understand 3D shapes by testing the models' ability to identify and match objects of the exact same 3D shape but with different orientations and materials/textures. A large number of test images were created using CGI, covering a huge number of highly diverse objects, materials, and scenes. The results show that the ability of such models to match 3D shapes is significantly below that of humans but much higher than random guessing, suggesting that the models have gained some abstract understanding of 3D shapes but still trail far behind humans in this task. Mainly, it seems that the models can easily identify the same object in a different orientation, as well as match identical 3D shapes that share the same orientation but differ in materials and textures. However, when both the object's material and orientation are changed, all models perform poorly relative to humans. Code and benchmark are available.
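To make the evaluation protocol described in the abstract concrete, the sketch below pairs two rendered images, asks a vision-language model whether they show the exact same 3D shape regardless of orientation and material, and scores the answers against a random-guess baseline. The file names, prompt wording, and the `query_lvlm` stub are illustrative assumptions, not the paper's released benchmark code.

```python
import random

# Illustrative test pairs: two rendered images and whether they depict the same 3D shape.
# The paths below are hypothetical placeholders, not files from the actual benchmark.
TEST_PAIRS = [
    {"image_a": "renders/obj_017_pose1_wood.png",
     "image_b": "renders/obj_017_pose2_metal.png",
     "same_shape": True},
    {"image_a": "renders/obj_042_pose1_stone.png",
     "image_b": "renders/obj_103_pose1_stone.png",
     "same_shape": False},
]

PROMPT = ("Do these two images show objects with the exact same 3D shape, "
          "ignoring orientation, material, and texture? Answer 'yes' or 'no'.")

def query_lvlm(prompt, image_paths):
    """Stand-in for a multimodal API call (GPT, Claude, Gemini, LLama, ...).
    Replace with a real request; here it simply guesses at random."""
    return random.choice(["yes", "no"])

def evaluate(pairs):
    correct = 0
    for pair in pairs:
        answer = query_lvlm(PROMPT, [pair["image_a"], pair["image_b"]])
        predicted_same = answer.strip().lower().startswith("yes")
        correct += int(predicted_same == pair["same_shape"])
    return correct / len(pairs)

if __name__ == "__main__":
    # A random guesser lands near 0.5; the paper reports humans well above the tested LVLMs.
    print(f"Shape-matching accuracy: {evaluate(TEST_PAIRS):.2f}")
```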
Related papers
- Common3D: Self-Supervised Learning of 3D Morphable Models for Common Objects in Neural Feature Space [58.623106094568776]
3D morphable models (3DMMs) are a powerful tool to represent the possible shapes and appearances of an object category.
We introduce a new method, Common3D, that learns 3DMMs of common objects in a fully self-supervised manner from a collection of object-centric videos.
Common3D is the first completely self-supervised method that can solve various vision tasks in a zero-shot manner.
arXiv Detail & Related papers (2025-04-30T15:42:23Z) - Shape and Texture Recognition in Large Vision-Language Models [0.5266869303483376]
This dataset is used to evaluate how effectively leading Large Vision-Language Models (LVLMs) understand shapes, textures, and materials in 2D and 3D scenes.
For shape recognition, we test models' ability to match identical shapes that differ in orientation, texture, color, or environment.
For texture and material recognition, we evaluate models' ability to identify identical textures and materials across different objects and environments.
arXiv Detail & Related papers (2025-03-29T12:43:29Z) - Probing the 3D Awareness of Visual Foundation Models [56.68380136809413]
We analyze the 3D awareness of visual foundation models.
We conduct experiments using task-specific probes and zero-shot inference procedures on frozen features.
arXiv Detail & Related papers (2024-04-12T17:58:04Z) - 3D Reconstruction of Objects in Hands without Real World 3D Supervision [12.70221786947807]
We propose modules to leverage 3D supervision to scale up the learning of models for reconstructing hand-held objects.
Specifically, we extract multiview 2D mask supervision from videos and 3D shape priors from shape collections.
We use these indirect 3D cues to train occupancy networks that predict the 3D shape of objects from a single RGB image.
arXiv Detail & Related papers (2023-05-04T17:56:48Z) - 3D Concept Learning and Reasoning from Multi-View Images [96.3088005719963]
We introduce a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA).
This dataset consists of approximately 5k scenes and 600k images, paired with 50k questions.
We propose a novel 3D concept learning and reasoning framework that seamlessly combines neural fields, 2D pre-trained vision-language models, and neural reasoning operators.
arXiv Detail & Related papers (2023-03-20T17:59:49Z) - Disentangled3D: Learning a 3D Generative Model with Disentangled
Geometry and Appearance from Monocular Images [94.49117671450531]
State-of-the-art 3D generative models are GANs which use neural 3D volumetric representations for synthesis.
In this paper, we design a 3D GAN which can learn a disentangled model of objects, just from monocular observations.
arXiv Detail & Related papers (2022-03-29T22:03:18Z) - Neural Articulated Radiance Field [90.91714894044253]
We present Neural Articulated Radiance Field (NARF), a novel deformable 3D representation for articulated objects learned from images.
Experiments show that the proposed method is efficient and can generalize well to novel poses.
arXiv Detail & Related papers (2021-04-07T13:23:14Z) - Cycle-Consistent Generative Rendering for 2D-3D Modality Translation [21.962725416347855]
We learn a module that generates a realistic rendering of a 3D object and infers a realistic 3D shape from an image.
By leveraging generative domain translation methods, we are able to define a learning algorithm that requires only weak supervision, with unpaired data.
The resulting model is able to perform 3D shape, pose, and texture inference from 2D images, but can also generate novel textured 3D shapes and renders.
arXiv Detail & Related papers (2020-11-16T15:23:03Z) - Learning to Reconstruct and Segment 3D Objects [4.709764624933227]
We aim to understand scenes and the objects within them by learning general and robust representations using deep neural networks.
This thesis makes three core contributions, ranging from object-level 3D shape estimation from single or multiple views to scene-level semantic understanding.
arXiv Detail & Related papers (2020-10-19T15:09:04Z) - Weakly Supervised Learning of Multi-Object 3D Scene Decompositions Using
Deep Shape Priors [69.02332607843569]
PriSMONet is a novel approach for learning multi-object 3D scene decomposition and representations from single images.
A recurrent encoder regresses a latent representation of 3D shape, pose and texture of each object from an input RGB image.
We evaluate the accuracy of our model in inferring 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out benefits of the learned representation.
arXiv Detail & Related papers (2020-10-08T14:49:23Z) - GRAB: A Dataset of Whole-Body Human Grasping of Objects [53.00728704389501]
Training computers to understand human grasping requires a rich dataset containing complex 3D object shapes, detailed contact information, hand pose and shape, and the 3D body motion over time.
We collect a new dataset, called GRAB, of whole-body grasps, containing full 3D shape and pose sequences of 10 subjects interacting with 51 everyday objects of varying shape and size.
This is a unique dataset that goes well beyond existing ones for modeling and understanding how humans grasp and manipulate objects, how their full body is involved, and how interaction varies with the task.
arXiv Detail & Related papers (2020-08-25T17:57:55Z) - ROOTS: Object-Centric Representation and Rendering of 3D Scenes [28.24758046060324]
A crucial ability of human intelligence is to build up models of individual 3D objects from partial scene observations.
Recent works achieve object-centric generation but without the ability to infer the representation.
We propose a probabilistic generative model for learning to build modular and compositional 3D object models.
arXiv Detail & Related papers (2020-06-11T00:42:56Z)