The 3D-PC: a benchmark for visual perspective taking in humans and machines
- URL: http://arxiv.org/abs/2406.04138v1
- Date: Thu, 6 Jun 2024 14:59:39 GMT
- Title: The 3D-PC: a benchmark for visual perspective taking in humans and machines
- Authors: Drew Linsley, Peisen Zhou, Alekh Karkada Ashok, Akash Nagaraj, Gaurav Gaonkar, Francis E Lewis, Zygmunt Pizlo, Thomas Serre
- Abstract summary: A growing number of reports have indicated that deep neural networks (DNNs) become capable of analyzing 3D scenes after training on large image datasets.
We investigated whether this emergent ability for 3D analysis in DNNs is sufficient for visual perspective taking (VPT) with the 3D perception challenge (3D-PC).
The 3D-PC comprises three 3D-analysis tasks posed within natural scene images.
- Score: 11.965236208112753
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual perspective taking (VPT) is the ability to perceive and reason about the perspectives of others. It is an essential feature of human intelligence, which develops over the first decade of life and requires an ability to process the 3D structure of visual scenes. A growing number of reports have indicated that deep neural networks (DNNs) become capable of analyzing 3D scenes after training on large image datasets. We investigated whether this emergent ability for 3D analysis in DNNs is sufficient for VPT with the 3D perception challenge (3D-PC): a novel benchmark for 3D perception in humans and DNNs. The 3D-PC comprises three 3D-analysis tasks posed within natural scene images: 1. a simple test of object depth order, 2. a basic VPT task (VPT-basic), and 3. another version of VPT (VPT-Strategy) designed to limit the effectiveness of "shortcut" visual strategies. We tested human participants (N=33) and linearly probed or text-prompted over 300 DNNs on the challenge and found that nearly all of the DNNs approached or exceeded human accuracy in analyzing object depth order. Surprisingly, DNN accuracy on this task correlated with their object recognition performance. In contrast, there was an extraordinary gap between DNNs and humans on VPT-basic. Humans were nearly perfect, whereas most DNNs were near chance. Fine-tuning DNNs on VPT-basic brought them close to human performance, but, unlike humans, they dropped back to chance when tested on VPT-Strategy. Our challenge demonstrates that the training routines and architectures of today's DNNs are well suited for learning basic 3D properties of scenes and objects but are ill suited for reasoning about these properties as humans do. We release our 3D-PC datasets and code to help bridge this gap in 3D perception between humans and machines.
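The linear-probe evaluation mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's code: the features and labels here are random stand-ins for frozen DNN embeddings and binary VPT answers, and the probe is fit by least squares rather than the paper's (unspecified) training recipe.

```python
import numpy as np

# Hypothetical stand-ins: in the real benchmark, features would come from
# a pretrained DNN's frozen representation of each 3D-PC image, and labels
# would be the binary task answers (e.g., which object is closer).
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 64))
labels = rng.integers(0, 2, size=1000)

X_train, y_train = features[:800], labels[:800]
X_test, y_test = features[800:], labels[800:]

# A linear probe: fit a single linear layer on top of frozen features,
# so test accuracy reflects what the representation already encodes.
targets = 2.0 * y_train - 1.0  # map {0, 1} labels to {-1, +1}
w, *_ = np.linalg.lstsq(X_train, targets, rcond=None)
preds = (X_test @ w > 0).astype(int)
accuracy = (preds == y_test).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

Because the stand-in features are random noise, the probe lands near chance (0.5); on features that genuinely encode the task, accuracy would rise well above it, which is the signal the benchmark measures.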
Related papers
- Agent3D-Zero: An Agent for Zero-shot 3D Understanding [79.88440434836673]
Agent3D-Zero is an innovative 3D-aware agent framework addressing 3D scene understanding.
We propose a novel way to make use of a Large Visual Language Model (VLM) via actively selecting and analyzing a series of viewpoints for 3D understanding.
A distinctive advantage of Agent3D-Zero is its novel visual prompts, which significantly enhance the VLM's ability to identify the most informative viewpoints.
arXiv Detail & Related papers (2024-03-18T14:47:03Z)
- SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction [77.15924044466976]
We propose SelfOcc to explore a self-supervised way to learn 3D occupancy using only video sequences.
We first transform the images into the 3D space (e.g., bird's eye view) to obtain 3D representation of the scene.
We can then render 2D images of previous and future frames as self-supervision signals to learn the 3D representations.
arXiv Detail & Related papers (2023-11-21T17:59:14Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- Approaching human 3D shape perception with neurally mappable models [15.090436065092716]
Humans effortlessly infer the 3D shape of objects.
None of the current computational models captures the human ability to match object shape across viewpoints.
This work provides a foundation for understanding human shape inferences within neurally mappable computational architectures.
arXiv Detail & Related papers (2023-08-22T09:29:05Z)
- 3D Concept Learning and Reasoning from Multi-View Images [96.3088005719963]
We introduce a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA)
This dataset consists of approximately 5k scenes, 600k images, paired with 50k questions.
We propose a novel 3D concept learning and reasoning framework that seamlessly combines neural fields, 2D pre-trained vision-language models, and neural reasoning operators.
arXiv Detail & Related papers (2023-03-20T17:59:49Z)
- Learning to Estimate 3D Human Pose from Point Cloud [13.27496851711973]
We propose a deep human pose network for 3D pose estimation by taking the point cloud data as input data to model the surface of complex human structures.
Our experiments on two public datasets show that our approach achieves higher accuracy than previous state-of-the-art methods.
arXiv Detail & Related papers (2022-12-25T14:22:01Z)
- Harmonizing the object recognition strategies of deep neural networks with humans [10.495114898741205]
We show that state-of-the-art deep neural networks (DNNs) are becoming less aligned with humans as their accuracy improves.
Our work represents the first demonstration that the scaling laws that are guiding the design of DNNs today have also produced worse models of human vision.
arXiv Detail & Related papers (2022-11-08T20:03:49Z)
- Super Images -- A New 2D Perspective on 3D Medical Imaging Analysis [0.0]
We present a simple yet effective 2D method to handle 3D data while efficiently embedding the 3D knowledge during training.
Our method generates a super image by stitching the slices of the 3D image side by side.
While attaining results equal, if not superior, to those of 3D networks using only their 2D counterparts, model complexity is reduced by around threefold.
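The slice-stitching idea described above can be sketched in a few lines. This is a hypothetical illustration of the general technique, not the paper's exact layout or implementation:

```python
import numpy as np

def to_super_image(volume, grid_cols):
    """Stitch the depth slices of a 3D volume side by side on a 2D grid,
    producing one large 2D "super image" that an ordinary 2D network can
    process. (A generic sketch of the idea; the paper's exact grid layout
    and padding scheme may differ.)"""
    depth, height, width = volume.shape
    grid_rows = -(-depth // grid_cols)  # ceiling division
    # Zero-pad so the slice count fills the grid exactly.
    padded = np.zeros((grid_rows * grid_cols, height, width),
                      dtype=volume.dtype)
    padded[:depth] = volume
    # Rearrange the slice stack into a grid_rows x grid_cols canvas.
    canvas = (padded
              .reshape(grid_rows, grid_cols, height, width)
              .transpose(0, 2, 1, 3)
              .reshape(grid_rows * height, grid_cols * width))
    return canvas

# A toy 16-slice volume of 32x32 slices becomes one 128x128 image.
volume = np.arange(16 * 32 * 32, dtype=np.float32).reshape(16, 32, 32)
super_img = to_super_image(volume, grid_cols=4)
print(super_img.shape)  # (128, 128): a 4x4 grid of 32x32 slices
```

The resulting 2D image preserves every voxel, so a 2D CNN can in principle recover cross-slice structure from the spatial arrangement alone.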
arXiv Detail & Related papers (2022-05-05T09:59:03Z)
- PONet: Robust 3D Human Pose Estimation via Learning Orientations Only [116.1502793612437]
We propose a novel Pose Orientation Net (PONet) that is able to robustly estimate 3D pose by learning orientations only.
PONet estimates the 3D orientation of these limbs by taking advantage of the local image evidence to recover the 3D pose.
We evaluate our method on multiple datasets, including Human3.6M, MPII, MPI-INF-3DHP, and 3DPW.
arXiv Detail & Related papers (2021-12-21T12:48:48Z)
- SPARE3D: A Dataset for SPAtial REasoning on Three-View Line Drawings [9.651400924429336]
We present the SPARE3D dataset. Based on cognitive science and psychometrics, SPARE3D contains three types of 2D-3D reasoning tasks on view consistency, camera pose, and shape generation.
We then design a method to automatically generate a large number of challenging questions with ground truth answers for each task.
Experiments show that although convolutional networks have achieved superhuman performance in many visual learning tasks, their spatial reasoning performance on SPARE3D tasks is either lower than average human performance or close to random guessing.
arXiv Detail & Related papers (2020-03-31T09:01:27Z)
- 2.75D: Boosting learning by representing 3D Medical imaging to 2D features for small data [54.223614679807994]
3D convolutional neural networks (CNNs) have started to show superior performance to 2D CNNs in numerous deep learning tasks.
Applying transfer learning on 3D CNN is challenging due to a lack of publicly available pre-trained 3D models.
In this work, we propose a novel strategic 2D representation of volumetric data, termed 2.75D.
As a result, 2D CNN networks can also be used to learn volumetric information.
arXiv Detail & Related papers (2020-02-11T08:24:19Z)
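The summary above does not spell out how 2.75D builds its 2D representation, so the sketch below uses a generic substitute: max-intensity projections along each axis, stacked as channels (a 2.5D-style construction, named here only for illustration and not the paper's actual 2.75D method). It shows the general idea of collapsing volumetric data into a 2D input that a pretrained 2D CNN can consume:

```python
import numpy as np

def orthogonal_projections(volume):
    """Collapse a 3D volume into a 3-channel 2D image via max-intensity
    projections along each axis. A generic 2.5D-style stand-in for the
    paper's 2.75D representation, shown only to illustrate the concept."""
    side = max(volume.shape)
    channels = []
    for axis in range(3):
        proj = volume.max(axis=axis)
        # Zero-pad each projection to a common square size so the
        # channels can be stacked into one image.
        pad_rows = side - proj.shape[0]
        pad_cols = side - proj.shape[1]
        channels.append(np.pad(proj, ((0, pad_rows), (0, pad_cols))))
    return np.stack(channels, axis=-1)  # (side, side, 3): 2D-CNN ready

# A toy 20x32x32 volume becomes a single 32x32x3 image.
volume = np.random.default_rng(0).random((20, 32, 32))
image = orthogonal_projections(volume)
print(image.shape)  # (32, 32, 3)
```

Because the output matches the 3-channel shape 2D CNNs expect, ImageNet-pretrained weights can be transferred directly, which is the motivation the summary gives for avoiding scarce pretrained 3D models.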
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.