SPARE3D: A Dataset for SPAtial REasoning on Three-View Line Drawings
- URL: http://arxiv.org/abs/2003.14034v2
- Date: Wed, 2 Sep 2020 14:18:47 GMT
- Title: SPARE3D: A Dataset for SPAtial REasoning on Three-View Line Drawings
- Authors: Wenyu Han, Siyuan Xiang, Chenhui Liu, Ruoyu Wang, Chen Feng
- Abstract summary: We present the SPARE3D dataset. Based on cognitive science and psychometrics, SPARE3D contains three types of 2D-3D reasoning tasks on view consistency, camera pose, and shape generation.
We then design a method to automatically generate a large number of challenging questions with ground truth answers for each task.
Experiments show that although convolutional networks have achieved superhuman performance in many visual learning tasks, their spatial reasoning performance on SPARE3D tasks is either lower than average human performance or even close to random guesses.
- Score: 9.651400924429336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatial reasoning is an important component of human intelligence. We can
imagine the shapes of 3D objects and reason about their spatial relations by
merely looking at their three-view line drawings in 2D, with different levels
of competence. Can deep networks be trained to perform spatial reasoning tasks?
How can we measure their "spatial intelligence"? To answer these questions, we
present the SPARE3D dataset. Based on cognitive science and psychometrics,
SPARE3D contains three types of 2D-3D reasoning tasks on view consistency,
camera pose, and shape generation, with increasing difficulty. We then design a
method to automatically generate a large number of challenging questions with
ground truth answers for each task. They are used to provide supervision for
training our baseline models using state-of-the-art architectures like ResNet.
Our experiments show that although convolutional networks have achieved
superhuman performance in many visual learning tasks, their spatial reasoning
performance on SPARE3D tasks is either lower than average human performance or
even close to random guesses. We hope SPARE3D can stimulate new problem
formulations and network designs for spatial reasoning to empower intelligent
robots to operate effectively in the 3D world via 2D sensors. The dataset and
code are available at https://ai4ce.github.io/SPARE3D.
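The baselines are described above only at a high level (ResNet-style encoders trained on the automatically generated questions and ground-truth answers). Below is a minimal, illustrative sketch of how such a baseline might score candidate answers for a view-consistency-style question; it is an assumption on our part rather than the authors' implementation, and names such as `ThreeViewScorer` are hypothetical (the official code is linked above).

```python
# Minimal sketch (not the authors' code): a ResNet-18 scorer that rates how
# well a candidate line drawing matches three given orthographic views.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ThreeViewScorer(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # Assume single-channel line drawings: 3 views + 1 candidate are
        # stacked along the channel dimension, so the stem takes 4 channels.
        backbone.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        backbone.fc = nn.Linear(backbone.fc.in_features, 1)  # one matching logit
        self.backbone = backbone

    def forward(self, views, candidate):
        # views:     (B, 3, H, W) front/top/side drawings
        # candidate: (B, 1, H, W) one candidate answer drawing
        x = torch.cat([views, candidate], dim=1)  # (B, 4, H, W)
        return self.backbone(x).squeeze(-1)       # (B,) matching scores

if __name__ == "__main__":
    model = ThreeViewScorer()
    views = torch.rand(2, 3, 200, 200)            # 2 questions
    options = torch.rand(2, 4, 1, 200, 200)       # 4 candidate answers each
    scores = torch.stack([model(views, options[:, k]) for k in range(4)], dim=1)
    print(scores.argmax(dim=1))                   # predicted answer indices
```

If the view-consistency and camera-pose tasks are posed as multiple choice (as the automatically generated questions with ground-truth answers suggest), such a scorer could be trained with a cross-entropy loss over the option scores; the shape-generation task would instead require a generative decoder.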
Related papers
- Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting [64.64738535860351]
We present a scalable pipeline that converts single-view images into comprehensive, scale- and appearance-realistic 3D representations.
Our method bridges the gap between the vast repository of imagery and the increasing demand for spatial scene understanding.
By automatically generating authentic, scale-aware 3D data from images, we significantly reduce data collection costs and open new avenues for advancing spatial intelligence.
arXiv Detail & Related papers (2025-07-24T14:53:26Z) - SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes [105.8644620467576]
We introduce Surprise3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes.
Surprise3D consists of more than 200k vision-language pairs across 900+ detailed indoor scenes from ScanNet++ v2.
The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object names.
arXiv Detail & Related papers (2025-07-10T14:01:24Z) - 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark [17.94511890272007]
3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects within the 3D space.
Large multi-modal models (LMMs) have achieved remarkable progress in a wide range of image and video understanding tasks.
We present the first comprehensive 3D spatial reasoning benchmark, 3DSRBench, with 2,772 manually annotated visual question-answer pairs.
arXiv Detail & Related papers (2024-12-10T18:55:23Z) - Multimodal 3D Reasoning Segmentation with Complex Scenes [92.92045550692765]
We bridge the research gaps by proposing a 3D reasoning segmentation task for multiple objects in scenes.
We create ReasonSeg3D, a benchmark that integrates 3D segmentation masks and 3D spatial relations with generated question-answer pairs.
In addition, we design MORE3D, a novel 3D reasoning network that works with queries of multiple objects.
arXiv Detail & Related papers (2024-11-21T08:22:45Z) - LLMI3D: Empowering LLM with 3D Perception from a Single 2D Image [72.14973729674995]
Current 3D perception methods, particularly small models, struggle with logical reasoning, question answering, and handling open scenario categories.
We propose solutions: Spatial-Enhanced Local Feature Mining for better spatial feature extraction, 3D Query Token-Derived Info Decoding for precise geometric regression, and Geometry Projection-Based 3D Reasoning for handling camera focal length variations.
arXiv Detail & Related papers (2024-08-14T10:00:16Z) - The 3D-PC: a benchmark for visual perspective taking in humans and machines [11.965236208112753]
A growing number of reports have indicated that deep neural networks (DNNs) become capable of analyzing 3D scenes after training on large image datasets.
We investigated whether this emergent ability for 3D analysis in DNNs is sufficient for visual perspective taking (VPT) using the 3D perception challenge (3D-PC).
The 3D-PC comprises three 3D-analysis tasks posed within natural scene images.
arXiv Detail & Related papers (2024-06-06T14:59:39Z) - SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z) - PonderV2: Pave the Way for 3D Foundation Model with A Universal
Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z) - On the Efficacy of 3D Point Cloud Reinforcement Learning [20.4424883945357]
We focus on 3D point clouds, one of the most common forms of 3D representations.
We systematically investigate design choices for 3D point cloud RL, leading to the development of a robust algorithm for various robotic manipulation and control tasks.
We find that 3D point cloud RL can significantly outperform the 2D counterpart when agent-object / object-object relationship encoding is a key factor.
arXiv Detail & Related papers (2023-06-11T22:52:08Z) - CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn transferable 3D point cloud representations in realistic scenarios.
Specifically, we exploit naturally existing correspondences between 2D and 3D scenarios, and build well-aligned, instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z) - Deep Generative Models on 3D Representations: A Survey [81.73385191402419]
Generative models aim to learn the distribution of observed data by generating new instances.
Recently, researchers started to shift focus from 2D to 3D space.
However, representing 3D data poses significantly greater challenges.
arXiv Detail & Related papers (2022-10-27T17:59:50Z) - Decanus to Legatus: Synthetic training for 2D-3D human pose lifting [26.108023246654646]
We propose an algorithm to generate an unlimited number of synthetic 3D human poses (Legatus) from a 3D pose distribution based on 10 initial handcrafted 3D poses (Decanus).
Our results show that we can achieve 3D pose estimation performance comparable to methods using real data from specialized datasets but in a zero-shot setup, showing the potential of our framework.
arXiv Detail & Related papers (2022-10-05T13:10:19Z) - Super Images -- A New 2D Perspective on 3D Medical Imaging Analysis [0.0]
We present a simple yet effective 2D method to handle 3D data while efficiently embedding the 3D knowledge during training.
Our method generates a 2D "super image" by stitching the slices of the 3D volume side by side (a minimal stitching sketch is given after this list).
Using only 2D counterparts, it attains results equal, if not superior, to those of 3D networks while reducing model complexity by roughly a factor of three.
arXiv Detail & Related papers (2022-05-05T09:59:03Z) - Interactive Annotation of 3D Object Geometry using 2D Scribbles [84.51514043814066]
In this paper, we propose an interactive framework for annotating 3D object geometry from point cloud data and RGB imagery.
Our framework targets naive users without artistic or graphics expertise.
arXiv Detail & Related papers (2020-08-24T21:51:29Z) - 3D Self-Supervised Methods for Medical Imaging [7.65168530693281]
We propose 3D versions for five different self-supervised methods, in the form of proxy tasks.
Our methods facilitate neural network feature learning from unlabeled 3D images, aiming to reduce the required cost for expert annotation.
The developed algorithms are 3D Contrastive Predictive Coding, 3D Rotation prediction, 3D Jigsaw puzzles, Relative 3D patch location, and 3D Exemplar networks.
arXiv Detail & Related papers (2020-06-06T09:56:58Z) - CRAVES: Controlling Robotic Arm with a Vision-based Economic System [96.56564257199474]
Training a robotic arm to accomplish real-world tasks has been attracting increasing attention in both academia and industry.
This work discusses the role of computer vision algorithms in this field.
We present an alternative solution, which uses a 3D model to create a large amount of synthetic data.
arXiv Detail & Related papers (2018-12-03T13:28:29Z)
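As a concrete illustration of one mechanism summarized above, the "Super Images" entry describes stitching the slices of a 3D volume side by side so that plain 2D networks can be applied. The sketch below shows one straightforward way such stitching could be implemented; it is an assumption on our part, not the authors' code.

```python
# Illustrative sketch (not the authors' code): tile the depth slices of a
# (D, H, W) volume into a single 2D "super image" for a 2D network.
import numpy as np

def to_super_image(volume: np.ndarray, cols: int) -> np.ndarray:
    d, h, w = volume.shape
    rows = int(np.ceil(d / cols))
    # Pad with empty slices so the grid is completely filled.
    padded = np.zeros((rows * cols, h, w), dtype=volume.dtype)
    padded[:d] = volume
    # (rows*cols, H, W) -> (rows, cols, H, W) -> (rows*H, cols*W)
    grid = padded.reshape(rows, cols, h, w)
    return grid.transpose(0, 2, 1, 3).reshape(rows * h, cols * w)

if __name__ == "__main__":
    vol = np.random.rand(30, 64, 64)            # e.g. a small CT sub-volume
    print(to_super_image(vol, cols=6).shape)    # (320, 384): 5 rows x 6 cols
```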
This list is automatically generated from the titles and abstracts of the papers on this site.