Related papers: Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention

URL: http://arxiv.org/abs/2405.18295v2
Date: Sat, 6 Jul 2024 15:23:45 GMT
Title: Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention
Authors: Weitai Kang, Mengxue Qu, Jyoti Kini, Yunchao Wei, Mubarak Shah, Yan Yan
Abstract summary: 3D intention grounding is a new task in 3D object detection employing RGB-D, based on human intention, such as "I want something to support my back". We introduce the new Intent3D dataset, consisting of 44,990 intention texts associated with 209 fine-grained classes from 1,042 scenes of the ScanNet dataset. We also propose IntentNet, our unique approach, designed to tackle this intention-based detection problem.
Score: 86.39271731460927
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In real-life scenarios, humans seek out objects in the 3D world to fulfill their daily needs or intentions. This inspires us to introduce 3D intention grounding, a new task in 3D object detection employing RGB-D, based on human intention, such as "I want something to support my back". Closely related, 3D visual grounding focuses on understanding human reference. To achieve detection based on human intention, it relies on humans to observe the scene, reason out the target that aligns with their intention ("pillow" in this case), and finally provide a reference to the AI system, such as "A pillow on the couch". Instead, 3D intention grounding challenges AI agents to automatically observe, reason and detect the desired target solely based on human intention. To tackle this challenge, we introduce the new Intent3D dataset, consisting of 44,990 intention texts associated with 209 fine-grained classes from 1,042 scenes of the ScanNet dataset. We also establish several baselines based on different language-based 3D object detection models on our benchmark. Finally, we propose IntentNet, our unique approach, designed to tackle this intention-based detection problem. It focuses on three key aspects: intention understanding, reasoning to identify object candidates, and cascaded adaptive learning that leverages the intrinsic priority logic of different losses for multiple objective optimization.

Related papers

SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes [105.8644620467576]
We introduce SURPRISE3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. SURPRISE3D consists of more than 200k vision-language pairs across 900+ detailed indoor scenes from ScanNet++ v2. The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object names.
arXiv Detail & Related papers (2025-07-10T14:01:24Z)
Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions [43.784500616459304]
We introduce a novel task that grounds 3D object affordance based on language instructions, visual observations and interactions. We propose LMAffordance3D, the first multi-modal, language-guided 3D affordance grounding network.
arXiv Detail & Related papers (2025-04-07T05:38:23Z)
Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene which is referred by a description in natural language. We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding [67.21613160846299]
Embodied Reference Understanding (ERU) is first designed for this concern. A new dataset called ScanERU is constructed to evaluate the effectiveness of this idea.
arXiv Detail & Related papers (2023-03-23T11:36:14Z)
Gait Recognition in the Wild with Dense 3D Representations and A Benchmark [86.68648536257588]
Existing studies for gait recognition are dominated by 2D representations like the silhouette or skeleton of the human body in constrained scenes. This paper aims to explore dense 3D representations for gait recognition in the wild. We build the first large-scale 3D representation-based gait recognition dataset, named Gait3D.
arXiv Detail & Related papers (2022-04-06T03:54:06Z)
RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection [138.2892824662943]
A promising solution is to make better use of the synthetic dataset, which consists of CAD object models, to boost the learning on real datasets. Recent work on 3D pre-training exhibits failure when transferring features learned on synthetic objects to other real-world applications. In this work, we put forward a new method called RandomRooms to accomplish this objective.
arXiv Detail & Related papers (2021-08-17T17:56:12Z)
LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem. We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
Seeing by haptic glance: reinforcement learning-based 3D object Recognition [31.80213713136647]
Humans are able to conduct 3D recognition by a limited number of haptic contacts between the target object and their fingers without seeing the object. This capability is defined as 'haptic glance' in cognitive neuroscience. Most of the existing 3D recognition models were developed based on dense 3D data. In many real-life use cases, where robots are used to collect 3D data by haptic exploration, only a limited number of 3D points could be collected. A novel reinforcement learning-based framework is proposed, where the haptic exploration procedure is optimized simultaneously with the objective 3D recognition with actively collected 3D
arXiv Detail & Related papers (2021-02-15T15:38:22Z)
Ground-aware Monocular 3D Object Detection for Autonomous Driving [6.5702792909006735]
Estimating the 3D position and orientation of objects in the environment with a single RGB camera is a challenging task for low-cost urban autonomous driving and mobile robots. Most of the existing algorithms are based on the geometric constraints in 2D-3D correspondence, which stems from generic 6D object pose estimation. We introduce a novel neural network module to fully utilize such application-specific priors in the framework of deep learning.
arXiv Detail & Related papers (2021-02-01T08:18:24Z)
PLUME: Efficient 3D Object Detection from Stereo Images [95.31278688164646]
Existing methods tackle the problem in two steps: first depth estimation is performed, a pseudo LiDAR point cloud representation is computed from the depth estimates, and then object detection is performed in 3D space. We propose a model that unifies these two tasks in the same metric space. Our approach achieves state-of-the-art performance on the challenging KITTI benchmark, with significantly reduced inference time compared with existing methods.
arXiv Detail & Related papers (2021-01-17T05:11:38Z)
SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation [3.1542695050861544]
Estimating 3D orientation and translation of objects is essential for infrastructure-less autonomous navigation and driving. We propose a novel 3D object detection method, named SMOKE, that combines a single keypoint estimate with regressed 3D variables. Despite its structural simplicity, our proposed SMOKE network outperforms all existing monocular 3D detection methods on the KITTI dataset.
arXiv Detail & Related papers (2020-02-24T08:15:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.