EgoLoc: Revisiting 3D Object Localization from Egocentric Videos with
Visual Queries
- URL: http://arxiv.org/abs/2212.06969v2
- Date: Mon, 28 Aug 2023 12:51:20 GMT
- Title: EgoLoc: Revisiting 3D Object Localization from Egocentric Videos with
Visual Queries
- Authors: Jinjie Mai, Abdullah Hamdi, Silvio Giancola, Chen Zhao, Bernard Ghanem
- Abstract summary: We formalize a pipeline that better entangles 3D multiview geometry with 2D object retrieval from egocentric videos.
Specifically, our approach achieves an overall success rate of up to 87.12%, which sets a new state-of-the-art result in the VQ3D task.
- Score: 68.75400888770793
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the recent advances in video and 3D understanding, novel 4D
spatio-temporal methods fusing both concepts have emerged. Towards this
direction, the Ego4D Episodic Memory Benchmark proposed a task for Visual
Queries with 3D Localization (VQ3D). Given an egocentric video clip and an
image crop depicting a query object, the goal is to localize the 3D position of
the center of that query object with respect to the camera pose of a query
frame. Current methods tackle the problem of VQ3D by unprojecting the 2D
localization results of the sibling task Visual Queries with 2D Localization
(VQ2D) into 3D predictions. Yet, we point out that the small number of camera
poses recovered by the camera re-localization stage of previous VQ3D methods
severely hinders their overall success rate. In this work, we formalize a
pipeline (we
dub EgoLoc) that better entangles 3D multiview geometry with 2D object
retrieval from egocentric videos. Our approach involves estimating more robust
camera poses and aggregating multi-view 3D displacements by leveraging the 2D
detection confidence, which enhances the success rate of object queries and
leads to a significant improvement in the VQ3D baseline performance.
Specifically, our approach achieves an overall success rate of up to 87.12%,
which sets a new state-of-the-art result in the VQ3D task. We provide a
comprehensive empirical analysis of the VQ3D task and existing solutions, and
highlight the remaining challenges in VQ3D. The code is available at
https://github.com/Wayne-Mai/EgoLoc.
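The abstract outlines two geometric steps: unprojecting 2D detections of the
query object into 3D using per-frame camera poses, and fusing the resulting
multi-view estimates with weights derived from the 2D detection confidence.
The sketch below illustrates that idea with NumPy; the function names, the
detection dictionary layout, and the simple weighted-mean fusion are
illustrative assumptions, not EgoLoc's actual implementation (see the
repository above for the real pipeline).

```python
import numpy as np

def unproject_detection(center_uv, depth, K, T_world_from_cam):
    """Lift a 2D detection center (pixels) to a 3D point in world coordinates.
    Names and the depth source are assumptions made for illustration."""
    u, v = center_uv
    # Pinhole back-projection: depth * K^-1 [u, v, 1]^T gives the point in camera coords
    xyz_cam = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    # Homogeneous transform from camera coordinates to world coordinates
    xyz_world = T_world_from_cam @ np.append(xyz_cam, 1.0)
    return xyz_world[:3]

def aggregate_query_displacement(detections, T_query_from_world):
    """Fuse multi-view 3D estimates into one displacement w.r.t. the
    query-frame camera, weighting each view by its 2D detector confidence."""
    points, weights = [], []
    for det in detections:  # each det: {"xyz_world": (3,) array, "score": float}
        p_query = T_query_from_world @ np.append(det["xyz_world"], 1.0)
        points.append(p_query[:3])
        weights.append(det["score"])
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    # Confidence-weighted mean of the per-view object-center estimates
    return (np.asarray(points) * weights[:, None]).sum(axis=0)
```

In this simplified form, low-confidence detections contribute less to the fused
object center, which mirrors the confidence-based aggregation the abstract
credits for the improved query success rate.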
Related papers
- Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion
Approach for 3D VQA [6.697298321551588]
In 3D Visual Question Answering (3D VQA), the scarcity of fully annotated data and the limited diversity of visual content hamper generalization to novel scenes and 3D concepts.
We propose a question-conditional 2D view selection procedure that pinpoints semantically relevant 2D inputs for crucial visual clues.
We then integrate this 2D knowledge into the 3D-VQA system via a two-branch Transformer structure.
arXiv Detail & Related papers (2024-02-24T23:31:34Z)
- 3D-Aware Visual Question Answering about Parts, Poses and Occlusions [20.83938624671415]
We introduce the task of 3D-aware VQA, which focuses on challenging questions that require a compositional reasoning over the 3D structure of visual scenes.
We propose PO3D-VQA, a 3D-aware VQA model that marries two powerful ideas: probabilistic neural symbolic program execution for reasoning and deep neural networks with 3D generative representations of objects for robust visual recognition.
Our experimental results show that PO3D-VQA significantly outperforms existing methods, yet a considerable performance gap remains compared to 2D VQA benchmarks.
arXiv Detail & Related papers (2023-10-27T06:15:30Z)
- EgoCOL: Egocentric Camera pose estimation for Open-world 3D object
Localization @Ego4D challenge 2023 [9.202585784962276]
We present EgoCOL, an egocentric camera pose estimation method for open-world 3D object localization.
Our method leverages sparse camera pose reconstructions in a two-fold manner, video and scan independently, to estimate the camera pose of egocentric frames in 3D renders with high recall and precision.
arXiv Detail & Related papers (2023-06-29T00:17:23Z)
- Multi-CLIP: Contrastive Vision-Language Pre-training for Question
Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
- Neural Voting Field for Camera-Space 3D Hand Pose Estimation [106.34750803910714]
We present a unified framework for camera-space 3D hand pose estimation from a single RGB image based on 3D implicit representation.
We propose a novel unified 3D dense regression scheme to estimate camera-space 3D hand pose via dense 3D point-wise voting in camera frustum.
arXiv Detail & Related papers (2023-05-07T16:51:34Z)
- Tracking by 3D Model Estimation of Unknown Objects in Videos [122.56499878291916]
We argue that the common 2D bounding-box or segmentation representation is limited and instead propose to guide and improve 2D tracking with an explicit object representation.
Our representation tackles a complex long-term dense correspondence problem between all 3D points on the object for all video frames.
The proposed optimization minimizes a novel loss function to estimate the best 3D shape, texture, and 6DoF pose.
arXiv Detail & Related papers (2023-04-13T11:32:36Z)
- Estimating more camera poses for ego-centric videos is essential for
VQ3D [70.78927854445615]
In this work, we develop a new pipeline for the challenging problem of egocentric video camera pose estimation.
We get the top-1 overall success rate of 25.8% on VQ3D leaderboard, which is two times better than the 8.7% reported by the baseline.
arXiv Detail & Related papers (2022-11-18T15:16:49Z)
- 3D Human Pose Estimation in Multi-View Operating Room Videos Using
Differentiable Camera Projections [2.486571221735935]
We propose to directly optimise for localisation in 3D by training 2D CNNs end-to-end based on a 3D loss.
Using videos from the MVOR dataset, we show that this end-to-end approach outperforms optimisation in 2D space.
arXiv Detail & Related papers (2022-10-21T09:00:02Z)
- Monocular 3D Object Detection with Depth from Motion [74.29588921594853]
We take advantage of camera ego-motion for accurate object depth estimation and detection.
Our framework, named Depth from Motion (DfM), then uses the established geometry to lift 2D image features to the 3D space and detects 3D objects thereon.
Our framework outperforms state-of-the-art methods by a large margin on the KITTI benchmark.
arXiv Detail & Related papers (2022-07-26T15:48:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.