ObjCAViT: Improving Monocular Depth Estimation Using Natural Language Models And Image-Object Cross-Attention
- URL: http://arxiv.org/abs/2211.17232v1
- Date: Wed, 30 Nov 2022 18:32:06 GMT
- Title: ObjCAViT: Improving Monocular Depth Estimation Using Natural Language Models And Image-Object Cross-Attention
- Authors: Dylan Auty and Krystian Mikolajczyk
- Abstract summary: Monocular depth estimation (MDE) is difficult due to the ambiguity that results from the compression of a 3D scene into only 2 dimensions.
Humans and animals have been shown to use higher-level information to solve MDE.
We present a novel method to enhance MDE performance by encouraging the use of known-useful information about the semantics of objects and inter-object relationships within a scene.
- Score: 22.539300644593936
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While monocular depth estimation (MDE) is an important problem in computer
vision, it is difficult due to the ambiguity that results from the compression
of a 3D scene into only 2 dimensions. It is common practice in the field to
treat it as simple image-to-image translation, without consideration for the
semantics of the scene and the objects within it. In contrast, humans and
animals have been shown to use higher-level information to solve MDE: prior
knowledge of the nature of the objects in the scene, their positions and likely
configurations relative to one another, and their apparent sizes have all been
shown to help resolve this ambiguity.
In this paper, we present a novel method to enhance MDE performance by
encouraging the use of known-useful information about the semantics of objects and
inter-object relationships within a scene. Our novel ObjCAViT module sources
world-knowledge from language models and learns inter-object relationships in
the context of the MDE problem using transformer attention, incorporating
apparent size information. Our method produces highly accurate depth maps, and
we obtain competitive results on the NYUv2 and KITTI datasets. Our ablation
experiments show that the use of language and cross-attention within the
ObjCAViT module increases performance. Code is released at
https://github.com/DylanAuty/ObjCAViT.
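To make the abstract's description concrete, the following is a minimal, hypothetical PyTorch sketch of image-object cross-attention: object tokens are built from language-model embeddings of detected object labels, augmented with a learned encoding of apparent size, and fused with ViT image patch tokens via transformer cross-attention before depth decoding. This is not the authors' released implementation (see the repository above for that); the module names, dimensions, and size-encoding scheme here are assumptions.

```python
# Hedged sketch of the idea described in the abstract, NOT the authors' code.
# Object tokens = language-model embeddings of object labels + a learned
# projection of each object's apparent size; image patch tokens then attend
# to these object tokens via cross-attention.
import torch
import torch.nn as nn

class ImageObjectCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.norm_img = nn.LayerNorm(dim)
        self.norm_obj = nn.LayerNorm(dim)
        # Queries come from image tokens; keys/values from object tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Hypothetical size encoding: project a scalar apparent size
        # (e.g. normalised box area) into the token dimension.
        self.size_embed = nn.Linear(1, dim)

    def forward(self, img_tokens, obj_text_emb, obj_sizes):
        # img_tokens:   (B, N_patches, dim) features from a ViT backbone
        # obj_text_emb: (B, N_obj, dim)     language-model embeddings of labels
        # obj_sizes:    (B, N_obj, 1)       apparent size of each object
        obj_tokens = self.norm_obj(obj_text_emb + self.size_embed(obj_sizes))
        attended, _ = self.cross_attn(
            query=self.norm_img(img_tokens), key=obj_tokens, value=obj_tokens
        )
        x = img_tokens + attended              # residual connection
        return x + self.mlp(self.norm_img(x))  # position-wise feed-forward

# Usage sketch: fuse object knowledge into image features before a depth head.
block = ImageObjectCrossAttention()
img = torch.randn(2, 196, 256)  # e.g. a 14x14 ViT patch grid
txt = torch.randn(2, 5, 256)    # embeddings of 5 detected object labels
sz = torch.rand(2, 5, 1)        # normalised apparent sizes
fused = block(img, txt, sz)     # (2, 196, 256), ready for a depth decoder
```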
Related papers
- Interpretable Action Recognition on Hard to Classify Actions [11.641926922266347]
Humans recognise complex activities in video by recognising critical spatio-temporal relations among explicitly recognised objects and parts.
To mimic this, we build on a model which uses the positions of objects and hands, and their motions, to recognise the activity taking place.
To improve this model, we focused on the three classes it confuses most and identified that the lack of 3D information was the major problem.
A state-of-the-art object detection model was fine-tuned to determine the difference between "Container" and "NotContainer" in order to integrate object shape information into the existing object features.
arXiv Detail & Related papers (2024-09-19T21:23:44Z)
- Which One? Leveraging Context Between Objects and Multiple Views for Language Grounding [77.26626173589746]
We present the Multi-view Approach to Grounding in Context (MAGiC), which selects an object referent based on language that distinguishes between two similar objects.
It improves over the state-of-the-art model on the SNARE object reference task with a relative error reduction of 12.9%.
arXiv Detail & Related papers (2023-11-12T00:21:58Z)
- 3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding [58.924180772480504]
3D visual grounding aims to localize the target object in a 3D point cloud based on a free-form language description.
We propose a relation-aware one-stage framework, named 3D Relative Position-aware Network (3DRP-Net).
arXiv Detail & Related papers (2023-07-25T09:33:25Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- Context-aware 6D Pose Estimation of Known Objects using RGB-D data [3.48122098223937]
6D object pose estimation is a long-standing research topic in computer vision and robotics.
We present an architecture that, unlike prior work, is context-aware.
Our experiments show an accuracy improvement of about 3.2% on the LineMOD dataset.
arXiv Detail & Related papers (2022-12-11T18:01:01Z)
- Language Conditioned Spatial Relation Reasoning for 3D Object Grounding [87.03299519917019]
Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations.
We propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.
arXiv Detail & Related papers (2022-11-17T16:42:39Z)
- Monocular Depth Estimation Using Cues Inspired by Biological Vision Systems [22.539300644593936]
Monocular depth estimation (MDE) aims to transform an RGB image of a scene into a pixelwise depth map from the same camera view.
Part of the MDE task is to learn which visual cues in the image can be used for depth estimation, and how.
We demonstrate that explicitly injecting visual cue information into the model is beneficial for depth estimation.
arXiv Detail & Related papers (2022-04-21T19:42:36Z)
- Exploiting Scene Graphs for Human-Object Interaction Detection [81.49184987430333]
Human-Object Interaction (HOI) detection is a fundamental visual task aiming at localizing and recognizing interactions between humans and objects.
We propose SG2HOI, a novel method that exploits scene-graph information for the Human-Object Interaction detection task.
Our method incorporates the SG information in two ways: (1) we embed a scene graph into a global context clue, serving as the scene-specific environmental context; and (2) we build a relation-aware message-passing module to gather relationships from objects' neighbourhoods and transfer them into interactions (a sketch of this kind of message passing appears after this list).
arXiv Detail & Related papers (2021-08-19T09:40:50Z)
- Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection [86.25022248968908]
We learn context- and depth-aware feature representations for monocular 3D object detection.
We show state-of-the-art results among the monocular-based approaches on the KITTI benchmark dataset.
arXiv Detail & Related papers (2021-03-30T16:20:24Z)
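As referenced in the SG2HOI entry above, relation-aware message passing gathers relationship features from each object's scene-graph neighbourhood. Below is a minimal, hypothetical PyTorch sketch of that general technique; it is an illustration only, not the SG2HOI authors' implementation, and all names and dimensions are assumptions.

```python
# Hedged sketch of generic relation-aware message passing, NOT SG2HOI's code.
import torch
import torch.nn as nn

class RelationMessagePassing(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        # A message combines a neighbour's features with the relation features.
        self.message = nn.Linear(2 * dim, dim)
        self.update = nn.GRUCell(dim, dim)  # fuse aggregated messages into the node

    def forward(self, node_feats, edge_index, edge_feats):
        # node_feats: (N, dim) object features
        # edge_index: (2, E)   source/target indices of scene-graph edges
        # edge_feats: (E, dim) relation (predicate) features
        src, dst = edge_index
        msgs = self.message(torch.cat([node_feats[src], edge_feats], dim=-1))
        # Mean-aggregate incoming messages per target node.
        agg = torch.zeros_like(node_feats).index_add_(0, dst, msgs)
        count = torch.zeros(node_feats.size(0), 1).index_add_(
            0, dst, torch.ones(dst.size(0), 1)
        ).clamp(min=1)
        return self.update(agg / count, node_feats)

# Usage sketch: 4 objects connected by 3 directed relation edges.
mp = RelationMessagePassing()
nodes = torch.randn(4, 128)
edges = torch.tensor([[0, 1, 2], [1, 2, 3]])  # src -> dst
rels = torch.randn(3, 128)
updated = mp(nodes, edges, rels)              # (4, 128)
```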
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.