Which One? Leveraging Context Between Objects and Multiple Views for Language Grounding
- URL: http://arxiv.org/abs/2311.06694v3
- Date: Sat, 6 Apr 2024 22:14:25 GMT
- Title: Which One? Leveraging Context Between Objects and Multiple Views for Language Grounding
- Authors: Chancharik Mitra, Abrar Anwar, Rodolfo Corona, Dan Klein, Trevor Darrell, Jesse Thomason,
- Abstract summary: We present the Multi-view Approach to Grounding in Context (MAGiC)
It selects an object referent based on language that distinguishes between two similar objects.
It improves over the state-of-the-art model on the SNARE object reference task with a relative error reduction of 12.9%.
- Score: 77.26626173589746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When connecting objects and their language referents in an embodied 3D environment, it is important to note that: (1) an object can be better characterized by leveraging comparative information between itself and other objects, and (2) an object's appearance can vary with camera position. As such, we present the Multi-view Approach to Grounding in Context (MAGiC), which selects an object referent based on language that distinguishes between two similar objects. By pragmatically reasoning over both objects and across multiple views of those objects, MAGiC improves over the state-of-the-art model on the SNARE object reference task with a relative error reduction of 12.9\% (representing an absolute improvement of 2.7\%). Ablation studies show that reasoning jointly over object referent candidates and multiple views of each object both contribute to improved accuracy. Code: https://github.com/rcorona/magic_snare/
Related papers
- 3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding [58.924180772480504]
3D visual grounding aims to localize the target object in a 3D point cloud by a free-form language description.
We propose a relation-aware one-stage framework, named 3D Relative Position-aware Network (3-Net)
arXiv Detail & Related papers (2023-07-25T09:33:25Z) - Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z) - Image Segmentation-based Unsupervised Multiple Objects Discovery [1.7674345486888503]
Unsupervised object discovery aims to localize objects in images.
We propose a fully unsupervised, bottom-up approach, for multiple objects discovery.
We provide state-of-the-art results for both unsupervised class-agnostic object detection and unsupervised image segmentation.
arXiv Detail & Related papers (2022-12-20T09:48:24Z) - ObjCAViT: Improving Monocular Depth Estimation Using Natural Language
Models And Image-Object Cross-Attention [22.539300644593936]
monocular depth estimation (MDE) is difficult due to the ambiguity that results from the compression of a 3D scene into only 2 dimensions.
Humans and animals have been shown to use higher-level information to solve MDE.
We present a novel method to enhance MDE performance by encouraging use of known-useful information about the semantics of objects and inter-object relationships within a scene.
arXiv Detail & Related papers (2022-11-30T18:32:06Z) - Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simple and effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z) - Object-Compositional Neural Implicit Surfaces [45.274466719163925]
The neural implicit representation has shown its effectiveness in novel view synthesis and high-quality 3D reconstruction from multi-view images.
This paper proposes a novel framework, ObjectSDF, to build an object-compositional neural implicit representation with high fidelity in 3D reconstruction and object representation.
arXiv Detail & Related papers (2022-07-20T06:38:04Z) - A Simple and Effective Use of Object-Centric Images for Long-Tailed
Object Detection [56.82077636126353]
We take advantage of object-centric images to improve object detection in scene-centric images.
We present a simple yet surprisingly effective framework to do so.
Our approach can improve the object detection (and instance segmentation) accuracy of rare objects by 50% (and 33%) relatively.
arXiv Detail & Related papers (2021-02-17T17:27:21Z) - Accurate Object Association and Pose Updating for Semantic SLAM [2.9602796547156323]
The proposed method is evaluated on a simulated sequence and several sequences in the Kitti dataset.
Experimental results show a very impressive improvement with respect to the traditional SLAM and the state-of-the-art semantic SLAM method.
arXiv Detail & Related papers (2020-12-21T14:21:09Z) - MLCVNet: Multi-Level Context VoteNet for 3D Object Detection [51.45832752942529]
We propose Multi-Level Context VoteNet (MLCVNet) to recognize 3D objects correlatively, building on the state-of-the-art VoteNet.
We introduce three context modules into the voting and classifying stages of VoteNet to encode contextual information at different levels.
Our method is an effective way to promote detection accuracy, achieving new state-of-the-art detection performance on challenging 3D object detection datasets.
arXiv Detail & Related papers (2020-04-12T19:10:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.