What is Right for Me is Not Yet Right for You: A Dataset for Grounding
Relative Directions via Multi-Task Learning
- URL: http://arxiv.org/abs/2205.02671v1
- Date: Thu, 5 May 2022 14:25:46 GMT
- Title: What is Right for Me is Not Yet Right for You: A Dataset for Grounding
Relative Directions via Multi-Task Learning
- Authors: Jae Hee Lee, Matthias Kerzel, Kyra Ahrens, Cornelius Weber and Stefan
Wermter
- Abstract summary: We investigate the problem of grounding relative directions with end-to-end neural networks.
GRiD-3D is a novel dataset that features relative directions and complements existing visual question answering (VQA) datasets.
We discover that those subtasks are learned in an order that reflects the steps of an intuitive pipeline for processing relative directions.
- Score: 16.538887534958555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding spatial relations is essential for intelligent agents to act
and communicate in the physical world. Relative directions are spatial
relations that describe the relative positions of target objects with regard to
the intrinsic orientation of reference objects. Grounding relative directions
is more difficult than grounding absolute directions because it not only
requires a model to detect objects in the image and to identify spatial
relations based on this information, but it also needs to recognize the
orientation of objects and integrate this information into the reasoning
process. We investigate the challenging problem of grounding relative
directions with end-to-end neural networks. To this end, we provide GRiD-3D, a
novel dataset that features relative directions and complements existing visual
question answering (VQA) datasets, such as CLEVR, that involve only absolute
directions. We also provide baselines for the dataset with two established
end-to-end VQA models. Experimental evaluations show that answering questions
on relative directions is feasible when questions in the dataset simulate the
necessary subtasks for grounding relative directions. We discover that those
subtasks are learned in an order that reflects the steps of an intuitive
pipeline for processing relative directions.
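To make that pipeline concrete, below is a minimal, illustrative Python sketch (not from the paper or the GRiD-3D dataset) of the final reasoning step: given detected object positions and the reference object's recognized intrinsic orientation, classify the target's relative direction. The function name, the 2D top-down simplification, and the four-way direction set are assumptions made for illustration.

```python
# A minimal sketch of the intuitive pipeline the abstract describes:
# detect object positions, recognize the reference object's intrinsic
# orientation, then classify the target's relative direction.
import math

def relative_direction(reference_pos, reference_heading, target_pos):
    """Classify the target as 'front', 'behind', 'left', or 'right' of the
    reference object, in the reference object's intrinsic frame.

    reference_pos, target_pos: (x, y) positions on a top-down plane.
    reference_heading: orientation of the reference object in radians,
    measured counterclockwise from the +x axis.
    """
    # Vector from reference to target in world (absolute) coordinates.
    dx = target_pos[0] - reference_pos[0]
    dy = target_pos[1] - reference_pos[1]
    # Rotate into the reference object's frame so its heading is the +x axis.
    cos_h, sin_h = math.cos(-reference_heading), math.sin(-reference_heading)
    fx = dx * cos_h - dy * sin_h  # forward component
    fy = dx * sin_h + dy * cos_h  # leftward component
    # The dominant component decides the relative direction.
    if abs(fx) >= abs(fy):
        return "front" if fx >= 0 else "behind"
    return "left" if fy >= 0 else "right"

# Example: a chair at (2, 0) facing +y; a ball at (2, 3) is "in front of"
# the chair, even though in absolute image terms it is merely "above" it.
print(relative_direction((2, 0), math.pi / 2, (2, 3)))  # -> 'front'
```

The rotation step is what distinguishes relative from absolute directions: without it, the classification would depend only on image coordinates and ignore the reference object's orientation.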
Related papers
- Space3D-Bench: Spatial 3D Question Answering Benchmark [49.259397521459114]
We present Space3D-Bench - a collection of 1000 general spatial questions and answers related to scenes of the Replica dataset.
We provide an assessment system that grades natural language responses based on predefined ground-truth answers.
Finally, we introduce a baseline called RAG3D-Chat integrating the world understanding of foundation models with rich context retrieval.
arXiv Detail & Related papers (2024-08-29T16:05:22Z)
- DeTra: A Unified Model for Object Detection and Trajectory Forecasting [68.85128937305697]
Our approach formulates the union of the two tasks as a trajectory refinement problem.
To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects.
In our experiments, we observe that our model outperforms the state of the art on the Argoverse 2 Sensor and Waymo Open datasets.
arXiv Detail & Related papers (2024-06-06T18:12:04Z)
- Where Do We Go from Here? Multi-scale Allocentric Relational Inference from Natural Spatial Descriptions [18.736071151303726]
This paper introduces the Rendezvous (RVS) task and dataset, which includes 10,404 examples of English geospatial instructions for reaching a target location using map knowledge.
Our analysis reveals that RVS exhibits a richer use of spatial allocentric relations, and requires resolving more spatial relations simultaneously compared to previous text-based navigation benchmarks.
arXiv Detail & Related papers (2024-02-26T07:33:28Z)
- EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering [11.37120215795946]
We develop a multi-modal multi-task VQA dataset (EarthVQA) to advance relational reasoning-based judging, counting, and comprehensive analysis.
The EarthVQA dataset contains 6000 images, corresponding semantic masks, and 208,593 QA pairs with urban and rural governance requirements embedded.
We propose a Semantic OBject Awareness framework (SOBA) to advance VQA in an object-centric way.
arXiv Detail & Related papers (2023-12-19T15:11:32Z)
- 3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding [58.924180772480504]
3D visual grounding aims to localize the target object in a 3D point cloud by a free-form language description.
We propose a relation-aware one-stage framework, named 3D Relative Position-aware Network (3DRP-Net).
arXiv Detail & Related papers (2023-07-25T09:33:25Z)
- RSG-Net: Towards Rich Sematic Relationship Prediction for Intelligent Vehicle in Complex Environments [72.04891523115535]
We propose RSG-Net (Road Scene Graph Net): a graph convolutional network designed to predict potential semantic relationships from object proposals.
The experimental results indicate that this network, trained on the Road Scene Graph dataset, can efficiently predict potential semantic relationships among objects around the ego-vehicle.
arXiv Detail & Related papers (2022-07-16T12:40:17Z)
- Knowing Earlier what Right Means to You: A Comprehensive VQA Dataset for Grounding Relative Directions via Multi-Task Learning [16.538887534958555]
We introduce GRiD-A-3D, a novel diagnostic visual question-answering dataset based on abstract objects.
Our dataset allows for a fine-grained analysis of end-to-end VQA models' capabilities to ground relative directions.
We demonstrate that within a few epochs, the subtasks required to reason over relative directions are learned in the order in which relative directions are intuitively processed.
arXiv Detail & Related papers (2022-07-06T12:31:49Z)
- Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching [53.27673119360868]
Referring expression grounding is an important and challenging task in computer vision.
We propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges.
Our framework outperforms previous works by 6.55% and 9.94% on two popular grounding datasets.
arXiv Detail & Related papers (2022-01-18T01:13:19Z)
- SOON: Scenario Oriented Object Navigation with Graph-based Exploration [102.74649829684617]
The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the 'holy grail' goals of intelligent robots.
Most visual navigation benchmarks focus on navigating toward a target from a fixed starting point, guided by an elaborate set of step-by-step instructions.
This setting deviates from real-world problems, in which a human only describes what the object and its surroundings look like and asks the robot to start navigating from anywhere.
arXiv Detail & Related papers (2021-03-31T15:01:04Z)
- Exploiting Scene-specific Features for Object Goal Navigation [9.806910643086043]
We introduce a new reduced dataset that speeds up the training of navigation models.
Our proposed dataset makes it possible to train models that do not exploit online-built maps in a reasonable amount of time.
We propose the SMTSC model, an attention-based model capable of exploiting the correlation between scenes and objects contained in them.
arXiv Detail & Related papers (2020-08-21T10:16:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.