Knowing Earlier what Right Means to You: A Comprehensive VQA Dataset for
Grounding Relative Directions via Multi-Task Learning
- URL: http://arxiv.org/abs/2207.02624v1
- Date: Wed, 6 Jul 2022 12:31:49 GMT
- Title: Knowing Earlier what Right Means to You: A Comprehensive VQA Dataset for
Grounding Relative Directions via Multi-Task Learning
- Authors: Kyra Ahrens, Matthias Kerzel, Jae Hee Lee, Cornelius Weber, Stefan
Wermter
- Abstract summary: We introduce GRiD-A-3D, a novel diagnostic visual question-answering dataset based on abstract objects.
Our dataset allows for a fine-grained analysis of end-to-end VQA models' capabilities to ground relative directions.
We demonstrate that within a few epochs, the subtasks required to reason over relative directions are learned in the order in which relative directions are intuitively processed.
- Score: 16.538887534958555
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spatial reasoning poses a particular challenge for intelligent agents and is
at the same time a prerequisite for their successful interaction and
communication in the physical world. One such reasoning task is to describe the
position of a target object with respect to the intrinsic orientation of some
reference object via relative directions. In this paper, we introduce
GRiD-A-3D, a novel diagnostic visual question-answering (VQA) dataset based on
abstract objects. Our dataset allows for a fine-grained analysis of end-to-end
VQA models' capabilities to ground relative directions. At the same time, model
training requires considerably fewer computational resources compared with
existing datasets, yet yields a comparable or even higher performance. Along
with the new dataset, we provide a thorough evaluation based on two widely
known end-to-end VQA architectures trained on GRiD-A-3D. We demonstrate that
within a few epochs, the subtasks required to reason over relative directions,
such as recognizing and locating objects in a scene and estimating their
intrinsic orientations, are learned in the order in which relative directions
are intuitively processed.
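To make the grounding task concrete, here is a minimal sketch (illustrative only, not the authors' implementation; the 2D layout, the relative_direction helper, and the yaw convention are assumptions) of how the relative direction of a target object can be read off once the reference object's intrinsic orientation is known:

```python
# Minimal sketch (not the authors' code): classifying the relative direction
# of a target object with respect to a reference object's intrinsic
# orientation, i.e. the geometric relation the GRiD-A-3D questions query.
import math

def relative_direction(ref_pos, ref_yaw, target_pos):
    """Return 'front', 'behind', 'left', or 'right' for the target,
    expressed in the reference object's intrinsic frame.

    ref_pos, target_pos: (x, y) world coordinates (illustrative 2D case).
    ref_yaw: heading of the reference object in radians (0 = +x axis).
    """
    # Translate the target into a frame centred on the reference object.
    dx = target_pos[0] - ref_pos[0]
    dy = target_pos[1] - ref_pos[1]
    # Rotate by -ref_yaw so the reference object's facing direction is +x.
    fx = math.cos(-ref_yaw) * dx - math.sin(-ref_yaw) * dy
    fy = math.sin(-ref_yaw) * dx + math.cos(-ref_yaw) * dy
    # The dominant axis in the intrinsic frame decides the answer.
    if abs(fx) >= abs(fy):
        return "front" if fx > 0 else "behind"
    return "left" if fy > 0 else "right"

# Example: the reference object faces +y; an object to its west is on its left.
print(relative_direction((0.0, 0.0), math.pi / 2, (-2.0, 0.0)))  # -> 'left'
```

An end-to-end VQA model has to recover each of these steps (locating both objects and estimating the reference orientation) implicitly from pixels and language, which is what the diagnostic subtasks of GRiD-A-3D probe.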
Related papers
- A Modern Take on Visual Relationship Reasoning for Grasp Planning [10.543168383800532]
We present a modern take on visual relational reasoning for grasp planning.
We introduce D3GD, a novel testbed that includes bin picking scenes with up to 35 objects from 97 distinct categories.
We also propose D3G, a new end-to-end transformer-based dependency graph generation model.
arXiv Detail & Related papers (2024-09-03T16:30:48Z)
- Deep Learning-Based Object Pose Estimation: A Comprehensive Survey [73.74933379151419]
We discuss the recent advances in deep learning-based object pose estimation.
Our survey also covers multiple input data modalities, degrees-of-freedom of output poses, object properties, and downstream tasks.
arXiv Detail & Related papers (2024-05-13T14:44:22Z)
- EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering [11.37120215795946]
We develop a multi-modal multi-task VQA dataset (EarthVQA) to advance relational reasoning-based judging, counting, and comprehensive analysis.
The EarthVQA dataset contains 6000 images, corresponding semantic masks, and 208,593 QA pairs with urban and rural governance requirements embedded.
We propose a Semantic OBject Awareness framework (SOBA) to advance VQA in an object-centric way.
arXiv Detail & Related papers (2023-12-19T15:11:32Z)
- Weakly-supervised 3D Pose Transfer with Keypoints [57.66991032263699]
The main challenges of 3D pose transfer are: 1) a lack of paired training data with different characters performing the same pose; 2) disentangling pose and shape information from the target mesh; and 3) the difficulty of applying the transfer to meshes with different topologies.
We propose a novel weakly-supervised keypoint-based framework to overcome these difficulties.
arXiv Detail & Related papers (2023-07-25T12:40:24Z)
- Towards Multimodal Multitask Scene Understanding Models for Indoor Mobile Agents [49.904531485843464]
In this paper, we discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments.
We describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle the above challenges.
MMISM takes RGB images and sparse Lidar points as inputs and produces 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks.
We show that MMISM performs on par with or even better than single-task models; a generic multi-task sketch follows this entry.
arXiv Detail & Related papers (2022-09-27T04:49:19Z)
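As a point of reference for the MMISM entry above, the following is a generic multi-modal, multi-task skeleton, not the published architecture; the encoders, feature dimensions, and head definitions are illustrative assumptions meant only to show how several output tasks can share one representation:

```python
# Generic multi-modal, multi-task skeleton (an illustration, not MMISM):
# two input encoders feed a shared representation that is decoded by
# several task-specific heads.
import torch
import torch.nn as nn

class MultiModalMultiTask(nn.Module):
    def __init__(self, rgb_dim=512, lidar_dim=128, shared_dim=256):
        super().__init__()
        self.rgb_encoder = nn.Linear(rgb_dim, shared_dim)      # stands in for a CNN
        self.lidar_encoder = nn.Linear(lidar_dim, shared_dim)  # stands in for a point encoder
        self.heads = nn.ModuleDict({
            "detection": nn.Linear(shared_dim, 7),      # box parameters (illustrative)
            "depth": nn.Linear(shared_dim, 1),
            "pose": nn.Linear(shared_dim, 17 * 2),      # 2D keypoints (illustrative)
            "segmentation": nn.Linear(shared_dim, 21),  # class logits (illustrative)
        })

    def forward(self, rgb_feat, lidar_feat):
        # Fuse both modalities into one shared feature, then decode per task.
        shared = torch.relu(self.rgb_encoder(rgb_feat) + self.lidar_encoder(lidar_feat))
        return {name: head(shared) for name, head in self.heads.items()}

# Toy usage with random features for a batch of four samples.
model = MultiModalMultiTask()
outputs = model(torch.randn(4, 512), torch.randn(4, 128))
print({name: tuple(out.shape) for name, out in outputs.items()})
```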
- Towards Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline [35.717047755880536]
The 3D visual question answering (VQA) task is less explored and more susceptible to language priors and co-reference ambiguity.
We collect a new 3D VQA dataset with diverse and relatively free-form question-answer pairs, as well as dense and completely grounded bounding box annotations.
We propose a new 3D VQA framework to effectively predict the completely visually grounded and explainable answer.
arXiv Detail & Related papers (2022-09-24T15:09:02Z)
- What is Right for Me is Not Yet Right for You: A Dataset for Grounding Relative Directions via Multi-Task Learning [16.538887534958555]
We investigate the problem of grounding relative directions with end-to-end neural networks.
GRiD-3D is a novel dataset that features relative directions and complements existing visual question answering (VQA) datasets.
We discover that those subtasks are learned in an order that reflects the steps of an intuitive pipeline for processing relative directions.
arXiv Detail & Related papers (2022-05-05T14:25:46Z)
- SOON: Scenario Oriented Object Navigation with Graph-based Exploration [102.74649829684617]
The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the 'holy grail' goals of intelligent robots.
Most visual navigation benchmarks focus on navigating toward a target from a fixed starting point, guided by an elaborate set of step-by-step instructions.
This approach deviates from real-world problems in which a human only describes what the object and its surroundings look like and asks the robot to start navigation from anywhere.
arXiv Detail & Related papers (2021-03-31T15:01:04Z)
- Exploiting Scene-specific Features for Object Goal Navigation [9.806910643086043]
We introduce a new reduced dataset that speeds up the training of navigation models.
Our proposed dataset permits training models that do not exploit online-built maps in a reasonable amount of time.
We propose the SMTSC model, an attention-based model capable of exploiting the correlation between scenes and objects contained in them.
arXiv Detail & Related papers (2020-08-21T10:16:01Z)
- Dynamic Refinement Network for Oriented and Densely Packed Object Detection [75.29088991850958]
We present a dynamic refinement network that consists of two novel components, i.e., a feature selection module (FSM) and a dynamic refinement head (DRH).
Our FSM enables neurons to adjust receptive fields in accordance with the shapes and orientations of target objects, whereas the DRH empowers our model to refine the prediction dynamically in an object-aware manner.
We perform quantitative evaluations on several publicly available benchmarks including DOTA, HRSC2016, SKU110K, and our own SKU110K-R dataset.
arXiv Detail & Related papers (2020-05-20T11:35:50Z)
- Knowledge as Priors: Cross-Modal Knowledge Generalization for Datasets without Superior Knowledge [55.32035138692167]
Cross-modal knowledge distillation deals with transferring knowledge from a model trained with superior modalities to another model trained with weak modalities.
We propose a novel scheme to train the Student on a target dataset where the Teacher is unavailable; a generic distillation sketch is given after this entry for reference.
arXiv Detail & Related papers (2020-04-01T00:28:15Z)
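For context on the cross-modal distillation entry above: the standard teacher-student objective that such work builds on (and that the proposed scheme must do without, since the Teacher is unavailable on the target dataset) can be sketched as follows; the temperature and weighting values are illustrative assumptions:

```python
# Textbook knowledge-distillation objective (a generic sketch, not the
# paper's method): the Student matches the Teacher's softened predictions
# via KL divergence, blended with the usual cross-entropy loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target KL term and hard-label cross-entropy."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Toy usage with random logits for a 10-class problem.
student = torch.randn(8, 10)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student, teacher, labels).item())
```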