EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote
Sensing Visual Question Answering
- URL: http://arxiv.org/abs/2312.12222v1
- Date: Tue, 19 Dec 2023 15:11:32 GMT
- Title: EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote
Sensing Visual Question Answering
- Authors: Junjue Wang, Zhuo Zheng, Zihang Chen, Ailong Ma, and Yanfei Zhong
- Abstract summary: We develop a multi-modal multi-task VQA dataset (EarthVQA) to advance relational reasoning-based judging, counting, and comprehensive analysis.
The EarthVQA dataset contains 6000 images, corresponding semantic masks, and 208,593 QA pairs with urban and rural governance requirements embedded.
We propose a Semantic OBject Awareness framework (SOBA) to advance VQA in an object-centric way.
- Score: 11.37120215795946
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Earth vision research typically focuses on extracting geospatial object
locations and categories but neglects the exploration of relations between
objects and comprehensive reasoning. Based on city planning needs, we develop a
multi-modal multi-task VQA dataset (EarthVQA) to advance relational
reasoning-based judging, counting, and comprehensive analysis. The EarthVQA
dataset contains 6000 images, corresponding semantic masks, and 208,593 QA
pairs with urban and rural governance requirements embedded. As objects are the
basis for complex relational reasoning, we propose a Semantic OBject Awareness
framework (SOBA) to advance VQA in an object-centric way. To preserve refined
spatial locations and semantics, SOBA leverages a segmentation network for
object semantics generation. The object-guided attention aggregates object
interior features via pseudo masks, and bidirectional cross-attention further
models object external relations hierarchically. To optimize object counting,
we propose a numerical difference loss that dynamically adds difference
penalties, unifying the classification and regression tasks. Experimental
results show that SOBA outperforms both advanced general-purpose and remote
sensing methods. We believe this dataset and framework provide a strong
benchmark for complex analysis in Earth vision. The project page is at
https://Junjue-Wang.github.io/homepage/EarthVQA.
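As a rough illustration of the object-guided aggregation described in the abstract, the sketch below pools backbone features inside each class's soft pseudo mask into one token per object category. The module name, tensor shapes, and the output projection are illustrative assumptions for a minimal PyTorch reading, not the authors' released code:

```python
import torch
import torch.nn as nn

class ObjectGuidedAggregation(nn.Module):
    """Pool visual features inside each predicted object region into one
    token per semantic class (masked average pooling over soft pseudo masks).
    Illustrative sketch only; not the authors' implementation."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, feat_dim)  # assumed output projection

    def forward(self, feats: torch.Tensor, seg_logits: torch.Tensor) -> torch.Tensor:
        # feats:      (B, C, H, W) backbone features
        # seg_logits: (B, K, H, W) segmentation-network logits for K classes
        pseudo = seg_logits.softmax(dim=1)                  # soft pseudo masks
        area = pseudo.flatten(2).sum(-1).clamp(min=1e-6)    # (B, K) region sizes
        # Weighted sum of features over each region, normalized by its area.
        pooled = torch.einsum("bkhw,bchw->bkc", pseudo, feats) / area.unsqueeze(-1)
        return self.proj(pooled)                            # (B, K, C) object tokens
```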
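The bidirectional cross-attention that models object external relations could then be sketched, again under assumed shapes and residual wiring, as two standard multi-head attention passes: object tokens attend to question tokens, and question tokens attend to object tokens:

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """One round of object-to-text and text-to-object attention with residual
    connections. Head count and wiring are assumptions for illustration."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.obj_from_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_from_obj = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, obj: torch.Tensor, txt: torch.Tensor):
        # obj: (B, K, C) object tokens; txt: (B, T, C) question tokens
        obj_upd, _ = self.obj_from_txt(obj, txt, txt)  # objects gather language cues
        txt_upd, _ = self.txt_from_obj(txt, obj, obj)  # words gather object evidence
        return obj + obj_upd, txt + txt_upd
```

Stacking such rounds would give the hierarchical relation modeling the abstract mentions.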
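The numerical difference loss is described only as dynamically adding difference penalties to unify classification and regression for counting. One plausible minimal reading, with `alpha` an assumed weight, is cross-entropy over count bins plus a penalty on the numeric distance between the expected predicted count and the ground truth:

```python
import torch
import torch.nn.functional as F

def numerical_difference_loss(logits: torch.Tensor,
                              target: torch.Tensor,
                              alpha: float = 0.5) -> torch.Tensor:
    """Hedged sketch in the spirit of the paper's numerical difference loss;
    the exact formulation is not given in this summary.

    logits: (B, N) scores over count answers 0..N-1
    target: (B,)   ground-truth counts as class indices
    """
    ce = F.cross_entropy(logits, target)
    counts = torch.arange(logits.size(1), device=logits.device, dtype=logits.dtype)
    expected = (logits.softmax(dim=-1) * counts).sum(dim=-1)   # soft predicted count
    diff = (expected - target.to(logits.dtype)).abs().mean()   # numeric distance
    return ce + alpha * diff
```

Under this reading, predicting "4" when the answer is "5" costs less than predicting "12", which is what ties the classification view of counting to its regression view.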
Related papers
- A Modern Take on Visual Relationship Reasoning for Grasp Planning [10.543168383800532]
We present a modern take on visual relational reasoning for grasp planning.
We introduce D3GD, a novel testbed that includes bin picking scenes with up to 35 objects from 97 distinct categories.
We also propose D3G, a new end-to-end transformer-based dependency graph generation model.
arXiv Detail & Related papers (2024-09-03T16:30:48Z)
- Learning Spatial-Semantic Features for Robust Video Object Segmentation [108.045326229865]
We propose a robust video object segmentation framework equipped with spatial-semantic features and discriminative object queries.
We show that the proposed method sets new state-of-the-art performance on multiple datasets.
arXiv Detail & Related papers (2024-07-10T15:36:00Z)
- Detect2Interact: Localizing Object Key Field in Visual Question Answering (VQA) with LLMs [5.891295920078768]
We introduce an advanced approach for fine-grained object visual key field detection.
First, we use the segment anything model (SAM) to generate detailed spatial maps of objects in images.
Next, we use Vision Studio to extract semantic object descriptions.
Third, we employ GPT-4's common sense knowledge, bridging the gap between an object's semantics and its spatial map.
arXiv Detail & Related papers (2024-04-01T14:53:36Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- An Object SLAM Framework for Association, Mapping, and High-Level Tasks [12.62957558651032]
We present a comprehensive object SLAM framework that focuses on object-based perception and object-oriented robot tasks.
The proposed object SLAM framework is evaluated on a range of public datasets and real-world experiments, demonstrating its efficient performance.
arXiv Detail & Related papers (2023-05-12T08:10:14Z)
- FGAHOI: Fine-Grained Anchors for Human-Object Interaction Detection [4.534713782093219]
A novel end-to-end transformer-based framework (FGAHOI) is proposed for fine-grained human-object interaction detection.
FGAHOI comprises three dedicated components: multi-scale sampling (MSS), hierarchical spatial-aware merging (HSAM), and a task-aware merging mechanism (TAM).
arXiv Detail & Related papers (2023-01-08T03:53:50Z)
- Knowing Earlier what Right Means to You: A Comprehensive VQA Dataset for Grounding Relative Directions via Multi-Task Learning [16.538887534958555]
We introduce GRiD-A-3D, a novel diagnostic visual question-answering dataset based on abstract objects.
Our dataset allows for a fine-grained analysis of end-to-end VQA models' capabilities to ground relative directions.
We demonstrate that within a few epochs, the subtasks required to reason over relative directions are learned in the order in which relative directions are intuitively processed.
arXiv Detail & Related papers (2022-07-06T12:31:49Z)
- Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules [85.98177341704675]
The problem of grounding VQA tasks has recently seen increased attention in the research community.
We propose a visual capsule module with a query-based selection mechanism of capsule features.
We show that integrating the proposed capsule module in existing VQA systems significantly improves their performance on the weakly supervised grounding task.
arXiv Detail & Related papers (2021-05-11T07:45:32Z)
- ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection [101.56529337489417]
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of ⟨human, action, object⟩ in images.
We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs.
Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into visual-semantic joint embedding space and obtains detection results by measuring their similarities.
arXiv Detail & Related papers (2020-08-14T09:11:18Z) - Benchmarking Unsupervised Object Representations for Video Sequences [111.81492107649889]
We compare the perceptual abilities of four object-centric approaches: ViMON, OP3, TBA and SCALOR.
Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking.
Our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
arXiv Detail & Related papers (2020-06-12T09:37:24Z) - Dynamic Refinement Network for Oriented and Densely Packed Object
Detection [75.29088991850958]
We present a dynamic refinement network that consists of two novel components: a feature selection module (FSM) and a dynamic refinement head (DRH).
Our FSM enables neurons to adjust receptive fields in accordance with the shapes and orientations of target objects, whereas the DRH empowers our model to refine the prediction dynamically in an object-aware manner.
We perform quantitative evaluations on several publicly available benchmarks including DOTA, HRSC2016, SKU110K, and our own SKU110K-R dataset.
arXiv Detail & Related papers (2020-05-20T11:35:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.