Change Detection Meets Visual Question Answering
- URL: http://arxiv.org/abs/2112.06343v1
- Date: Sun, 12 Dec 2021 22:39:20 GMT
- Title: Change Detection Meets Visual Question Answering
- Authors: Zhenghang Yuan, Lichao Mou, Zhitong Xiong, Xiaoxiang Zhu
- Abstract summary: We introduce a novel task: change detection-based visual question answering (CDVQA) on multi-temporal aerial images.
In particular, multi-temporal images can be queried to obtain high-level change-based information according to content changes between two input images.
A baseline CDVQA framework is devised in this work, and it contains four parts: multi-temporal feature encoding, multi-temporal fusion, multi-modal fusion, and answer prediction.
- Score: 23.63790450326685
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Earth's surface is continually changing, and identifying changes plays an
important role in urban planning and sustainability. Although change detection
techniques have been successfully developed for many years, these techniques
are still limited to experts and facilitators in related fields. In order to
provide every user with flexible access to change information and help them
better understand land-cover changes, we introduce a novel task: change
detection-based visual question answering (CDVQA) on multi-temporal aerial
images. In particular, multi-temporal images can be queried to obtain high-level
change-based information according to content changes between two input images.
We first build a CDVQA dataset of multi-temporal image-question-answer triplets
using an automatic question-answer generation method. Then, a baseline CDVQA
framework is devised; it contains four parts: multi-temporal feature encoding,
multi-temporal fusion, multi-modal fusion, and answer prediction. In addition,
we introduce a change enhancing module into the multi-temporal feature encoding,
aiming to incorporate more change-related information. Finally, the effects of
different backbones and multi-temporal fusion strategies on CDVQA performance
are studied. The experimental results provide useful insights for developing
better CDVQA models, which are important for future research on this task. We
will make our dataset and code publicly available.
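The four-part baseline described in the abstract lends itself to a compact sketch. The PyTorch code below is a minimal, illustrative wiring of the stages (multi-temporal feature encoding, multi-temporal fusion, multi-modal fusion, and answer prediction); the ResNet-18 backbone, feature dimensions, module names, and the simple feature-difference stand-in for the change enhancing module are assumptions made for illustration, not the authors' released implementation.

```python
# Minimal sketch of a CDVQA-style baseline, assuming a shared CNN backbone
# for the two acquisition dates, an LSTM question encoder, and simple
# fusion operators. All design choices here are illustrative.
import torch
import torch.nn as nn
import torchvision.models as models


class CDVQABaseline(nn.Module):
    def __init__(self, vocab_size, num_answers, q_dim=512, v_dim=512):
        super().__init__()
        # 1) Multi-temporal feature encoding: one shared backbone encodes
        #    the "before" and "after" images separately.
        backbone = models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # (B, 512, 1, 1)

        # Question encoder: word embeddings followed by an LSTM.
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, q_dim, batch_first=True)

        # 2) Multi-temporal fusion: concatenate the two image features and
        #    their absolute difference (a crude stand-in for the paper's
        #    change enhancing module), then project.
        self.temporal_fuse = nn.Linear(512 * 3, v_dim)

        # 3) Multi-modal fusion: element-wise product of projected visual
        #    and question features (a common VQA choice).
        self.q_proj = nn.Linear(q_dim, v_dim)

        # 4) Answer prediction: classification over candidate answers.
        self.classifier = nn.Sequential(
            nn.Linear(v_dim, v_dim), nn.ReLU(), nn.Linear(v_dim, num_answers)
        )

    def forward(self, img_t1, img_t2, question_tokens):
        f1 = self.encoder(img_t1).flatten(1)            # (B, 512)
        f2 = self.encoder(img_t2).flatten(1)            # (B, 512)
        change = torch.abs(f2 - f1)                     # change-related cue
        v = torch.relu(self.temporal_fuse(torch.cat([f1, f2, change], dim=1)))

        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = self.q_proj(h[-1])                          # (B, v_dim)

        joint = v * q                                   # multi-modal fusion
        return self.classifier(joint)                   # answer logits


# Example usage with dummy tensors:
model = CDVQABaseline(vocab_size=1000, num_answers=20)
logits = model(torch.randn(2, 3, 224, 224),
               torch.randn(2, 3, 224, 224),
               torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 20])
```

Any of the fusion operators above (concatenation, subtraction, element-wise product) could be swapped for the alternative multi-temporal fusion strategies whose effects the paper studies.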
Related papers
- Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent [102.31558123570437]
Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs).
We propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch.
arXiv Detail & Related papers (2024-11-05T09:27:21Z)
- Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection [82.65760006883248]
We introduce a new task named Change Detection Question Answering and Grounding (CDQAG).
CDQAG extends the traditional change detection task by providing interpretable textual answers and intuitive visual evidence.
We construct the first CDQAG benchmark dataset, termed QAG-360K, comprising over 360K triplets of questions, textual answers, and corresponding high-quality visual masks.
arXiv Detail & Related papers (2024-10-31T11:20:13Z)
- DRFormer: Multi-Scale Transformer Utilizing Diverse Receptive Fields for Long Time-Series Forecasting [3.420673126033772]
We propose a dynamic tokenizer with a dynamic sparse learning algorithm to capture diverse receptive fields and sparse patterns of time series data.
Our proposed model, named DRFormer, is evaluated on various real-world datasets, and experimental results demonstrate its superiority compared to existing methods.
arXiv Detail & Related papers (2024-08-05T07:26:47Z)
- DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with Competitive Query Selection and Adaptive Feature Fusion [82.2425759608975]
Infrared-visible object detection aims to achieve robust object detection around the clock by fusing the complementary information of infrared and visible images.
We propose a Dynamic Adaptive Multispectral Detection Transformer (DAMSDet) to address these two challenges.
Experiments on four public datasets demonstrate significant improvements compared to other state-of-the-art methods.
arXiv Detail & Related papers (2024-03-01T07:03:27Z)
- Consolidating Attention Features for Multi-view Image Editing [126.19731971010475]
We focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views.
We introduce QNeRF, a neural radiance field trained on the internal query features of the edited images.
We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps.
arXiv Detail & Related papers (2024-02-22T18:50:18Z)
- ChangeNet: Multi-Temporal Asymmetric Change Detection Dataset [20.585593022144398]
ChangeNet consists of 31,000 multi-temporal image pairs covering a wide range of complex scenes from 100 cities, with 6 pixel-level categories.
ChangeNet contains large amounts of real-world perspective distortions across different temporal phases over the same areas.
The ChangeNet dataset is suitable for both binary change detection (BCD) and semantic change detection (SCD) tasks.
arXiv Detail & Related papers (2023-12-29T01:42:20Z)
- ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese [1.6340299456362617]
We introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese.
We conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations.
We present PhoVIT, a comprehensive multimodal fusion model that identifies objects in images based on the questions.
arXiv Detail & Related papers (2023-10-27T10:44:50Z)
- MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data or annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
- MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding [131.8797942031366]
We present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text.
Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.
We introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task.
arXiv Detail & Related papers (2021-12-20T18:23:30Z)
- Domain-robust VQA with diverse datasets and methods but no target labels [34.331228652254566]
Domain adaptation for VQA differs from adaptation for object recognition due to additional complexity.
To tackle these challenges, we first quantify domain shifts between popular VQA datasets.
We also construct synthetic shifts in the image and question domains separately.
arXiv Detail & Related papers (2021-03-29T22:24:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.