Describing and Localizing Multiple Changes with Transformers
- URL: http://arxiv.org/abs/2103.14146v1
- Date: Thu, 25 Mar 2021 21:52:03 GMT
- Title: Describing and Localizing Multiple Changes with Transformers
- Authors: Yue Qiu and Shintaro Yamamoto and Kodai Nakashima and Ryota Suzuki and
Kenji Iwata and Hirokatsu Kataoka and Yutaka Satoh
- Abstract summary: Change captioning tasks aim to detect changes in image pairs observed before and after a scene change.
We propose a CG-based multi-change captioning dataset.
We benchmark existing state-of-the-art single-change captioning methods on multi-change captioning.
- Score: 24.138480002212994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Change captioning tasks aim to detect changes in image pairs observed before
and after a scene change and generate a natural language description of the
changes. Existing change captioning studies have mainly focused on scenes with
a single change. However, detecting and describing multiple changed parts in
image pairs is essential for enhancing adaptability to complex scenarios. We
address these issues from three aspects: (i) We propose a CG-based
multi-change captioning dataset; (ii) We benchmark existing state-of-the-art
single-change captioning methods on multi-change captioning; (iii) We further
propose Multi-Change Captioning transformers (MCCFormers), which identify
change regions by densely correlating different regions in image pairs and
dynamically associate change regions with the words in generated sentences. The
proposed method obtained the highest scores on four conventional change
captioning evaluation metrics for multi-change captioning. In addition,
existing methods generate a single attention map for multiple changes and lack
the ability to distinguish change regions. In contrast, our proposed method
produces a separate attention map for each change and performs well at
change localization. Moreover, the proposed framework outperformed the previous
state-of-the-art methods on an existing change captioning benchmark,
CLEVR-Change, by a large margin (+6.1 on BLEU-4 and +9.7 on CIDEr scores),
indicating its general ability in change captioning tasks.
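As a rough illustration of the dense-correlation idea described above (a minimal sketch, not the authors' released code), the snippet below cross-attends every region feature of the "before" image to every region of the "after" image and back. The dimensions, layer count, shared attention weights, and the assumption of pre-extracted grid features are all illustrative choices.

```python
import torch
import torch.nn as nn

class DenseCorrelationEncoder(nn.Module):
    """Densely correlates region features of an image pair via cross-attention."""

    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        # One cross-attention + norm per layer; weights are shared across the
        # two directions here purely to keep the sketch short (an assumption).
        self.attns = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(layers)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(layers)])

    def forward(self, before, after):
        # before, after: (batch, regions, dim) grid features from any backbone.
        x, y = before, after
        for attn, norm in zip(self.attns, self.norms):
            dx, w_xy = attn(x, y, y)  # each "before" region attends to all "after" regions
            dy, w_yx = attn(y, x, x)  # and vice versa: dense pairwise correlation
            x, y = norm(x + dx), norm(y + dy)
        # w_xy, w_yx: (batch, regions, regions) correlation maps; the concatenated
        # difference-aware features would feed a caption decoder.
        return torch.cat([x, y], dim=1), (w_xy, w_yx)

enc = DenseCorrelationEncoder()
b, a = torch.randn(2, 196, 256), torch.randn(2, 196, 256)  # e.g. 14x14 feature grids
feats, maps = enc(b, a)
print(feats.shape, maps[0].shape)  # torch.Size([2, 392, 256]) torch.Size([2, 196, 196])
```

In the paper's framing, a caption decoder would additionally attend from words to these region features, which is presumably what yields a separate attention map per described change; that decoder is omitted here.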
Related papers
- Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning [49.24306593078429]
We propose a novel framework for remote sensing image change captioning, guided by Key Change Features and Instruction tuning (KCFI).
KCFI includes a ViT encoder for extracting bi-temporal remote sensing image features, a key feature perceiver for identifying critical change areas, and a pixel-level change detection decoder.
To validate the effectiveness of our approach, we compare it against several state-of-the-art change captioning methods on the LEVIR-CC dataset.
arXiv Detail & Related papers (2024-09-19T09:33:33Z)
- ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning [0.846600473226587]
We introduce ChangeChat, the first bitemporal vision-language model (VLM) designed specifically for RS change analysis.
ChangeChat utilizes multimodal instruction tuning, allowing it to handle complex queries such as change captioning, category-specific quantification, and change localization.
Experiments show that ChangeChat offers a comprehensive, interactive solution for RS change analysis, achieving performance comparable to or even better than state-of-the-art (SOTA) methods on specific tasks.
arXiv Detail & Related papers (2024-09-13T07:00:44Z) - Context-aware Difference Distilling for Multi-change Captioning [106.72151597074098]
Multi-change captioning aims to describe complex and coupled changes within an image pair in natural language.
We propose a novel context-aware difference distilling network that captures all genuine changes for caption generation.
arXiv Detail & Related papers (2024-05-31T14:07:39Z)
- MS-Former: Memory-Supported Transformer for Weakly Supervised Change Detection with Patch-Level Annotations [50.79913333804232]
We propose a memory-supported transformer (MS-Former) for weakly supervised change detection.
MS-Former consists of a bi-directional attention block (BAB) and a patch-level supervision scheme (PSS).
Experimental results on three benchmark datasets demonstrate the effectiveness of our proposed method in the change detection task.
arXiv Detail & Related papers (2023-11-16T09:57:29Z)
- Neighborhood Contrastive Transformer for Change Captioning [80.10836469177185]
We propose a neighborhood contrastive transformer to improve the model's ability to perceive various changes across different scenes.
The proposed method achieves the state-of-the-art performance on three public datasets with different change scenarios.
arXiv Detail & Related papers (2023-03-06T14:39:54Z)
- The Change You Want to See [91.3755431537592]
Given two images of the same scene, being able to automatically detect the changes in them has practical applications in a variety of domains.
We tackle the change detection problem with the goal of detecting "object-level" changes in an image pair despite differences in their viewpoint and illumination.
arXiv Detail & Related papers (2022-09-28T18:10:09Z)
- Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy.
We propose a novel text-to-image method that addresses these gaps by enabling a simple control mechanism, complementary to text, in the form of a scene.
Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high-fidelity images at a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z)
- Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning [41.044241265804125]
We propose a novel visual encoder to explicitly distinguish viewpoint changes from semantic changes in the change captioning task.
We also propose a novel reinforcement learning process to fine-tune the attention directly with language evaluation rewards.
Our method outperforms the state-of-the-art approaches by a large margin on both the Spot-the-Diff and CLEVR-Change datasets.
arXiv Detail & Related papers (2020-09-30T00:13:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.