R$^3$Net: Relation-embedded Representation Reconstruction Network for
Change Captioning
- URL: http://arxiv.org/abs/2110.10328v1
- Date: Wed, 20 Oct 2021 00:57:39 GMT
- Title: R$^3$Net: Relation-embedded Representation Reconstruction Network for
Change Captioning
- Authors: Yunbin Tu, Liang Li, Chenggang Yan, Shengxiang Gao, Zhengtao Yu
- Abstract summary: Change captioning uses a natural language sentence to describe the fine-grained differences between two similar images.
Viewpoint change is the most typical distractor in this task, because it alters the scale and location of objects and overwhelms the representation of the real change.
We propose a Relation-embedded Representation Reconstruction Network (R$^3$Net) to explicitly distinguish the real change from a large amount of clutter and irrelevant changes.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Change captioning uses a natural language sentence to describe the
fine-grained differences between two similar images. Viewpoint change is the
most typical distractor in this task, because it changes the scale and location
of the objects and overwhelms the representation of real change. In this paper,
we propose a Relation-embedded Representation Reconstruction Network (R$^3$Net)
to explicitly distinguish the real change from the large amount of clutter and
irrelevant changes. Specifically, a relation-embedded module is first devised
to explore potential changed objects in the large amount of clutter. Then,
based on the semantic similarities of corresponding locations in the two
images, a representation reconstruction module (RRM) is designed to learn the
reconstruction representation and further model the difference representation.
Besides, we introduce a syntactic skeleton predictor (SSP) to enhance the
semantic interaction between change localization and caption generation.
Extensive experiments show that the proposed method achieves state-of-the-art
results on two public datasets.
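The reconstruction idea in the abstract can be illustrated with a minimal sketch. This is an illustrative simplification, not the paper's actual RRM: the function name, the use of cosine similarity, and the softmax attention are all assumptions; the intuition is only that locations of the "after" image that are well explained by the "before" image are pseudo changes, while large residuals indicate real change.

```python
import numpy as np

def reconstruct_and_diff(feat_a: np.ndarray, feat_b: np.ndarray):
    """Reconstruct feat_b from feat_a via location-wise semantic similarity,
    then take the residual as a difference representation.

    feat_a, feat_b: (N, D) arrays of N spatial locations with D-dim features.
    Returns (recon_b, diff), both (N, D).
    """
    # Cosine similarity between every pair of locations in the two images.
    a = feat_a / (np.linalg.norm(feat_a, axis=-1, keepdims=True) + 1e-8)
    b = feat_b / (np.linalg.norm(feat_b, axis=-1, keepdims=True) + 1e-8)
    sim = b @ a.T                                   # (N, N)

    # Numerically stable softmax over "before" locations: each "after"
    # location is rebuilt from its most semantically similar counterparts.
    sim = sim - sim.max(axis=-1, keepdims=True)
    attn = np.exp(sim)
    attn /= attn.sum(axis=-1, keepdims=True)
    recon_b = attn @ feat_a                         # (N, D)

    # Residual: small where feat_a explains feat_b (viewpoint-induced
    # pseudo change), large where a real change occurred.
    diff = feat_b - recon_b
    return recon_b, diff
```

Because the reconstruction pools over all "before" locations, an object that merely shifted position under a viewpoint change can still be matched and explained away, leaving the residual to highlight genuine insertions, removals, or attribute changes.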
Related papers
- Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning [71.14084801851381]
Change captioning aims to succinctly describe the semantic change between a pair of similar images.
Most existing methods directly capture the difference between them, which risk obtaining error-prone difference features.
We propose a distractors-immune representation learning network that correlates the corresponding channels of two image representations.
arXiv Detail & Related papers (2024-07-16T13:00:33Z)
- Self-supervised Cross-view Representation Reconstruction for Change Captioning [113.08380679787247]
Change captioning aims to describe the difference between a pair of similar images.
Its key challenge is how to learn a stable difference representation under pseudo changes caused by viewpoint change.
We propose a self-supervised cross-view representation reconstruction network.
arXiv Detail & Related papers (2023-09-28T09:28:50Z)
- LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts [107.11267074981905]
We propose a semantically controllable layout-AWare diffusion model, termed LAW-Diffusion.
We show that LAW-Diffusion yields the state-of-the-art generative performance, especially with coherent object relations.
arXiv Detail & Related papers (2023-08-13T08:06:18Z)
- Align, Perturb and Decouple: Toward Better Leverage of Difference Information for RSI Change Detection [24.249552791014644]
Change detection is a widely adopted technique in remote sensing imagery (RSI) analysis.
We propose a series of operations to fully exploit the difference information: Alignment, Perturbation and Decoupling.
arXiv Detail & Related papers (2023-05-30T03:39:53Z)
- Neighborhood Contrastive Transformer for Change Captioning [80.10836469177185]
We propose a neighborhood contrastive transformer to improve the model's perceiving ability for various changes under different scenes.
The proposed method achieves the state-of-the-art performance on three public datasets with different change scenarios.
arXiv Detail & Related papers (2023-03-06T14:39:54Z)
- Part-guided Relational Transformers for Fine-grained Visual Recognition [59.20531172172135]
We propose a framework to learn the discriminative part features and explore correlations with a feature transformation module.
Our proposed approach does not rely on additional part branches and reaches state-of-the-art performance on three fine-grained object recognition benchmarks.
arXiv Detail & Related papers (2022-12-28T03:45:56Z)
- Region Similarity Representation Learning [94.88055458257081]
Region Similarity Representation Learning (ReSim) is a new approach to self-supervised representation learning for localization-based tasks.
ReSim learns both regional representations for localization as well as semantic image-level representations.
We show how ReSim learns representations which significantly improve the localization and classification performance compared to a competitive MoCo-v2 baseline.
arXiv Detail & Related papers (2021-03-24T00:42:37Z)
- Image Captioning with Visual Object Representations Grounded in the Textual Modality [14.797241131469486]
We explore the possibilities of a shared embedding space between textual and visual modality.
We propose an approach opposite to the current trend, grounding of the representations in the word embedding space of the captioning system.
arXiv Detail & Related papers (2020-10-19T12:21:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.