Neighborhood Contrastive Transformer for Change Captioning
- URL: http://arxiv.org/abs/2303.03171v1
- Date: Mon, 6 Mar 2023 14:39:54 GMT
- Title: Neighborhood Contrastive Transformer for Change Captioning
- Authors: Yunbin Tu, Liang Li, Li Su, Ke Lu, Qingming Huang
- Abstract summary: We propose a neighborhood contrastive transformer to improve the model's perceiving ability for various changes under different scenes.
The proposed method achieves state-of-the-art performance on three public datasets with different change scenarios.
- Score: 80.10836469177185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Change captioning aims to describe the semantic change between a pair of
similar images in natural language. It is more challenging than general image
captioning because it requires capturing fine-grained change information while
remaining immune to irrelevant viewpoint changes, as well as resolving syntax
ambiguity in change descriptions. In this paper, we propose a neighborhood
contrastive transformer to improve the model's ability to perceive various changes
across different scenes and to understand complex syntax structures.
Concretely, we first design a neighboring feature aggregating module to integrate
neighboring context into each feature, which helps quickly locate inconspicuous
changes under the guidance of conspicuous referents. Then, we devise a common
feature distilling module to compare the two images at the neighborhood level and
extract common properties from each image, so as to learn effective contrastive
information between them. Finally, we introduce explicit dependencies between
words to calibrate the transformer decoder, which helps it better understand
complex syntax structures during training. Extensive experiments demonstrate that
the proposed method achieves state-of-the-art performance on three public
datasets with different change
scenarios. The code is available at https://github.com/tuyunbin/NCT.
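The sketch below illustrates, under stated assumptions, how the two feature-level components named in the abstract could look in PyTorch: a neighboring feature aggregating step that mixes each grid cell with its spatial neighbors, and a common feature distilling step that matches the two images at the neighborhood level and subtracts the shared content to leave contrastive information. The 3x3 convolution, scaled dot-product matching, and tensor shapes are illustrative assumptions, not the authors' exact design; the released NCT implementation is at the GitHub link above, and the word-dependency calibration of the decoder is not sketched here.

```python
# Minimal illustrative sketch (PyTorch) of the two feature-level components
# named in the abstract; it is NOT the released NCT implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NeighboringFeatureAggregating(nn.Module):
    """Mix each grid feature with its spatial neighbors (assumed 3x3 window)."""

    def __init__(self, dim: int):
        super().__init__()
        self.local_mix = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) grid features from a CNN backbone
        ctx = self.local_mix(feat)                  # neighboring context
        out = (feat + ctx).permute(0, 2, 3, 1)      # inject context into each cell
        return self.norm(out).permute(0, 3, 1, 2)


def common_feature_distilling(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Match the two images cell-by-cell, then subtract the shared content.

    Removing what feat_a shares with feat_b is one plausible reading of
    "extract common properties ... to learn effective contrastive information".
    """
    B, C, H, W = feat_a.shape
    fa = feat_a.flatten(2).transpose(1, 2)                        # (B, HW, C)
    fb = feat_b.flatten(2).transpose(1, 2)                        # (B, HW, C)
    attn = F.softmax(fa @ fb.transpose(1, 2) / C ** 0.5, dim=-1)  # (B, HW, HW)
    common = attn @ fb                                            # content of B matched by A
    contrastive = fa - common                                     # what A has that B lacks
    return contrastive.transpose(1, 2).reshape(B, C, H, W)


if __name__ == "__main__":
    before, after = torch.randn(2, 2, 256, 14, 14)   # a "before"/"after" image pair
    agg = NeighboringFeatureAggregating(256)
    before, after = agg(before), agg(after)
    print(common_feature_distilling(before, after).shape)  # torch.Size([2, 256, 14, 14])
```

Aggregating neighbors before matching means each cell carries local referent context, so a small change can be located relative to nearby conspicuous objects rather than matched in isolation.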
Related papers
- Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective [50.261681681643076]
We propose a novel metric called SemVarEffect and a benchmark named SemVarBench to evaluate the causality between semantic variations in inputs and outputs in text-to-image synthesis.
Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.
arXiv Detail & Related papers (2024-10-14T08:45:35Z)
- Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning [71.14084801851381]
Change captioning aims to succinctly describe the semantic change between a pair of similar images.
Most existing methods directly capture the difference between them, which risks obtaining error-prone difference features.
We propose a distractors-immune representation learning network that correlates the corresponding channels of two image representations.
arXiv Detail & Related papers (2024-07-16T13:00:33Z)
- Context-aware Difference Distilling for Multi-change Captioning [106.72151597074098]
Multi-change captioning aims to describe complex and coupled changes within an image pair in natural language.
We propose a novel context-aware difference distilling network to capture all genuine changes for yielding sentences.
arXiv Detail & Related papers (2024-05-31T14:07:39Z)
- Self-supervised Cross-view Representation Reconstruction for Change Captioning [113.08380679787247]
Change captioning aims to describe the difference between a pair of similar images.
Its key challenge is how to learn a stable difference representation under pseudo changes caused by viewpoint change.
We propose a self-supervised cross-view representation reconstruction network.
arXiv Detail & Related papers (2023-09-28T09:28:50Z)
- Changes to Captions: An Attentive Network for Remote Sensing Change Captioning [15.986576036345333]
This study highlights the significance of accurately describing changes in remote sensing images.
We propose an attentive changes-to-captions network, called Chg2Cap for short, for bi-temporal remote sensing images.
The proposed Chg2Cap network is evaluated on two representative remote sensing datasets.
arXiv Detail & Related papers (2023-04-03T15:51:42Z)
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
- R$^3$Net: Relation-embedded Representation Reconstruction Network for Change Captioning [30.962341503501964]
Change captioning uses a natural language sentence to describe the fine-grained disagreement between two similar images.
Viewpoint change is the most typical distractor in this task, because it changes the scale and location of the objects and overwhelms the representation of real change.
We propose a Relation-embedded Representation Reconstruction Network (R$^3$Net) to explicitly distinguish the real change from the large amount of clutter and irrelevant changes.
arXiv Detail & Related papers (2021-10-20T00:57:39Z)
- Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information (see the sketch after this list).
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
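For the gated graph-convolution step mentioned in the last entry above, the following is a minimal, generic sketch of gated neighbor aggregation over a semantic graph of detected regions; the adjacency construction, gating form, and feature dimensions are assumptions for illustration, not the cited paper's exact architecture.

```python
# Generic gated graph-convolution step over a semantic graph of detected
# regions; an illustrative sketch, not the cited paper's exact architecture.
import torch
import torch.nn as nn


class GatedGraphConv(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(dim, dim)        # transform neighbor features into messages
        self.gate = nn.Linear(2 * dim, dim)   # decide how much of each message to keep

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) region features; adj: (N, N) 0/1 object-pair relation matrix
        neigh = adj @ self.msg(x) / adj.sum(-1, keepdim=True).clamp(min=1)  # mean message
        g = torch.sigmoid(self.gate(torch.cat([x, neigh], dim=-1)))
        return x + g * neigh                  # gated, residual neighbor aggregation


regions = torch.randn(36, 512)               # e.g. 36 detected region features
adj = (torch.rand(36, 36) > 0.8).float()     # stand-in for predicted semantic relations
print(GatedGraphConv(512)(regions, adj).shape)  # torch.Size([36, 512])
```

The sigmoid gate lets each region decide how much neighbor information to absorb, which is one way to realize the "selectively aggregate local neighbors' information" idea named in the summary.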