Self-supervised Cross-view Representation Reconstruction for Change
Captioning
- URL: http://arxiv.org/abs/2309.16283v1
- Date: Thu, 28 Sep 2023 09:28:50 GMT
- Title: Self-supervised Cross-view Representation Reconstruction for Change
Captioning
- Authors: Yunbin Tu, Liang Li, Li Su, Zheng-Jun Zha, Chenggang Yan, Qingming
Huang
- Abstract summary: Change captioning aims to describe the difference between a pair of similar images.
Its key challenge is how to learn a stable difference representation under pseudo changes caused by viewpoint change.
We propose a self-supervised cross-view representation reconstruction network.
- Score: 113.08380679787247
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Change captioning aims to describe the difference between a pair of similar
images. Its key challenge is how to learn a stable difference representation
under pseudo changes caused by viewpoint change. In this paper, we address this
by proposing a self-supervised cross-view representation reconstruction
(SCORER) network. Concretely, we first design a multi-head token-wise matching
to model relationships between cross-view features from similar/dissimilar
images. Then, by maximizing cross-view contrastive alignment of two similar
images, SCORER learns two view-invariant image representations in a
self-supervised way. Based on these, we reconstruct the representations of
unchanged objects by cross-attention, thus learning a stable difference
representation for caption generation. Further, we devise a cross-modal
backward reasoning module to improve caption quality. This module reversely
models a "hallucination" representation from the caption and the "before"
representation. By pushing it closer to the "after" representation, we
enforce the caption to be informative about the difference in a
self-supervised manner. Extensive experiments show that our method achieves
state-of-the-art results on four datasets. The code is available at
https://github.com/tuyunbin/SCORER.
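The abstract describes three coupled mechanisms: multi-head token-wise matching across views, self-supervised cross-view contrastive alignment, and cross-attention reconstruction of unchanged objects. Below is a minimal PyTorch sketch of that pipeline; the module names, feature shapes, pooling, temperature, and the InfoNCE-style loss are illustrative assumptions, not the authors' implementation (the official code is at https://github.com/tuyunbin/SCORER).

```python
# Hedged sketch: cross-view contrastive alignment followed by cross-attention
# reconstruction of unchanged objects. All modules, shapes, and hyper-parameters
# are assumptions made for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossViewAligner(nn.Module):
    def __init__(self, dim=512, num_heads=8, temperature=0.07):
        super().__init__()
        self.temperature = temperature
        # Multi-head token-wise matching between "before" and "after" tokens
        # (shared in both directions here purely to keep the sketch small).
        self.match = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention used to reconstruct unchanged-object representations.
        self.reconstruct = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def contrastive_loss(self, before_feat, after_feat):
        # InfoNCE-style alignment: pooled features of the two views of the same
        # scene are positives; other samples in the batch act as negatives.
        z1 = F.normalize(before_feat.mean(dim=1), dim=-1)   # (B, D)
        z2 = F.normalize(after_feat.mean(dim=1), dim=-1)    # (B, D)
        logits = z1 @ z2.t() / self.temperature             # (B, B)
        targets = torch.arange(z1.size(0), device=z1.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def forward(self, before_tokens, after_tokens):
        # Token-wise matching: each view attends to the other view's tokens.
        before_matched, _ = self.match(before_tokens, after_tokens, after_tokens)
        after_matched, _ = self.match(after_tokens, before_tokens, before_tokens)

        # Self-supervised cross-view alignment of the two (similar) images.
        align_loss = self.contrastive_loss(before_matched, after_matched)

        # Reconstruct unchanged objects in the "after" view from the "before"
        # view; the residual serves as a stable difference representation.
        unchanged, _ = self.reconstruct(after_matched, before_matched, before_matched)
        difference = after_matched - unchanged
        return difference, align_loss


# Usage with random tensors standing in for region/grid features.
if __name__ == "__main__":
    model = CrossViewAligner()
    before = torch.randn(4, 36, 512)   # (batch, tokens, dim)
    after = torch.randn(4, 36, 512)
    diff, loss = model(before, after)
    print(diff.shape, loss.item())
```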
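The cross-modal backward reasoning module is described only at a high level. One plausible reading, sketched below under that assumption, is to fuse the generated caption with the "before" representation into a "hallucination" representation and pull it toward the "after" representation with a simple cosine objective; the fusion module and the loss form are hypothetical, not the paper's exact formulation.

```python
# Hedged sketch of cross-modal backward reasoning: fuse the caption with the
# "before" representation into a "hallucination" and pull it toward the
# "after" representation. The fusion module and loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BackwardReasoning(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Caption tokens attend to the "before" image tokens to form the
        # "hallucination" representation.
        self.fuse = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, caption_tokens, before_tokens, after_tokens):
        hallucination, _ = self.fuse(caption_tokens, before_tokens, before_tokens)
        h = F.normalize(hallucination.mean(dim=1), dim=-1)
        a = F.normalize(after_tokens.mean(dim=1), dim=-1)
        # Pushing the hallucination toward the "after" representation forces
        # the caption to carry the information that distinguishes the two views.
        return (1.0 - (h * a).sum(dim=-1)).mean()


# Example call with random tensors standing in for caption and image features.
if __name__ == "__main__":
    module = BackwardReasoning()
    caption = torch.randn(4, 12, 512)   # embedded caption tokens
    before = torch.randn(4, 36, 512)
    after = torch.randn(4, 36, 512)
    loss = module(caption, before, after)
    print(loss.item())
```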
Related papers
- Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning [71.14084801851381]
Change captioning aims to succinctly describe the semantic change between a pair of similar images.
Most existing methods directly capture the difference between them, which risks obtaining error-prone difference features.
We propose a distractors-immune representation learning network that correlates the corresponding channels of two image representations.
arXiv Detail & Related papers (2024-07-16T13:00:33Z)
- DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations [64.43387739794531]
Current encoder-based approaches significantly impair the text controllability of text-to-image models while transferring styles.
We introduce DEADiff to address this issue using two strategies.
DEADiff attains the best visual stylization results and an optimal balance between the text controllability inherent in the text-to-image model and style similarity to the reference image.
arXiv Detail & Related papers (2024-03-11T17:35:23Z)
- CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition [73.51329037954866]
We propose a robust global representation method with cross-image correlation awareness for visual place recognition.
Our method uses the attention mechanism to correlate multiple images within a batch.
It outperforms state-of-the-art methods by a large margin with significantly less training time.
arXiv Detail & Related papers (2024-02-29T15:05:11Z)
- Neighborhood Contrastive Transformer for Change Captioning [80.10836469177185]
We propose a neighborhood contrastive transformer to improve the model's ability to perceive various changes under different scenes.
The proposed method achieves state-of-the-art performance on three public datasets with different change scenarios.
arXiv Detail & Related papers (2023-03-06T14:39:54Z)
- R$^3$Net: Relation-embedded Representation Reconstruction Network for Change Captioning [30.962341503501964]
Change captioning uses a natural language sentence to describe the fine-grained disagreement between two similar images.
Viewpoint change is the most typical distractor in this task, because it changes the scale and location of the objects and overwhelms the representation of real change.
We propose a Relation-embedded Representation Reconstruction Network (R$^3$Net) to explicitly distinguish the real change from the large amount of clutter and irrelevant changes.
arXiv Detail & Related papers (2021-10-20T00:57:39Z)
- Seed the Views: Hierarchical Semantic Alignment for Contrastive Representation Learning [116.91819311885166]
We propose a hierarchical semantic alignment strategy that expands the views generated by a single image to Cross-samples and Multi-level representation.
Our method, termed CsMl, can integrate multi-level visual representations across samples in a robust way.
arXiv Detail & Related papers (2020-12-04T17:26:24Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.