Context-aware Difference Distilling for Multi-change Captioning
- URL: http://arxiv.org/abs/2405.20810v2
- Date: Fri, 7 Jun 2024 04:27:30 GMT
- Title: Context-aware Difference Distilling for Multi-change Captioning
- Authors: Yunbin Tu, Liang Li, Li Su, Zheng-Jun Zha, Chenggang Yan, Qingming Huang
- Abstract summary: Multi-change captioning aims to describe complex and coupled changes within an image pair in natural language.
We propose a novel context-aware difference distilling network to capture all genuine changes for yielding sentences.
- Score: 106.72151597074098
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-change captioning aims to describe complex and coupled changes within an image pair in natural language. Compared with single-change captioning, this task requires the model to have higher-level cognition ability to reason about an arbitrary number of changes. In this paper, we propose a novel context-aware difference distilling (CARD) network to capture all genuine changes for yielding sentences. Given an image pair, CARD first decouples context features that aggregate all similar/dissimilar semantics, termed common/difference context features. Then, consistency and independence constraints are designed to guarantee the alignment/discrepancy of the common/difference context features. Further, the common context features guide the model to mine locally unchanged features, which are subtracted from the pair to distill local difference features. Next, the difference context features augment the local difference features to ensure that all changes are distilled. In this way, we obtain an omni-representation of all changes, which is translated into linguistic sentences by a transformer decoder. Extensive experiments on three public datasets show that CARD performs favourably against state-of-the-art methods. The code is available at https://github.com/tuyunbin/CARD.
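To make the pipeline described above concrete, below is a minimal PyTorch sketch of the stages named in the abstract: decoupling common/difference context features, imposing consistency and independence constraints, subtracting common-guided unchanged features to distill local differences, and augmenting them with the difference context. All module names and design choices here (cross-attention decoupling, sigmoid gating, the MSE/cosine loss forms) are assumptions made purely for illustration and are not the authors' implementation, which is available in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CARDSketch(nn.Module):
    """Illustrative sketch of the CARD stages described in the abstract.

    Every name and design choice below is a hypothetical stand-in, not the
    authors' code; see https://github.com/tuyunbin/CARD for the real model.
    """

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Two attention blocks stand in for decoupling context features that
        # aggregate similar (common) vs. dissimilar (difference) semantics.
        self.common_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.diff_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Gate used to mine locally unchanged features under common-context guidance.
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, feat_before: torch.Tensor, feat_after: torch.Tensor):
        # feat_*: (B, N, D) grid features of the "before" / "after" images.
        pair = torch.cat([feat_before, feat_after], dim=1)  # (B, 2N, D)

        # Common / difference context features over the concatenated pair.
        common_ctx, _ = self.common_attn(pair, pair, pair)
        diff_ctx, _ = self.diff_attn(pair, pair, pair)
        c_before, c_after = common_ctx.chunk(2, dim=1)
        d_before, d_after = diff_ctx.chunk(2, dim=1)

        # Consistency constraint: common contexts of the two images should align.
        loss_consistency = F.mse_loss(c_before.mean(1), c_after.mean(1))
        # Independence constraint: common and difference contexts should not overlap.
        loss_independence = F.cosine_similarity(
            common_ctx.mean(1), diff_ctx.mean(1), dim=-1).abs().mean()

        # Common context guides mining of locally unchanged features, which are
        # subtracted from each image to distill local difference features.
        keep_before = torch.sigmoid(self.gate(torch.cat([feat_before, c_before], -1)))
        keep_after = torch.sigmoid(self.gate(torch.cat([feat_after, c_after], -1)))
        local_diff = (feat_before - keep_before * feat_before) + \
                     (feat_after - keep_after * feat_after)

        # Difference context augments the local differences into an
        # omni-representation, which a caption decoder would then translate.
        omni_repr = local_diff + d_before + d_after
        return omni_repr, loss_consistency, loss_independence


if __name__ == "__main__":
    model = CARDSketch()
    before, after = torch.randn(2, 49, 512), torch.randn(2, 49, 512)
    omni, l_con, l_ind = model(before, after)
    print(omni.shape, l_con.item(), l_ind.item())
```

In a full system, the omni-representation would feed a transformer caption decoder (omitted here), and the two auxiliary losses would presumably be weighted into the training objective alongside the captioning loss; the abstract does not specify the exact weighting.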
Related papers
- Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective [50.261681681643076]
We propose a novel metric called SemVarEffect and a benchmark named SemVarBench to evaluate the causality between semantic variations in inputs and outputs in text-to-image synthesis.
Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.
arXiv Detail & Related papers (2024-10-14T08:45:35Z)
- Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning [71.14084801851381]
Change captioning aims to succinctly describe the semantic change between a pair of similar images.
Most existing methods directly capture the difference between them, which risks yielding error-prone difference features.
We propose a distractors-immune representation learning network that correlates the corresponding channels of two image representations.
arXiv Detail & Related papers (2024-07-16T13:00:33Z)
- Changes-Aware Transformer: Learning Generalized Changes Representation [56.917000244470174]
We propose a novel Changes-Aware Transformer (CAT) for refining difference features.
The generalized representation of various changes is learned straightforwardly in the difference feature space.
After refinement, the changed pixels in the difference feature space are closer to each other, which facilitates change detection.
arXiv Detail & Related papers (2023-09-24T12:21:57Z)
- Align, Perturb and Decouple: Toward Better Leverage of Difference Information for RSI Change Detection [24.249552791014644]
Change detection is a widely adopted technique in remote sensing imagery (RSI) analysis.
We propose a series of operations to fully exploit the difference information: Alignment, Perturbation and Decoupling.
arXiv Detail & Related papers (2023-05-30T03:39:53Z)
- Neighborhood Contrastive Transformer for Change Captioning [80.10836469177185]
We propose a neighborhood contrastive transformer to improve the model's perceiving ability for various changes under different scenes.
The proposed method achieves the state-of-the-art performance on three public datasets with different change scenarios.
arXiv Detail & Related papers (2023-03-06T14:39:54Z)
- Describing and Localizing Multiple Changes with Transformers [24.138480002212994]
Change captioning tasks aim to detect changes in image pairs observed before and after a scene change.
We propose a CG-based multi-change captioning dataset.
We benchmark existing state-of-the-art methods of single change captioning on multi-change captioning.
arXiv Detail & Related papers (2021-03-25T21:52:03Z)
- Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning [41.044241265804125]
We propose a novel visual encoder to explicitly distinguish viewpoint changes from semantic changes in the change captioning task.
We also propose a novel reinforcement learning process to fine-tune the attention directly with language evaluation rewards.
Our method outperforms state-of-the-art approaches by a large margin on both the Spot-the-Diff and CLEVR-Change datasets.
arXiv Detail & Related papers (2020-09-30T00:13:49Z)
- Same Features, Different Day: Weakly Supervised Feature Learning for Seasonal Invariance [65.94499390875046]
"Like night and day" is a commonly used expression to imply that two things are completely different.
The aim of this paper is to provide a dense feature representation that can be used to perform localization, sparse matching or image retrieval.
We propose Deja-Vu, a weakly supervised approach to learning season invariant features that does not require pixel-wise ground truth data.
arXiv Detail & Related papers (2020-03-30T12:56:44Z)