Fugu-MT Paper Translation (Abstract): Self-supervised Cross-view Representation Reconstruction for Change Captioning

Paper overview: Self-supervised Cross-view Representation Reconstruction for Change Captioning

arxiv url: http://arxiv.org/abs/2309.16283v1
Date: Thu, 28 Sep 2023 09:28:50 GMT
Status: Translation complete
Last updated in system: 2023-09-29 15:20:20.255005
Title: Self-supervised Cross-view Representation Reconstruction for Change Captioning
Title (reference translation): Self-supervised cross-view representation reconstruction for change captioning
Authors: Yunbin Tu, Liang Li, Li Su, Zheng-Jun Zha, Chenggang Yan, Qingming Huang
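Abstract summary: Change captioning aims to describe the differences between a pair of similar images. Its main challenge is how to learn a stable difference representation under pseudo changes caused by viewpoint changes. The authors propose a self-supervised cross-view representation reconstruction network.
Reference score (independently computed attention score): 113.08380679787247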
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Change captioning aims to describe the difference between a pair of similar images. Its key challenge is how to learn a stable difference representation under pseudo changes caused by viewpoint change. In this paper, we address this by proposing a self-supervised cross-view representation reconstruction (SCORER) network. Concretely, we first design a multi-head token-wise matching to model relationships between cross-view features from similar/dissimilar images. Then, by maximizing cross-view contrastive alignment of two similar images, SCORER learns two view-invariant image representations in a self-supervised way. Based on these, we reconstruct the representations of unchanged objects by cross-attention, thus learning a stable difference representation for caption generation. Further, we devise a cross-modal backward reasoning to improve the quality of caption. This module reversely models a "hallucination" representation with the caption and "before" representation. By pushing it closer to the "after" representation, we enforce the caption to be informative about the difference in a self-supervised manner. Extensive experiments show our method achieves the state-of-the-art results on four datasets. The code is available at https://github.com/tuyunbin/SCORER.
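Abstract (reference translation): Change captioning aims to describe the differences between a pair of similar images. Its main challenge is how to learn a stable difference representation under pseudo changes caused by viewpoint changes. This paper proposes a self-supervised cross-view representation reconstruction (SCORER) network. Specifically, it first designs a multi-head token-wise matching that models the relationships between cross-view features from similar/dissimilar images. Next, by maximizing the cross-view contrastive alignment of two similar images, SCORER learns two view-invariant image representations in a self-supervised way. Based on these, the representations of unchanged objects are reconstructed by cross-attention, yielding a stable difference representation for caption generation. In addition, a cross-modal backward reasoning module is devised to improve caption quality. This module reversely models a "hallucination" representation from the caption and the "before" representation. By pushing this representation closer to the "after" representation, the caption is forced to be informative about the difference in a self-supervised manner. Extensive experiments show state-of-the-art results on four datasets. The code is available at https://github.com/tuyunbin/SCORER.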
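The contrastive alignment step mentioned in the abstract can be illustrated with a small sketch. This is not the SCORER implementation (see the repository linked above for that). It assumes a single-head matching score (best cosine match per token, averaged in both directions) and a standard InfoNCE-style loss over a batch of "before"/"after" pairs. The function names, the temperature value, and the toy 2-D token features are all hypothetical and chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def token_match_score(tokens_a, tokens_b):
    """Assumed single-head token-wise matching: each token takes its best
    cosine match in the other image; both directions are averaged."""
    a_to_b = sum(max(cosine(a, b) for b in tokens_b) for a in tokens_a) / len(tokens_a)
    b_to_a = sum(max(cosine(b, a) for a in tokens_a) for b in tokens_b) / len(tokens_b)
    return 0.5 * (a_to_b + b_to_a)

def logsumexp(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def contrastive_alignment_loss(before_batch, after_batch, temperature=0.5):
    """InfoNCE-style loss: pair (before[i], after[i]) is the positive,
    every other pairing in the batch is treated as a negative."""
    n = len(before_batch)
    sims = [[token_match_score(before_batch[i], after_batch[j]) / temperature
             for j in range(n)] for i in range(n)]
    loss = 0.0
    for i in range(n):
        row = sims[i]                        # before[i] against all "after" images
        col = [sims[j][i] for j in range(n)]  # after[i] against all "before" images
        loss += logsumexp(row) - row[i]
        loss += logsumexp(col) - col[i]
    return loss / (2 * n)

# Toy data: two images, two tokens each, 2-D features (hypothetical values).
before = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, -1.0]]]
# "After" images: the same content under a slightly shifted viewpoint.
after = [[[0.9, 0.1], [0.1, 0.9]], [[1.0, 0.9], [0.9, -1.0]]]

aligned = contrastive_alignment_loss(before, after)
shuffled = contrastive_alignment_loss(before, after[::-1])
print(f"aligned pairs: {aligned:.4f}, mismatched pairs: {shuffled:.4f}")
```

In this toy setup the loss is lower when each "before" image is paired with its own "after" image than when the pairs are swapped, which is the behavior a contrastive alignment objective is meant to reward. The multi-head variant described in the abstract would compute such matching scores per head; how the heads are combined is not specified in the abstract and is left out here.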

Related papers