Related papers: Rethinking the Reference-based Distinctive Image Captioning

URL: http://arxiv.org/abs/2207.11118v1
Date: Fri, 22 Jul 2022 14:49:54 GMT
Title: Rethinking the Reference-based Distinctive Image Captioning
Authors: Yangjun Mao, Long Chen, Zhihong Jiang, Dong Zhang, Zhimeng Zhang, Jian Shao, Jun Xiao
Abstract summary: A recent work proposes to generate distinctive captions by comparing the target image with a set of semantically similar reference images. We develop a strong Transformer-based Ref-DIC baseline, dubbed TransDIC. For more trustworthy benchmarking, we propose a new evaluation metric named DisCIDEr for Ref-DIC.
Score: 17.724543105544935
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Distinctive Image Captioning (DIC) -- generating distinctive captions that describe the unique details of a target image -- has received considerable attention over the last few years. A recent DIC work proposes to generate distinctive captions by comparing the target image with a set of semantically similar reference images, i.e., reference-based DIC (Ref-DIC). It aims to generate captions that can tell apart the target and reference images. Unfortunately, the reference images used by existing Ref-DIC works are easy to distinguish: they resemble the target image only at the scene level and share few common objects, such that a Ref-DIC model can trivially generate distinctive captions even without considering the reference images. To ensure Ref-DIC models really perceive the unique objects (or attributes) in target images, we first propose two new Ref-DIC benchmarks. Specifically, we design a two-stage matching mechanism, which strictly controls the similarity between the target and reference images at the object/attribute level (vs. the scene level). Secondly, to generate distinctive captions, we develop a strong Transformer-based Ref-DIC baseline, dubbed TransDIC. It not only extracts visual features from the target image, but also encodes the differences between objects in the target and reference images. Finally, for more trustworthy benchmarking, we propose a new evaluation metric named DisCIDEr for Ref-DIC, which evaluates both the accuracy and distinctiveness of the generated captions. Experimental results demonstrate that our TransDIC can generate distinctive captions. It also outperforms several state-of-the-art models on the two new benchmarks across different metrics.

Related papers

Latent Expression Generation for Referring Image Segmentation and Grounding [13.611995923070426]
Most existing methods rely on a single textual input, which captures only a fraction of the rich information available in the visual domain. This mismatch between rich visual details and sparse textual cues can lead to the misidentification of similar objects. We propose a novel visual grounding framework that leverages multiple latent expressions generated from a single textual input.
arXiv Detail & Related papers (2025-08-07T07:57:27Z)
Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention [62.246950834745796]
Introduces a Group-based Differential Distinctive Captioning Method, a Group-based Differential Memory Attention (GDMA) module, and a new evaluation metric, the Distinctive Word Rate (DisWordRate).
arXiv Detail & Related papers (2025-04-03T11:19:51Z)
Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval [53.89454443114146]
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to retrieve the target image given a reference image and a description without training on the triplet datasets. Previous works generate pseudo-word tokens by projecting the reference image features to the text embedding space. We propose a Knowledge-Enhanced Dual-stream zero-shot composed image retrieval framework (KEDs). KEDs implicitly models the attributes of the reference images by incorporating a database.
arXiv Detail & Related papers (2024-03-24T04:23:56Z)
Decompose Semantic Shifts for Composed Image Retrieval [38.262678009072154]
Composed image retrieval is a type of image retrieval task where the user provides a reference image as a starting point and specifies a text on how to shift from the starting point to the desired target image. We propose a Semantic Shift network (SSN) that explicitly decomposes the semantic shifts into two steps: from the reference image to the visual prototype and from the visual prototype to the target image. The proposed SSN demonstrates a significant improvement of 5.42% and 1.37% on the CIRR and FashionIQ datasets, respectively, and establishes a new state-of-the-art performance.
arXiv Detail & Related papers (2023-09-18T07:21:30Z)
Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred to by a natural language expression. We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches. In the text-to-image decoder, the text embedding is used to query the visual feature and localize the corresponding target. Meanwhile, the image-to-text decoder reconstructs the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z)
Improving Reference-based Distinctive Image Captioning with Contrastive Rewards [52.406331702017596]
A recent DIC method proposes to generate distinctive captions by comparing the target image with a set of semantically similar reference images. We propose two new Ref-DIC benchmarks and develop a Transformer-based Ref-DIC baseline, TransDIC. For more trustworthy benchmarking, we propose a new evaluation metric named DisCIDEr for Ref-DIC.
arXiv Detail & Related papers (2023-06-25T14:37:13Z)
Collaborative Group: Composed Image Retrieval via Consensus Learning from Noisy Annotations [67.92679668612858]
We propose the Consensus Network (Css-Net), inspired by the psychological concept that groups outperform individuals. Css-Net comprises two core components: (1) a consensus module with four diverse compositors, each generating distinct image-text embeddings; and (2) a Kullback-Leibler divergence loss that encourages learning of inter-compositor interactions. On benchmark datasets, particularly FashionIQ, Css-Net demonstrates marked improvements. Notably, it achieves significant recall gains, with a 2.77% increase in R@10 and 6.67% boost in R@50, underscoring its
arXiv Detail & Related papers (2023-06-03T11:50:44Z)
IR-GAN: Image Manipulation with Linguistic Instruction by Increment Reasoning [110.7118381246156]
The Increment Reasoning Generative Adversarial Network (IR-GAN) aims to reason about the consistency between the visual increment in images and the semantic increment in instructions. First, we introduce word-level and instruction-level instruction encoders to learn the user's intention from history-correlated instructions as the semantic increment. Second, we embed the representation of the semantic increment into that of the source image to generate the target image, where the source image serves as a referring auxiliary.
arXiv Detail & Related papers (2022-04-02T07:48:39Z)
Group-based Distinctive Image Captioning with Memory Attention [45.763534774116856]
The Group-based Distinctive Captioning Model (GdisCap) improves the distinctiveness of image captions. A new evaluation metric, distinctive word rate (DisWordRate), is proposed to measure the distinctiveness of captions.
arXiv Detail & Related papers (2021-08-20T12:46:36Z)
Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE). Experimental results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when it encounters semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.