Rethinking the Reference-based Distinctive Image Captioning
- URL: http://arxiv.org/abs/2207.11118v1
- Date: Fri, 22 Jul 2022 14:49:54 GMT
- Title: Rethinking the Reference-based Distinctive Image Captioning
- Authors: Yangjun Mao, Long Chen, Zhihong Jiang, Dong Zhang, Zhimeng Zhang, Jian
Shao, Jun Xiao
- Abstract summary: A recent work proposes to generate distinctive captions by comparing the target image with a set of semantic-similar reference images.
We develop a strong Transformer-based Ref-DIC baseline, dubbed as TransDIC.
For more trustworthy benchmarking, we propose a new evaluation metric named DisCIDEr for Ref-DIC.
- Score: 17.724543105544935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distinctive Image Captioning (DIC) -- generating distinctive captions that
describe the unique details of a target image -- has received considerable
attention over the last few years. A recent DIC work proposes to generate
distinctive captions by comparing the target image with a set of
semantic-similar reference images, i.e., reference-based DIC (Ref-DIC). It aims
to make the generated captions can tell apart the target and reference images.
Unfortunately, reference images used by existing Ref-DIC works are easy to
distinguish: these reference images only resemble the target image at
scene-level and have few common objects, such that a Ref-DIC model can
trivially generate distinctive captions even without considering the reference
images. To ensure Ref-DIC models really perceive the unique objects (or
attributes) in target images, we first propose two new Ref-DIC benchmarks.
Specifically, we design a two-stage matching mechanism, which strictly controls
the similarity between the target and reference images at object-/attribute-
level (vs. scene-level). Secondly, to generate distinctive captions, we develop
a strong Transformer-based Ref-DIC baseline, dubbed as TransDIC. It not only
extracts visual features from the target image, but also encodes the
differences between objects in the target and reference images. Finally, for
more trustworthy benchmarking, we propose a new evaluation metric named
DisCIDEr for Ref-DIC, which evaluates both the accuracy and distinctiveness of
the generated captions. Experimental results demonstrate that our TransDIC can
generate distinctive captions. Besides, it outperforms several state-of-the-art
models on the two new benchmarks over different metrics.
Related papers
- Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval [53.89454443114146]
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to retrieve the target image given a reference image and a description without training on the triplet datasets.
Previous works generate pseudo-word tokens by projecting the reference image features to the text embedding space.
We propose a Knowledge-Enhanced Dual-stream zero-shot composed image retrieval framework (KEDs)
KEDs implicitly models the attributes of the reference images by incorporating a database.
arXiv Detail & Related papers (2024-03-24T04:23:56Z) - Decompose Semantic Shifts for Composed Image Retrieval [38.262678009072154]
Composed image retrieval is a type of image retrieval task where the user provides a reference image as a starting point and specifies a text on how to shift from the starting point to the desired target image.
We propose a Semantic Shift network (SSN) that explicitly decomposes the semantic shifts into two steps: from the reference image to the visual prototype and from the visual prototype to the target image.
The proposed SSN demonstrates a significant improvement of 5.42% and 1.37% on the CIRR and FashionIQ datasets, respectively, and establishes a new state-of-the-art performance.
arXiv Detail & Related papers (2023-09-18T07:21:30Z) - Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target.
Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z) - Improving Reference-based Distinctive Image Captioning with Contrastive
Rewards [52.406331702017596]
A recent DIC method proposes to generate distinctive captions by comparing the target image with a set of semantic-similar reference images.
We propose two new Ref-DIC benchmarks and develop a Transformer-based Ref-DIC baseline TransDIC.
For more trustworthy benchmarking, we propose a new evaluation metric named DisCIDEr for Ref-DIC.
arXiv Detail & Related papers (2023-06-25T14:37:13Z) - IR-GAN: Image Manipulation with Linguistic Instruction by Increment
Reasoning [110.7118381246156]
Increment Reasoning Generative Adversarial Network (IR-GAN) aims to reason consistency between visual increment in images and semantic increment in instructions.
First, we introduce the word-level and instruction-level instruction encoders to learn user's intention from history-correlated instructions as semantic increment.
Second, we embed the representation of semantic increment into that of source image for generating target image, where source image plays the role of referring auxiliary.
arXiv Detail & Related papers (2022-04-02T07:48:39Z) - Two-stage Visual Cues Enhancement Network for Referring Image
Segmentation [89.49412325699537]
Referring Image (RIS) aims at segmenting the target object from an image referred by one given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net)
Through the two-stage enhancement, our proposed TV-Net enjoys better performances in learning fine-grained matching behaviors between the natural language expression and image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z) - Group-based Distinctive Image Captioning with Memory Attention [45.763534774116856]
Group-based Distinctive Captioning Model (GdisCap) improves the distinctiveness of image captions.
New evaluation metric, distinctive word rate (DisWordRate) is proposed to measure the distinctiveness of captions.
arXiv Detail & Related papers (2021-08-20T12:46:36Z) - Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning based metrics for image captioning, which we call Intrinsic Image Captioning Evaluation(I2CE)
Experiment results show that our proposed method can keep robust performance and give more flexible scores to candidate captions when encountered with semantic similar expression or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.