Fine-Grained Image Captioning with Global-Local Discriminative Objective
- URL: http://arxiv.org/abs/2007.10662v1
- Date: Tue, 21 Jul 2020 08:46:02 GMT
- Title: Fine-Grained Image Captioning with Global-Local Discriminative Objective
- Authors: Jie Wu, Tianshui Chen, Hefeng Wu, Zhi Yang, Guangchun Luo, Liang Lin
- Abstract summary: We propose a novel global-local discriminative objective to facilitate generating fine-grained descriptive captions.
We evaluate the proposed method on the widely used MS-COCO dataset.
- Score: 80.73827423555655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Significant progress has been made in recent years in image captioning, an
active topic in the fields of vision and language. However, existing methods
tend to yield overly general captions that consist of some of the most frequent
words/phrases, resulting in inaccurate and indistinguishable descriptions (see
Figure 1). This is primarily due to (i) the conservative characteristic of
traditional training objectives that drives the model to generate correct but
hardly discriminative captions for similar images and (ii) the uneven word
distribution of the ground-truth captions, which encourages generating highly
frequent words/phrases while suppressing the less frequent but more concrete
ones. In this work, we propose a novel global-local discriminative objective
that is formulated on top of a reference model to facilitate generating
fine-grained descriptive captions. Specifically, from a global perspective, we
design a novel global discriminative constraint that pulls the generated
sentence to better discern the corresponding image from all others in the
entire dataset. From a local perspective, a local discriminative constraint is
proposed to increase the attention paid to less frequent but more concrete
words/phrases, thus facilitating the generation of captions that
better describe the visual details of the given images. We evaluate the
proposed method on the widely used MS-COCO dataset, where it outperforms the
baseline methods by a sizable margin and achieves competitive performance against
existing leading approaches. We also conduct self-retrieval experiments to
demonstrate the discriminability of the proposed method.
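For intuition only, the sketch below illustrates the two kinds of constraint described in the abstract: a global discriminative term that encourages each generated caption to match its own image better than any other image, and a local term that up-weights less frequent, more concrete words. It is a minimal NumPy sketch, not the authors' implementation: the batch is used as a stand-in for "all others in the entire dataset", the inverse-frequency rule is an assumed heuristic, and all function names and hyperparameters (margin, alpha) are illustrative.

```python
# Minimal sketch of a global-local discriminative objective (assumptions noted inline).
import numpy as np

def global_discriminative_loss(caption_feats, image_feats, margin=0.2):
    """Contrastive ranking loss: each caption embedding should score higher with
    its own image than with the other images in the batch (used here as a proxy
    for the whole dataset)."""
    c = caption_feats / np.linalg.norm(caption_feats, axis=1, keepdims=True)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    sim = c @ v.T                       # sim[i, j]: caption i vs. image j
    pos = np.diag(sim)                  # matched caption-image pairs
    # hinge penalty whenever a mismatched image comes within `margin` of the match
    viol = np.maximum(0.0, margin + sim - pos[:, None])
    np.fill_diagonal(viol, 0.0)
    return viol.mean()

def local_discriminative_weights(word_counts, alpha=0.7):
    """Per-word weights that emphasize less frequent (more concrete) words.
    A simple inverse-frequency heuristic, not the paper's exact weighting rule."""
    counts = np.asarray(list(word_counts.values()), dtype=float)
    weights = (counts.sum() / counts) ** alpha
    weights /= weights.mean()           # keep the average weight at 1
    return dict(zip(word_counts.keys(), weights))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cap, img = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
    print("global loss:", global_discriminative_loss(cap, img))
    print(local_discriminative_weights({"a": 5000, "dog": 300, "dalmatian": 7}))
```

In such a setup, the global loss would be added to the usual captioning objective, while the per-word weights would rescale the word-level training signal so that rare but descriptive terms are not drowned out by frequent generic ones.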
Related papers
- StegoGAN: Leveraging Steganography for Non-Bijective Image-to-Image Translation [18.213286385769525]
CycleGAN-based methods are known to hide the mismatched information in the generated images to bypass cycle consistency objectives.
We introduce StegoGAN, a novel model that leverages steganography to prevent spurious features in generated images.
Our approach enhances the semantic consistency of the translated images without requiring additional postprocessing or supervision.
arXiv Detail & Related papers (2024-03-29T12:23:58Z) - LDCA: Local Descriptors with Contextual Augmentation for Few-Shot
Learning [0.0]
We introduce a novel approach termed "Local Descriptor with Contextual Augmentation (LDCA)"
LDCA bridges the gap between local and global understanding by leveraging an adaptive global contextual enhancement module.
Experiments underscore the efficacy of our method, showing a maximal absolute improvement of 20% over the next-best on fine-grained classification datasets.
arXiv Detail & Related papers (2024-01-24T14:44:48Z) - Semi-supervised Semantic Segmentation Meets Masked Modeling:Fine-grained
Locality Learning Matters in Consistency Regularization [31.333862320143968]
Semi-supervised semantic segmentation aims to utilize limited labeled images and abundant unlabeled images to achieve label-efficient learning.
We propose a novel framework called MaskMatch, which enables fine-grained locality learning to achieve better dense segmentation.
arXiv Detail & Related papers (2023-12-14T03:28:53Z) - Rewrite Caption Semantics: Bridging Semantic Gaps for
Language-Supervised Semantic Segmentation [100.81837601210597]
We propose Concept Curation (CoCu) to bridge the gap between visual and textual semantics in pre-training data.
CoCu achieves superb zero-shot transfer performance and boosts the language-supervised segmentation baseline by a large margin.
arXiv Detail & Related papers (2023-09-24T00:05:39Z) - Cross-Domain Image Captioning with Discriminative Finetuning [20.585138136033905]
Fine-tuning an out-of-the-box neural captioner with a self-supervised discriminative communication objective helps to recover a plain, visually descriptive language.
We show that discriminatively finetuned captions are more helpful than either vanilla ClipCap captions or ground-truth captions for human annotators tasked with an image discrimination task.
arXiv Detail & Related papers (2023-04-04T09:33:16Z) - Switching to Discriminative Image Captioning by Relieving a Bottleneck
of Reinforcement Learning [24.676231888909097]
We investigate the cause of the unexpectedly low discriminativeness and show that RL has a deeply rooted side effect of limiting the output words to high-frequency words.
We drastically recast discriminative image captioning as a much simpler task of encouraging low-frequency word generation.
Our methods significantly enhance the discriminativeness of off-the-shelf RL models and even outperform previous discriminativeness-aware methods with much smaller computational costs.
arXiv Detail & Related papers (2022-12-06T18:55:20Z) - Word-Level Fine-Grained Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story with a global consistency across dynamic scenes and characters.
Current works still struggle with output images' quality and consistency, and rely on additional semantic information or auxiliary captioning networks.
We first introduce a new sentence representation, which incorporates word information from all story sentences to mitigate the inconsistency problem.
Then, we propose a new discriminator with fusion features to improve image quality and story consistency.
arXiv Detail & Related papers (2022-08-03T21:01:47Z) - Region-level Active Learning for Cluttered Scenes [60.93811392293329]
We introduce a new strategy that subsumes previous Image-level and Object-level approaches into a generalized, Region-level approach.
We show that this approach significantly decreases labeling effort and improves rare object search on realistic data with inherent class-imbalance and cluttered scenes.
arXiv Detail & Related papers (2021-08-20T14:02:38Z) - Distributed Attention for Grounded Image Captioning [55.752968732796354]
We study the problem of weakly supervised grounded image captioning.
The goal is to automatically generate a sentence describing the context of the image with each noun word grounded to the corresponding region in the image.
arXiv Detail & Related papers (2021-08-02T17:28:33Z) - MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase
Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z)