Switching to Discriminative Image Captioning by Relieving a Bottleneck of Reinforcement Learning
- URL: http://arxiv.org/abs/2212.03230v1
- Date: Tue, 6 Dec 2022 18:55:20 GMT
- Title: Switching to Discriminative Image Captioning by Relieving a Bottleneck of Reinforcement Learning
- Authors: Ukyo Honda, Taro Watanabe, Yuji Matsumoto
- Abstract summary: We investigate the cause of the unexpectedly low discriminativeness and show that RL has a deeply rooted side effect of limiting the output words to high-frequency words.
We drastically recast discriminative image captioning as a much simpler task of encouraging low-frequency word generation.
Our methods significantly enhance the discriminativeness of off-the-shelf RL models and even outperform previous discriminativeness-aware methods with much smaller computational costs.
- Score: 24.676231888909097
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Discriminativeness is a desirable feature of image captions: captions should
describe the characteristic details of input images. However, recent
high-performing captioning models, which are trained with reinforcement
learning (RL), tend to generate overly generic captions despite their high
performance in various other criteria. First, we investigate the cause of the
unexpectedly low discriminativeness and show that RL has a deeply rooted side
effect of limiting the output words to high-frequency words. The limited
vocabulary is a severe bottleneck for discriminativeness as it is difficult for
a model to describe the details beyond its vocabulary. Then, based on this
identification of the bottleneck, we drastically recast discriminative image
captioning as a much simpler task of encouraging low-frequency word generation.
Inspired by long-tail classification and debiasing methods, we propose methods
that easily switch off-the-shelf RL models to discriminativeness-aware models
with only a single-epoch fine-tuning of a part of the parameters. Extensive
experiments demonstrate that our methods significantly enhance the
discriminativeness of off-the-shelf RL models and even outperform previous
discriminativeness-aware methods with much smaller computational costs.
Detailed analysis and human evaluation also verify that our methods boost the
discriminativeness without sacrificing the overall quality of captions.
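As a concrete illustration of the debiasing idea the abstract points to, the sketch below applies frequency-based logit adjustment, a standard long-tail technique, at decoding time to make low-frequency words competitive. The word counts, the tau knob, and the function itself are illustrative assumptions, not the authors' exact method.

```python
import torch

def frequency_adjusted_logits(logits, word_freq, tau=1.0):
    # Subtract a scaled log-frequency prior from the decoder logits so that
    # low-frequency (more descriptive) words become competitive at decoding.
    # logits:    (batch, vocab) raw decoder scores
    # word_freq: (vocab,) training-corpus counts of each vocabulary word
    # tau:       adjustment strength (hypothetical hyperparameter)
    log_prior = torch.log(word_freq.clamp(min=1).float())
    return logits - tau * log_prior
```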
Related papers
- Fluent and Accurate Image Captioning with a Self-Trained Reward Model [47.213906345208315]
We propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives.
Our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness.
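One way to picture a reward model trained against self-generated negatives is a pairwise contrastive objective like the sketch below; the embedding inputs and the temp scale are assumptions for illustration, not Self-Cap's actual formulation.

```python
import torch.nn.functional as F

def self_negative_reward_loss(img_emb, ref_cap_emb, self_neg_emb, temp=0.07):
    # Train the reward model to score the reference caption above the
    # model's own (self-generated) negative caption for the same image.
    # All embeddings are (B, D).
    pos = F.cosine_similarity(img_emb, ref_cap_emb) / temp
    neg = F.cosine_similarity(img_emb, self_neg_emb) / temp
    return -F.logsigmoid(pos - neg).mean()
```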
arXiv Detail & Related papers (2024-08-29T18:00:03Z)
- IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers [31.455819448471157]
Generative training has been demonstrated to be powerful for building vision-language models.
On zero-shot discriminative benchmarks, there is still a performance gap between models trained with generative and discriminative objectives.
In this paper, we aim to narrow this gap by improving the efficacy of generative training on classification tasks.
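On a plausible reading of "information gain", a generative captioner can classify zero-shot by scoring each class prompt by how much conditioning on the image raises its likelihood. The sketch below assumes precomputed conditional and unconditional log-likelihoods and a weighting lam; it is not the paper's exact scoring rule.

```python
import torch

def information_gain_scores(logp_text_given_image, logp_text, lam=1.0):
    # (num_classes,) log p(prompt | image) minus the unconditional
    # log p(prompt), so generic high-prior prompts do not dominate.
    return logp_text_given_image - lam * logp_text

# usage: pred = information_gain_scores(cond_ll, uncond_ll).argmax().item()
```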
arXiv Detail & Related papers (2023-11-27T19:00:06Z)
- Pragmatic Inference with a CLIP Listener for Contrastive Captioning [10.669625017690658]
We propose a method for generating discriminative captions that distinguish target images from very similar alternative distractor images.
Our approach is built on a pragmatic inference procedure that formulates captioning as a reference game between a speaker and a listener.
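A minimal reference-game reranker in this spirit treats CLIP as the listener and keeps captions it would resolve to the target image rather than the distractors. The tensor shapes and the alpha mixing weight below are assumptions, not the paper's exact procedure.

```python
import torch

def pragmatic_rescore(clip_logits, speaker_logp, alpha=0.5):
    # clip_logits:  (num_captions, num_images) CLIP caption-image scores,
    #               with the target image in column 0.
    # speaker_logp: (num_captions,) caption log-probs from the captioner.
    listener_logp = clip_logits.log_softmax(dim=-1)[:, 0]  # P(target | caption)
    return alpha * listener_logp + (1 - alpha) * speaker_logp
```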
arXiv Detail & Related papers (2023-06-15T02:22:28Z)
- Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners [88.07317175639226]
We propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners.
Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information.
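Abstracting away the diffusion plumbing, classification then reduces to comparing how strongly each candidate class prompt engages the image in cross-attention; the extraction hooks and the mean-pooling choice below are placeholders, not DSD's actual aggregation.

```python
def classify_by_cross_attention(attn_maps_per_class):
    # attn_maps_per_class: dict label -> cross-attention tensor collected
    # while running the diffusion model on that class's prompt (hooks not
    # shown). Pick the label whose prompt attends most strongly.
    scores = {label: maps.mean().item()
              for label, maps in attn_maps_per_class.items()}
    return max(scores, key=scores.get)
```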
arXiv Detail & Related papers (2023-05-18T05:41:36Z)
- Discriminative Class Tokens for Text-to-Image Diffusion Models [107.98436819341592]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images are: (i) more accurate and of higher quality than standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
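A single optimization step for such a learnable class token might look like the following; generate_fn and classifier are assumed differentiable wrappers, and the plain gradient update is for illustration only.

```python
import torch
import torch.nn.functional as F

def class_token_step(token_emb, generate_fn, classifier, target, lr=1e-3):
    # Generate with the current token, then nudge the token embedding so a
    # frozen classifier assigns the target class to the generated image.
    token_emb = token_emb.detach().requires_grad_(True)
    image = generate_fn(token_emb)
    loss = F.cross_entropy(classifier(image), target)
    loss.backward()
    return (token_emb - lr * token_emb.grad).detach()
```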
arXiv Detail & Related papers (2023-03-30T05:25:20Z)
- Enhancing Fine-Grained Classification for Low Resolution Images [97.82441158440527]
Low-resolution images suffer from the inherent challenges of limited information content and the absence of fine details useful for sub-category classification.
This research proposes a novel attribute-assisted loss, which utilizes ancillary information to learn discriminative features for classification.
The proposed loss function enables a model to learn class-specific discriminative features, while incorporating attribute-level separability.
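One simple way to combine class supervision with attribute-level separability is a center-style penalty on attribute groups, as sketched below; the attr_centers table and the lam weight are illustrative stand-ins, not the paper's exact loss.

```python
import torch.nn.functional as F

def attribute_assisted_loss(logits, labels, feats, attr_centers, attrs, lam=0.1):
    # logits: (B, num_classes); feats: (B, D); attrs: (B,) attribute ids;
    # attr_centers: (num_attrs, D) learnable centers, indexed by attrs.
    ce = F.cross_entropy(logits, labels)
    pull = (feats - attr_centers[attrs]).pow(2).sum(dim=1).mean()
    return ce + lam * pull
```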
arXiv Detail & Related papers (2021-05-01T13:19:02Z)
- Discriminatively-Tuned Generative Classifiers for Robust Natural Language Inference [59.62779187457773]
We propose GenNLI, a generative classifier for natural language inference (NLI).
We compare it to five baselines, including discriminative models and large-scale pretrained language representation models like BERT.
Experiments show that GenNLI outperforms both discriminative and pretrained baselines across several challenging NLI experimental settings.
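The generative-classifier recipe itself is compact: pick the label under which the hypothesis is most likely given the premise. The score_fn below is an assumed wrapper around a label-conditioned generative model, not GenNLI's actual interface.

```python
LABELS = ("entailment", "neutral", "contradiction")

def generative_nli(score_fn, premise, hypothesis):
    # score_fn(premise, hypothesis, label) -> log p(hypothesis | premise, label)
    scores = [score_fn(premise, hypothesis, label) for label in LABELS]
    return LABELS[max(range(len(LABELS)), key=scores.__getitem__)]
```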
arXiv Detail & Related papers (2020-10-08T04:44:00Z)
- Discriminative Residual Analysis for Image Set Classification with Posture and Age Variations [27.751472312581228]
Discriminant Residual Analysis (DRA) is proposed to improve the classification performance.
DRA attempts to obtain a powerful projection which casts the residual representations into a discriminant subspace.
Two regularization approaches are used to deal with the probable small sample size problem.
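An LDA-style reading of the projection step, with a ridge term standing in for small-sample-size regularization: the scatter matrices are assumed precomputed over residual representations, and this is a sketch, not DRA verbatim.

```python
import numpy as np

def discriminant_projection(S_between, S_within, k, reg=1e-4):
    # Solve S_between v = lambda (S_within + reg*I) v and keep the top-k
    # eigenvectors as the discriminant subspace for residual features.
    Sw = S_within + reg * np.eye(S_within.shape[0])
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, S_between))
    order = np.argsort(-eigvals.real)
    return eigvecs[:, order[:k]].real
```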
arXiv Detail & Related papers (2020-08-23T08:53:06Z)
- Blind Face Restoration via Deep Multi-scale Component Dictionaries [75.02640809505277]
We propose a deep face dictionary network (termed DFDNet) to guide the restoration process of degraded observations.
DFDNet generates deep dictionaries for perceptually significant face components from high-quality images.
Component AdaIN is leveraged to eliminate the style diversity between the input and dictionary features.
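AdaIN itself is small enough to show in full: re-style the dictionary component so its channel statistics match the input feature's. Tensor shapes below follow the usual (N, C, H, W) convention; the component selection logic is omitted, and this is a generic AdaIN sketch rather than DFDNet's full module.

```python
import torch

def component_adain(dict_feat, input_feat, eps=1e-5):
    # Normalize the dictionary feature with its own channel-wise statistics,
    # then re-scale and re-shift it with the input feature's statistics.
    d_mu = dict_feat.mean(dim=(2, 3), keepdim=True)
    d_std = dict_feat.std(dim=(2, 3), keepdim=True) + eps
    i_mu = input_feat.mean(dim=(2, 3), keepdim=True)
    i_std = input_feat.std(dim=(2, 3), keepdim=True) + eps
    return i_std * (dict_feat - d_mu) / d_std + i_mu
```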
arXiv Detail & Related papers (2020-08-02T07:02:07Z)
- Fine-Grained Image Captioning with Global-Local Discriminative Objective [80.73827423555655]
We propose a novel global-local discriminative objective to facilitate generating fine-grained descriptive captions.
We evaluate the proposed method on the widely used MS-COCO dataset.
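Discriminative captioning objectives of this kind typically build on a batch-level ranking term like the generic sketch below, where each caption should match its own image better than any other in the batch; the margin value and the loss shape are assumptions, not the paper's exact global-local formulation.

```python
import torch
import torch.nn.functional as F

def caption_image_ranking_loss(img_emb, cap_emb, margin=0.2):
    # Cosine similarity matrix over the batch; penalize any mismatched
    # caption-image pair that comes within `margin` of the matched score.
    sim = F.normalize(img_emb, dim=1) @ F.normalize(cap_emb, dim=1).t()
    pos = sim.diag().unsqueeze(1)           # matched-pair scores
    loss = (margin + sim - pos).clamp(min=0)
    loss.fill_diagonal_(0)                  # ignore matched pairs
    return loss.mean()
```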
arXiv Detail & Related papers (2020-07-21T08:46:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.