Macroscopic Control of Text Generation for Image Captioning
- URL: http://arxiv.org/abs/2101.08000v1
- Date: Wed, 20 Jan 2021 07:20:07 GMT
- Title: Macroscopic Control of Text Generation for Image Captioning
- Authors: Zhangzi Zhu, Tianlei Wang, and Hong Qu
- Abstract summary: Two novel methods are introduced to address two problems in image captioning: limited controllability and diversity, and occasional extremely poor-quality captions.
For the former problem, we introduce a control signal that can control macroscopic sentence attributes such as sentence quality, sentence length, sentence tense, and the number of nouns.
For the latter problem, we propose a strategy in which an image-text matching model is trained to measure the quality of sentences generated in both forward and backward directions, and the better one is chosen.
- Score: 4.742874328556818
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although image captioning models can generate impressive descriptions
for a given image, challenges remain: (1) the controllability and diversity of
existing models are still far from satisfactory; (2) models may occasionally
produce extremely poor-quality captions.
In this paper, two novel methods are introduced to address these problems
respectively. Specifically, for the former problem, we introduce a control
signal that can control macroscopic sentence attributes, such as sentence
quality, sentence length, sentence tense, and the number of nouns. With such a
control signal, the controllability and diversity of existing captioning models
are enhanced. For the latter problem, we propose a strategy in which an
image-text matching model is trained to measure the quality of sentences
generated in both forward and backward directions, and the better one is
finally chosen. As a result, this strategy can effectively reduce the
proportion of poor-quality sentences. Our proposed methods can be easily
applied to most image
captioning models to improve their overall performance. Based on the Up-Down
model, the experimental results show that our methods achieve
BLEU-4/CIDEr/SPICE scores of 37.5/120.3/21.5 on the MSCOCO Karpathy test split with
cross-entropy training, which surpass the results of other state-of-the-art
methods trained by cross-entropy loss.
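The forward/backward selection strategy can be pictured with a short sketch. This is a hedged illustration of the idea described in the abstract, not the authors' code: the two decoders and the image-text matching scorer below are hypothetical stand-ins for trained models.

```python
# Hedged sketch of the forward/backward selection strategy described above.
# All callables are placeholder stand-ins for trained models.

def select_caption(image, decode_forward, decode_backward, match_score):
    """Decode one caption in each direction and keep the one the matching model prefers."""
    forward_caption = decode_forward(image)    # caption decoded left-to-right
    backward_caption = decode_backward(image)  # caption decoded right-to-left, then re-ordered
    if match_score(image, forward_caption) >= match_score(image, backward_caption):
        return forward_caption
    return backward_caption


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs; real use would plug in trained models.
    dummy_image = "image-features"
    fwd = lambda img: "a dog is running on the grass"
    bwd = lambda img: "a dog runs across a grassy field"
    score = lambda img, caption: float(len(caption))   # placeholder matching score
    print(select_caption(dummy_image, fwd, bwd, score))
```

Because both candidates come from the same captioner, the overhead is one extra decoding pass and two matching-model scores per image. The control signal, by contrast, is described only as an extra conditioning input on macroscopic attributes, so no mechanism is sketched for it here.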
Related papers
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in the source language into an image containing its translation in the target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves performance competitive with cascaded models while using only 70.9% of their parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z) - Information Theoretic Text-to-Image Alignment [49.396917351264655]
We present a novel method that relies on an information-theoretic alignment measure to steer image generation.
Our method is on par with or superior to the state-of-the-art, yet requires nothing but a pre-trained denoising network to estimate MI.
arXiv Detail & Related papers (2024-05-31T12:20:02Z) - Text Data-Centric Image Captioning with Interactive Prompts [20.48013600818985]
Supervised image captioning approaches have made great progress, but it is challenging to collect high-quality human-annotated image-text data.
This paper proposes a new Text data-centric approach with Interactive Prompts for image Captioning, named TIPCap.
arXiv Detail & Related papers (2024-03-28T07:43:49Z) - Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion
Models [68.47333676663312]
We show a simple modification of classifier-free guidance can help disentangle image factors in text-to-image models.
The key idea of our method, Contrastive Guidance, is to characterize an intended factor with two prompts that differ in minimal tokens.
We illustrate its benefits in three scenarios: (1) to guide domain-specific diffusion models trained on an object class, (2) to gain continuous, rig-like controls for text-to-image generation, and (3) to improve the performance of zero-shot image editors.
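A hedged reading of that key idea, sketched below: where classifier-free guidance pushes the noise prediction from an unconditional estimate toward a conditional one, a contrastive variant can instead push from the prediction for one prompt toward the prediction for a minimally different prompt, isolating the factor in which they differ. The `denoise` callable is a hypothetical stand-in for a diffusion noise predictor, not the paper's code.

```python
import numpy as np

def contrastive_guidance(denoise, x_t, t, prompt_pos, prompt_neg, scale=7.5):
    """Hedged sketch: steer the denoiser along the direction separating two
    minimally different prompts, instead of the usual uncond->cond direction."""
    eps_pos = denoise(x_t, t, prompt_pos)          # prediction with the intended factor
    eps_neg = denoise(x_t, t, prompt_neg)          # prediction for the contrasting prompt
    return eps_neg + scale * (eps_pos - eps_neg)   # guided noise estimate

if __name__ == "__main__":
    # Toy denoiser so the sketch runs end to end.
    rng = np.random.default_rng(0)
    toy_denoise = lambda x, t, p: x * 0.1 + 0.01 * len(p)
    x_t = rng.standard_normal((4, 4))
    out = contrastive_guidance(toy_denoise, x_t, t=10,
                               prompt_pos="a photo of a smiling dog",
                               prompt_neg="a photo of a dog")
    print(out.shape)
```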
arXiv Detail & Related papers (2024-02-21T03:01:17Z) - The Right Losses for the Right Gains: Improving the Semantic Consistency
of Deep Text-to-Image Generation with Distribution-Sensitive Losses [0.35898124827270983]
We propose a contrastive learning approach with a novel combination of two loss functions: fake-to-fake loss and fake-to-real loss.
We test this approach on two baseline models: SSAGAN and AttnGAN.
Results show that our approach improves the qualitative results on AttnGAN with style blocks on the CUB dataset.
arXiv Detail & Related papers (2023-12-18T00:05:28Z) - Improving Image Captioning Descriptiveness by Ranking and LLM-based
Fusion [17.99150939602917]
State-of-The-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z) - Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
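One way to picture how a matching model can be "repurposed" for generation, as a hedged sketch assuming a CLIP-like scorer rather than the paper's actual procedure: candidate continuations proposed by a language model are re-ranked by the image-text matching score, so decoding drifts toward text that fits the image. The proposal function and scorer below are toy placeholders.

```python
# Hedged sketch: greedy decoding where a hypothetical image-text matcher
# re-ranks candidate continuations proposed by a (toy) language model.

def matching_guided_decode(image, propose_candidates, match_score, max_len=10):
    """Greedily extend a caption, keeping the continuation the matcher prefers."""
    caption = []
    for _ in range(max_len):
        candidates = propose_candidates(caption)        # e.g. top-k words from an LM
        if not candidates:
            break
        best = max(candidates, key=lambda w: match_score(image, caption + [w]))
        caption.append(best)
    return " ".join(caption)

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs; real use would plug in an LM and a CLIP-like model.
    vocab = ["a", "dog", "cat", "runs", "sleeps", "outside", "<eos>"]
    image_words = {"a", "dog", "runs", "outside"}       # pretend visual evidence
    propose = lambda prefix: [w for w in vocab if w not in prefix]
    score = lambda img, words: len(set(words) & image_words)  # placeholder scorer
    print(matching_guided_decode("dummy-image", propose, score))
```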
arXiv Detail & Related papers (2021-11-29T11:01:49Z) - Caption Enriched Samples for Improving Hateful Memes Detection [78.5136090997431]
The hateful meme challenge demonstrates the difficulty of determining whether a meme is hateful or not.
Neither unimodal language models nor multimodal vision-language models reach human-level performance.
arXiv Detail & Related papers (2021-09-22T10:57:51Z) - Comprehensive Image Captioning via Scene Graph Decomposition [51.660090468384375]
We address the challenging problem of image captioning by revisiting the representation of image scene graph.
At the core of our method lies the decomposition of a scene graph into a set of sub-graphs.
We design a deep model to select important sub-graphs, and to decode each selected sub-graph into a single target sentence.
arXiv Detail & Related papers (2020-07-23T00:59:21Z)
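The three-step pipeline summarised in that last entry (decompose the scene graph into sub-graphs, select the important ones, decode each into a sentence) can be sketched with plain data structures. This is a hedged illustration only; the sub-graph enumeration, scorer, and decoder are hypothetical placeholders, not the authors' model.

```python
from itertools import combinations

def caption_from_scene_graph(nodes, edges, score_subgraph, decode_subgraph, top_k=2):
    """Hedged sketch: enumerate small sub-graphs, keep the top-scoring ones,
    and decode each selected sub-graph into its own sentence."""
    subgraphs = []
    for r in (2, 3):                                   # small node subsets as candidate sub-graphs
        for subset in combinations(nodes, r):
            sub_edges = [e for e in edges if e[0] in subset and e[2] in subset]
            if sub_edges:
                subgraphs.append((list(subset), sub_edges))
    ranked = sorted(subgraphs, key=score_subgraph, reverse=True)[:top_k]
    return [decode_subgraph(g) for g in ranked]

if __name__ == "__main__":
    # Toy scene graph: (subject, relation, object) triples over detected objects.
    nodes = ["man", "horse", "field"]
    edges = [("man", "rides", "horse"), ("horse", "stands in", "field")]
    score = lambda g: len(g[1])                        # placeholder importance score
    decode = lambda g: " ".join(g[1][0])               # placeholder decoder: verbalise first edge
    print(caption_from_scene_graph(nodes, edges, score, decode))
```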
This list is automatically generated from the titles and abstracts of the papers in this site.