ANNA: Abstractive Text-to-Image Synthesis with Filtered News Captions
- URL: http://arxiv.org/abs/2301.02160v2
- Date: Mon, 1 Jul 2024 22:14:46 GMT
- Title: ANNA: Abstractive Text-to-Image Synthesis with Filtered News Captions
- Authors: Aashish Anantha Ramakrishnan, Sharon X. Huang, Dongwon Lee
- Abstract summary: Real-world image-caption pairs present in domains such as news data do not use simple and directly descriptive captions.
We launch ANNA, an Abstractive News captioNs dAtaset extracted from online news articles in a variety of different contexts.
We show that techniques such as transfer learning achieve limited success in understanding abstractive captions but still fail to consistently learn the relationships between content and context features.
- Score: 6.066100464517522
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Advancements in Text-to-Image synthesis over recent years have focused more on improving the quality of generated samples using datasets with descriptive prompts. However, real-world image-caption pairs present in domains such as news data do not use simple and directly descriptive captions. With captions containing information on both the image content and underlying contextual cues, they become abstractive in nature. In this paper, we launch ANNA, an Abstractive News captioNs dAtaset extracted from online news articles in a variety of different contexts. We explore the capabilities of current Text-to-Image synthesis models to generate news domain-specific images using abstractive captions by benchmarking them on ANNA, in both standard training and transfer learning settings. The generated images are judged on the basis of contextual relevance, visual quality, and perceptual similarity to ground-truth image-caption pairs. Through our experiments, we show that techniques such as transfer learning achieve limited success in understanding abstractive captions but still fail to consistently learn the relationships between content and context features. The Dataset is available at https://github.com/aashish2000/ANNA .
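As a rough illustration of the benchmarking setup described in the abstract, the sketch below scores a generated image against its abstractive caption with CLIP text-image similarity and against the ground-truth image with LPIPS. The metric choices and model checkpoints are assumptions for illustration, not the paper's exact evaluation protocol.

```python
# Minimal evaluation sketch (assumed metrics, not the paper's exact protocol):
# CLIP text-image similarity as a proxy for contextual relevance, and
# LPIPS as a proxy for perceptual similarity to the ground-truth image.
import torch
import lpips
from PIL import Image
from torchvision.transforms.functional import to_tensor
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
lpips_fn = lpips.LPIPS(net="alex")  # perceptual distance network

def evaluate_sample(generated_path, ground_truth_path, caption):
    gen = Image.open(generated_path).convert("RGB")
    gt = Image.open(ground_truth_path).convert("RGB").resize(gen.size)

    # CLIP similarity between the abstractive caption and the generated image
    inputs = clip_proc(text=[caption], images=gen, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = clip_model(**inputs)
        img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        clip_score = (img_emb * txt_emb).sum().item()

    # LPIPS expects tensors in [-1, 1], shape (N, 3, H, W)
    scale = lambda im: to_tensor(im)[None] * 2.0 - 1.0
    with torch.no_grad():
        perceptual = lpips_fn(scale(gen), scale(gt)).item()

    return {"clip_similarity": clip_score, "lpips": perceptual}
```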
Related papers
- ANCHOR: LLM-driven News Subject Conditioning for Text-to-Image Synthesis [6.066100464517522]
We introduce the Abstractive News Captions with High-level cOntext Representation dataset, containing 70K+ samples sourced from 5 different news media organizations.
Our proposed method, Subject-Aware Finetuning (SAFE), selects and enhances the representation of key subjects in synthesized images by leveraging LLM-generated subject weights.
It also adapts to the domain distribution of news images and captions through custom Domain Fine-tuning, outperforming current T2I baselines on ANCHOR.
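A minimal sketch of the subject-weighting idea summarized above, assuming the per-token weights are produced by an LLM upstream; the function and weighting scheme are hypothetical, not the released SAFE implementation.

```python
import torch

def reweight_subject_tokens(token_embeddings, subject_weights):
    """Hypothetical illustration of subject-aware conditioning.
    token_embeddings: (seq_len, dim) output of a T2I text encoder.
    subject_weights:  (seq_len,) per-token weights, e.g. assigned by an LLM
                      scoring how central each token's subject is.
    Tokens with weight 1.0 are untouched; key subjects get amplified."""
    weights = torch.as_tensor(subject_weights, dtype=token_embeddings.dtype)
    return token_embeddings * weights.unsqueeze(-1)

# Example: boost the embedding of the main news subject 1.5x (toy values).
emb = torch.randn(6, 768)                 # toy encoder output
weights = [1.0, 1.0, 1.5, 1.0, 1.0, 1.0]  # hypothetical LLM-assigned weights
conditioned = reweight_subject_tokens(emb, weights)
```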
arXiv Detail & Related papers (2024-04-15T21:19:10Z)
- VIXEN: Visual Text Comparison Network for Image Difference Captioning [58.16313862434814]
We present VIXEN, a technique that succinctly summarizes in text the visual differences between a pair of images.
Our proposed network linearly maps image features in a pairwise manner, constructing a soft prompt for a pretrained large language model.
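A minimal sketch of the pairwise linear mapping described above, assuming precomputed image features and a frozen LLM; the dimensions and names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class PairwiseSoftPrompt(nn.Module):
    """Sketch of the idea summarized above (dimensions are assumptions):
    concatenate the features of the two images and linearly project them
    into a fixed number of soft-prompt vectors in the LLM embedding space."""
    def __init__(self, img_dim=768, llm_dim=4096, prompt_len=8):
        super().__init__()
        self.prompt_len = prompt_len
        self.llm_dim = llm_dim
        self.proj = nn.Linear(2 * img_dim, prompt_len * llm_dim)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (batch, img_dim) features of the image pair
        fused = torch.cat([feat_a, feat_b], dim=-1)
        prompt = self.proj(fused)  # (batch, prompt_len * llm_dim)
        return prompt.view(-1, self.prompt_len, self.llm_dim)

# The resulting (batch, prompt_len, llm_dim) tensor would be prepended to the
# frozen LLM's input embeddings so it can describe the visual difference.
```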
arXiv Detail & Related papers (2024-02-29T12:56:18Z)
- Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning [13.357749288588039]
Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
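One possible reading of the feature-alignment step described above, sketched as a cosine loss in the shared CLIP embedding space; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def clip_alignment_loss(pseudo_features, target_features):
    """Pull L2-normalized CLIP features of generated (pseudo) images toward
    the corresponding target features in the shared CLIP embedding space.
    This is an illustrative objective, not the paper's exact formulation."""
    pseudo = F.normalize(pseudo_features, dim=-1)
    target = F.normalize(target_features, dim=-1)
    # 1 - cosine similarity, averaged over the batch
    return (1.0 - (pseudo * target).sum(dim=-1)).mean()
```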
arXiv Detail & Related papers (2023-12-14T12:39:29Z)
- Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
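A sketch of the CLIP-based sentence retrieval step, assuming the article has already been split into sentences; the checkpoint and top-k choice are illustrative, not the authors' released pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_sentences(image_path, article_sentences, top_k=3):
    """Rank article sentences by CLIP image-text similarity and keep the
    top-k as visually grounded context for caption generation."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=article_sentences, images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        scores = (txt @ img.T).squeeze(-1)  # one score per sentence
    best = scores.topk(min(top_k, len(article_sentences))).indices
    return [article_sentences[i] for i in best.tolist()]
```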
arXiv Detail & Related papers (2023-08-16T12:39:39Z)
- CapText: Large Language Model-based Caption Generation From Image Context and Description [0.0]
We propose and evaluate a new approach to generate captions from textual descriptions and context alone.
Our approach outperforms current state-of-the-art image-text alignment models such as OSCAR-VinVL on this task in terms of the CIDEr metric.
arXiv Detail & Related papers (2023-06-01T02:40:44Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
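A toy sketch of the kNN memory lookup, assuming precomputed visual features for the external corpus; the retrieval interface and its integration into the attention layer are simplified assumptions.

```python
import torch
import torch.nn.functional as F

class KNNMemory:
    """Toy external memory for retrieval-augmented captioning (a sketch of
    the general idea, not the paper's implementation): store captions keyed
    by visual features and fetch the k most visually similar entries."""
    def __init__(self, keys, values):
        # keys:   (num_items, dim) visual features of the corpus images
        # values: list of num_items caption strings
        self.keys = F.normalize(keys, dim=-1)
        self.values = values

    def retrieve(self, query, k=5):
        # query: (dim,) visual feature of the input image
        q = F.normalize(query, dim=-1)
        sims = self.keys @ q                         # cosine similarity per item
        idx = sims.topk(min(k, len(self.values))).indices
        return [self.values[i] for i in idx.tolist()]

# The retrieved captions would then be encoded and attended to by a
# kNN-augmented attention layer during token prediction.
```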
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- More Control for Free! Image Synthesis with Semantic Diffusion Guidance [79.88929906247695]
Controllable image synthesis models allow creation of diverse images based on text instructions or guidance from an example image.
We introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both.
We conduct experiments on FFHQ and LSUN datasets, and show results on fine-grained text-guided image synthesis.
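A heavily simplified sketch of semantic guidance inside a diffusion sampling step, assuming a callable that scores CLIP similarity between the current sample and the language or image guidance signal; the paper's noise-aware CLIP model and scaling details are omitted.

```python
import torch

def guided_mean(mean, x_t, clip_sim_fn, guidance_scale=100.0):
    """Simplified semantic-guidance step (assumed interface, not the paper's code):
    nudge the denoiser's predicted mean along the gradient of a CLIP similarity
    between the current sample and the guidance signal.
    mean:        (B, C, H, W) predicted mean of p(x_{t-1} | x_t)
    x_t:         (B, C, H, W) current noisy sample
    clip_sim_fn: callable mapping an image batch to a scalar similarity score
    """
    x = x_t.detach().requires_grad_(True)
    sim = clip_sim_fn(x)                   # scalar guidance score
    grad = torch.autograd.grad(sim, x)[0]  # direction that raises similarity
    return mean + guidance_scale * grad
```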
arXiv Detail & Related papers (2021-12-10T18:55:50Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS leverages vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the previous state of the art without any post-processing.
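A sketch of what a text-to-pixel contrastive objective could look like, with assumed shapes and temperature; it is not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def text_to_pixel_loss(pixel_feats, text_feat, target_mask, temperature=0.07):
    """Illustrative text-to-pixel alignment objective.
    pixel_feats: (H*W, D) per-pixel embeddings from the vision-language decoder
    text_feat:   (D,) sentence embedding of the referring expression
    target_mask: (H*W,) binary ground-truth mask (1 = referred region)
    Pixels inside the referred region are pushed toward the text embedding,
    all other pixels are pushed away, via per-pixel binary cross-entropy."""
    pixel = F.normalize(pixel_feats, dim=-1)
    text = F.normalize(text_feat, dim=-1)
    logits = (pixel @ text) / temperature  # (H*W,) scaled cosine scores
    return F.binary_cross_entropy_with_logits(logits, target_mask.float())
```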
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
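A simplified gated graph convolution over detected object regions, illustrating the explicit aggregation step; the layer below is an assumption, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GatedGraphConv(nn.Module):
    """Simplified gated graph convolution over a semantic graph of detected
    object regions: each region aggregates its neighbors' features, modulated
    by a learned gate that decides how much of each neighbor to keep."""
    def __init__(self, dim):
        super().__init__()
        self.message = nn.Linear(dim, dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, node_feats, adjacency):
        # node_feats: (num_regions, dim), adjacency: (num_regions, num_regions)
        neighbors = adjacency @ self.message(node_feats)  # summed messages
        gate = torch.sigmoid(self.gate(torch.cat([node_feats, neighbors], dim=-1)))
        return node_feats + gate * neighbors              # gated aggregation
```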
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
- TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understand our surroundings.
To study how to comprehend text in the context of an image, we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)