ANCHOR: LLM-driven News Subject Conditioning for Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2404.10141v1
- Date: Mon, 15 Apr 2024 21:19:10 GMT
- Title: ANCHOR: LLM-driven News Subject Conditioning for Text-to-Image Synthesis
- Authors: Aashish Anantha Ramakrishnan, Sharon X. Huang, Dongwon Lee,
- Abstract summary: We introduce the Abstractive News Captions with High-level cOntext Representation dataset, containing 70K+ samples sourced from 5 different news media organizations.
Our proposed method Subject-Aware Finetuning (SAFE), selects and enhances the representation of key subjects in synthesized images by leveraging LLM-generated subject weights.
It also adapts to the domain distribution of news images and captions through custom Domain Fine-tuning, outperforming current T2I baselines on ANCHOR.
- Score: 6.066100464517522
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Text-to-Image (T2I) Synthesis has made tremendous strides in enhancing synthesized image quality, but current datasets evaluate model performance only on descriptive, instruction-based prompts. Real-world news image captions take a more pragmatic approach, providing high-level situational and Named-Entity (NE) information and limited physical object descriptions, making them abstractive. To evaluate the ability of T2I models to capture intended subjects from news captions, we introduce the Abstractive News Captions with High-level cOntext Representation (ANCHOR) dataset, containing 70K+ samples sourced from 5 different news media organizations. With Large Language Models (LLM) achieving success in language and commonsense reasoning tasks, we explore the ability of different LLMs to identify and understand key subjects from abstractive captions. Our proposed method Subject-Aware Finetuning (SAFE), selects and enhances the representation of key subjects in synthesized images by leveraging LLM-generated subject weights. It also adapts to the domain distribution of news images and captions through custom Domain Fine-tuning, outperforming current T2I baselines on ANCHOR. By launching the ANCHOR dataset, we hope to motivate research in furthering the Natural Language Understanding (NLU) capabilities of T2I models.
Related papers
- CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification [9.996589403019675]
Person re-identification (ReID) has recently benefited from large pretrained vision-language models such as Contrastive Language-Image Pre-Training (CLIP)
We propose one straightforward solution by leveraging existing image captioning models to generate pseudo captions for person images.
We introduce CLIP-SCGI, a framework that leverages synthesized captions to guide the learning of discriminative and robust representations.
arXiv Detail & Related papers (2024-10-12T06:24:33Z) - VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text (IITC)
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devised a subtask, Image-Text Association (ITA)
arXiv Detail & Related papers (2024-06-14T17:59:40Z) - SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with
Auto-Generated Data [73.23388142296535]
SELMA improves the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets.
We show that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks.
We also show that fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data.
arXiv Detail & Related papers (2024-03-11T17:35:33Z) - EAMA : Entity-Aware Multimodal Alignment Based Approach for News Image Captioning [55.033327333250455]
News image captioning requires model to generate an informative caption rich in entities, with the news image and the associated news article.
Current Multimodal Large Language Models (MLLMs) still bear limitations in handling entity information on news image captioning task.
Our approach achieves better results than all previous models in CIDEr score on GoodNews dataset (72.33 -> 88.39) and NYTimes800k dataset (70.83 -> 85.61)
arXiv Detail & Related papers (2024-02-29T18:03:00Z) - Learning Vision from Models Rivals Learning Vision from Data [54.43596959598465]
We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions.
We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption.
We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs.
arXiv Detail & Related papers (2023-12-28T18:59:55Z) - Mini-DALLE3: Interactive Text to Image by Prompting Large Language
Models [71.49054220807983]
A prevalent limitation persists in the effective communication with T2I models, such as Stable Diffusion, using natural language descriptions.
Inspired by the recently released DALLE3, we revisit the existing T2I systems endeavoring to align human intent and introduce a new task - interactive text to image (iT2I)
We present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models.
arXiv Detail & Related papers (2023-10-11T16:53:40Z) - "Let's not Quote out of Context": Unified Vision-Language Pretraining
for Context Assisted Image Captioning [40.01197694624958]
We propose a new unified Vision-Language (VL) model based on the One For All (OFA) model.
Our approach aims to overcome the context-independent (image and text are treated independently) nature of the existing approaches.
Our system achieves state-of-the-art results with an improvement of up to 8.34 CIDEr score on the benchmark news image captioning datasets.
arXiv Detail & Related papers (2023-06-01T17:34:25Z) - DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot
Text-to-Video Generation [37.25815760042241]
This paper introduces a new framework, dubbed DirecT2V, to generate text-to-video (T2V) videos.
We equip a diffusion model with a novel value mapping method and dual-softmax filtering, which do not require any additional training.
The experimental results validate the effectiveness of our framework in producing visually coherent and storyful videos.
arXiv Detail & Related papers (2023-05-23T17:57:09Z) - ANNA: Abstractive Text-to-Image Synthesis with Filtered News Captions [6.066100464517522]
Real-world image-caption pairs present in domains such as news data do not use simple and directly descriptive captions.
We launch ANNA, an Abstractive News captioNs dAtaset extracted from online news articles in a variety of different contexts.
We show that techniques such as transfer learning achieve limited success in understanding abstractive captions but still fail to consistently learn the relationships between content and context features.
arXiv Detail & Related papers (2023-01-05T17:19:01Z) - Paraphrasing Is All You Need for Novel Object Captioning [126.66301869607656]
Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training.
We present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which wouldally optimize the output captions via paraphrasing.
arXiv Detail & Related papers (2022-09-25T22:56:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.