TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning in Text-to-Image Models
- URL: http://arxiv.org/abs/2503.15283v1
- Date: Wed, 19 Mar 2025 15:03:19 GMT
- Title: TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning in Text-to-Image Models
- Authors: Teng-Fang Hsiao, Bo-Kai Ruan, Yi-Lun Wu, Tzu-Ling Lin, Hong-Han Shuai
- Abstract summary: Training-Free Text-and-Image-to-Image (TF-TI2I) adapts cutting-edge T2I models without the need for additional training. Our approach shows robust performance across various benchmarks, confirming its effectiveness in handling complex image-generation tasks.
- Score: 19.1659725630146
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-and-Image-To-Image (TI2I), an extension of Text-To-Image (T2I), integrates image inputs with textual instructions to enhance image generation. Existing methods often partially utilize image inputs, focusing on specific elements like objects or styles, or they experience a decline in generation quality with complex, multi-image instructions. To overcome these challenges, we introduce Training-Free Text-and-Image-to-Image (TF-TI2I), which adapts cutting-edge T2I models such as SD3 without the need for additional training. Our method capitalizes on the MM-DiT architecture, in which we point out that textual tokens can implicitly learn visual information from vision tokens. We enhance this interaction by extracting a condensed visual representation from reference images, facilitating selective information sharing through Reference Contextual Masking -- this technique confines the usage of contextual tokens to instruction-relevant visual information. Additionally, our Winner-Takes-All module mitigates distribution shifts by prioritizing the most pertinent references for each vision token. Addressing the gap in TI2I evaluation, we also introduce the FG-TI2I Bench, a comprehensive benchmark tailored for TI2I and compatible with existing T2I methods. Our approach shows robust performance across various benchmarks, confirming its effectiveness in handling complex image-generation tasks.
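As a rough illustration of how Reference Contextual Masking and the Winner-Takes-All selection could sit inside MM-DiT-style joint attention, consider the sketch below; the tensor layout, function name, and the way the reference contribution is fused back into the text branch are assumptions for exposition, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code) of Reference Contextual Masking
# and Winner-Takes-All selection in an MM-DiT-style attention block.
import torch

def ti2i_reference_attention(q_vis, k_txt, v_txt, k_ref, v_ref, relevance_mask):
    """
    q_vis:          (B, N_vis, D)     queries of the generated image's vision tokens
    k_txt, v_txt:   (B, N_txt, D)     keys/values of the textual instruction tokens
    k_ref, v_ref:   (B, R, N_ref, D)  condensed contextual tokens from R reference images
    relevance_mask: (B, R, N_ref)     1 where a reference token is instruction-relevant
    """
    d = q_vis.shape[-1]

    # Ordinary text-conditioning branch.
    txt_out = torch.softmax(q_vis @ k_txt.transpose(-1, -2) / d**0.5, -1) @ v_txt

    # Reference Contextual Masking: hide reference tokens the instruction does
    # not ask for, then attend to each reference separately.
    scores = torch.einsum("bnd,brmd->brnm", q_vis, k_ref) / d**0.5   # (B, R, N_vis, N_ref)
    scores = scores.masked_fill(relevance_mask[:, :, None, :] == 0, -1e9)
    ref_out = torch.softmax(scores, -1) @ v_ref                       # (B, R, N_vis, D)

    # Winner-Takes-All: each vision token keeps only its most pertinent reference,
    # limiting the distribution shift that comes from mixing many references.
    winner = scores.amax(-1).argmax(1)                                # (B, N_vis)
    idx = winner[:, None, :, None].expand(-1, 1, -1, d)               # (B, 1, N_vis, D)
    ref_out = ref_out.gather(1, idx).squeeze(1)                       # (B, N_vis, D)

    return txt_out + ref_out
```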
Related papers
- End-to-end Training for Text-to-Image Synthesis using Dual-Text Embeddings [5.217870815854702]
We study an approach to learning text embeddings specifically tailored to the text-to-image synthesis network.
We combine generative and contrastive training and use two embeddings, one optimized to enhance the photo-realism of the generated images and the other seeking to capture text-to-image alignment.
A comprehensive set of experiments on three text-to-image benchmark datasets reveals that two separate embeddings give better results than a shared one, and that this approach compares favourably with methods that use text representations from a pre-trained, discriminatively trained text encoder.
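A minimal sketch of the dual-embedding training signal described above, assuming one embedding is driven by the generator's realism objective and a second by a contrastive image-text alignment term; the loss names and weighting are illustrative, not the paper's implementation.

```python
# Illustrative sketch: combine a generative (realism) loss with a contrastive
# image-text alignment loss computed on a second, alignment-specific embedding.
import torch
import torch.nn.functional as F

def alignment_loss(img_feats, align_txt_feats, temperature=0.07):
    """Symmetric InfoNCE between image features and the alignment text embedding."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(align_txt_feats, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def dual_embedding_loss(gen_loss, img_feats, align_txt_feats, lam=1.0):
    """Realism term (from the generative branch) plus a weighted alignment term."""
    return gen_loss + lam * alignment_loss(img_feats, align_txt_feats)
```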
arXiv Detail & Related papers (2025-02-03T16:40:47Z) - Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP [22.33658954569737]
We build a mutual guidance mechanism, TIMO, that introduces an Image-Guided-Text (IGT) component and a Text-Guided-Image (TGI) component.
Extensive experiments show that TIMO significantly outperforms the state-of-the-art (SOTA) training-free method.
We also propose an enhanced variant, TIMO-S, which even surpasses the best training-required methods by 0.33% at roughly 100 times lower time cost.
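To make the IGT/TGI interplay concrete, the hedged sketch below refines text prototypes with few-shot image prototypes (image-guided text) and nudges test features toward the text space (text-guided image); the blending scheme and hyperparameters are assumptions, not TIMO's actual procedure.

```python
# Hedged sketch of mutual guidance for training-free CLIP few-shot classification.
import torch
import torch.nn.functional as F

def mutual_guidance_logits(test_feats, text_protos, shot_feats, shot_labels,
                           alpha=0.5, beta=0.5):
    """
    test_feats:  (N, D) L2-normalized CLIP features of test images
    text_protos: (C, D) L2-normalized class text embeddings
    shot_feats:  (K, D) L2-normalized few-shot image features; shot_labels: (K,)
    Assumes every class has at least one labeled shot.
    """
    num_classes = text_protos.size(0)
    # Image-Guided-Text: blend each class's text prototype with its few-shot image mean.
    img_protos = torch.stack([shot_feats[shot_labels == c].mean(0)
                              for c in range(num_classes)])
    guided_text = F.normalize(alpha * text_protos + (1 - alpha) * img_protos, dim=-1)
    # Text-Guided-Image: pull each test feature toward a soft mix of text prototypes.
    soft = (test_feats @ text_protos.t()).softmax(-1)
    guided_img = F.normalize(test_feats + beta * soft @ text_protos, dim=-1)
    return guided_img @ guided_text.t()   # (N, C) classification logits
```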
arXiv Detail & Related papers (2024-12-16T02:03:45Z) - Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model designed for tasks that involve multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning data, tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
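One way such an allocation could work, sketched purely as an assumption rather than Leopard's actual encoding module, is to share a fixed visual-token budget across images in proportion to their native patch counts.

```python
# Assumed illustration: proportionally share a visual-token budget across images.
def allocate_visual_tokens(image_sizes, total_budget=4096, patch=14):
    """image_sizes: list of (height, width); returns a token count per image."""
    native = [(h // patch) * (w // patch) for h, w in image_sizes]
    total = sum(native)
    if total <= total_budget:
        return native                    # everything fits at native resolution
    scale = total_budget / total         # shrink every image's share uniformly
    return [max(1, int(n * scale)) for n in native]

# e.g. three text-rich pages of different resolutions
print(allocate_visual_tokens([(1344, 1008), (896, 672), (448, 448)]))
```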
arXiv Detail & Related papers (2024-10-02T16:55:01Z) - VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z) - SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data [73.23388142296535]
SELMA improves the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets.
We show that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks.
We also show that fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data.
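The "merging skill-specific experts" part of the title can be pictured as a weighted average of per-skill weight deltas (e.g. LoRA updates) folded into one model; the uniform weighting and dictionary layout below are assumptions, not SELMA's exact recipe.

```python
# Assumed sketch: merge skill-specific expert deltas by weighted averaging.
def merge_experts(expert_state_dicts, weights=None):
    """expert_state_dicts: list of {param_name: tensor} deltas, one per skill."""
    n = len(expert_state_dicts)
    weights = weights if weights is not None else [1.0 / n] * n
    merged = {}
    for name in expert_state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, expert_state_dicts))
    return merged
```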
arXiv Detail & Related papers (2024-03-11T17:35:33Z) - TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training [5.239585892767183]
We propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training.
Our proposed method outperforms existing state-of-the-art methods on both text-guided generation and manipulation tasks.
arXiv Detail & Related papers (2023-09-21T09:34:20Z) - TaleCrafter: Interactive Story Visualization with Multiple Characters [49.14122401339003]
This paper proposes a system for generic interactive story visualization.
It is capable of handling multiple novel characters and supporting the editing of layout and local structure.
The system comprises four interconnected components: story-to-prompt generation (S2P), text-to-layout generation (T2L), controllable text-to-image generation (C-T2I), and image-to-video animation (I2V).
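The four stages chain naturally into a pipeline; the sketch below treats each one as a placeholder callable just to show the assumed data flow from story text to animated clips.

```python
# Illustrative data flow only: every stage here is a placeholder callable.
def visualize_story(story, s2p, t2l, c_t2i, i2v):
    clips = []
    for prompt in s2p(story):          # story-to-prompt: one prompt per scene
        layout = t2l(prompt)           # text-to-layout: character/object boxes
        image = c_t2i(prompt, layout)  # layout-controlled text-to-image generation
        clips.append(i2v(image))       # image-to-video: animate the still frame
    return clips
```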
arXiv Detail & Related papers (2023-05-29T17:11:39Z) - Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS leverages vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms previous state-of-the-art methods without any post-processing.
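A hedged sketch of what text-to-pixel alignment can look like: each pixel embedding from the decoder is scored against the sentence embedding, and the ground-truth region supervises which pixels should align; the loss form and tensor shapes are assumptions, not CRIS's exact objective.

```python
# Assumed sketch of text-to-pixel alignment supervision.
import torch
import torch.nn.functional as F

def text_to_pixel_loss(pixel_feats, text_feat, gt_mask, temperature=0.07):
    """
    pixel_feats: (B, D, H, W) per-pixel embeddings from the vision-language decoder
    text_feat:   (B, D)       sentence embedding of the referring expression
    gt_mask:     (B, H, W)    1 inside the referred region, 0 outside
    """
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_feat = F.normalize(text_feat, dim=1)
    logits = torch.einsum("bdhw,bd->bhw", pixel_feats, text_feat) / temperature
    # Pixels inside the region are pulled toward the text; others are pushed away.
    return F.binary_cross_entropy_with_logits(logits, gt_mask.float())
```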
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text-to-image generation (T2I) model aims to generate photo-realistic images that are semantically consistent with the text descriptions.
We propose a novel framework, Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z) - TIME: Text and Image Mutual-Translation Adversarial Networks [55.1298552773457]
We propose Text and Image Mutual-Translation Adversarial Networks (TIME).
TIME learns a T2I generator G and an image captioning discriminator D under the Generative Adversarial Network framework.
In experiments, TIME achieves state-of-the-art (SOTA) performance on the CUB and MS-COCO datasets.
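One adversarial step of this mutual-translation setup might be sketched as follows, with the captioning loss passed in as a callable; the hinge form and loss composition are assumptions for illustration rather than the published TIME objective.

```python
# Assumed sketch of one adversarial step with a captioning discriminator.
import torch

def mutual_translation_step(G, D, images, captions, opt_g, opt_d, caption_loss):
    """D returns (real/fake score, caption logits); caption_loss is e.g. token-level CE."""
    fake = G(captions)                                   # text -> image
    # Discriminator: hinge loss on real/fake plus captioning of real images.
    d_real, cap_real = D(images)
    d_fake, _ = D(fake.detach())
    d_loss = (torch.relu(1.0 - d_real).mean()
              + torch.relu(1.0 + d_fake).mean()
              + caption_loss(cap_real, captions))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: fool D and keep the generated image captionable as the input text.
    g_adv, cap_fake = D(fake)
    g_loss = -g_adv.mean() + caption_loss(cap_fake, captions)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.detach(), g_loss.detach()
```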
arXiv Detail & Related papers (2020-05-27T06:40:12Z)