CoRe: Context-Regularized Text Embedding Learning for Text-to-Image Personalization
- URL: http://arxiv.org/abs/2408.15914v1
- Date: Wed, 28 Aug 2024 16:27:58 GMT
- Title: CoRe: Context-Regularized Text Embedding Learning for Text-to-Image Personalization
- Authors: Feize Wu, Yun Pang, Junyi Zhang, Lianyu Pang, Jian Yin, Baoquan Zhao, Qing Li, Xudong Mao
- Abstract summary: We introduce Context Regularization (CoRe), which enhances the learning of the new concept's text embedding by regularizing its context tokens in the prompt.
CoRe can be applied to arbitrary prompts without requiring the generation of corresponding images.
Comprehensive experiments demonstrate that our method outperforms several baseline methods in both identity preservation and text alignment.
- Score: 14.01847471143144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in text-to-image personalization have enabled high-quality and controllable image synthesis for user-provided concepts. However, existing methods still struggle to balance identity preservation with text alignment. Our approach is based on the fact that generating prompt-aligned images requires a precise semantic understanding of the prompt, which involves accurately processing the interactions between the new concept and its surrounding context tokens within the CLIP text encoder. To address this, we aim to embed the new concept properly into the input embedding space of the text encoder, allowing for seamless integration with existing tokens. We introduce Context Regularization (CoRe), which enhances the learning of the new concept's text embedding by regularizing its context tokens in the prompt. This is based on the insight that appropriate output vectors of the text encoder for the context tokens can only be achieved if the new concept's text embedding is correctly learned. CoRe can be applied to arbitrary prompts without requiring the generation of corresponding images, thus improving the generalization of the learned text embedding. Additionally, CoRe can serve as a test-time optimization technique to further enhance the generations for specific prompts. Comprehensive experiments demonstrate that our method outperforms several baseline methods in both identity preservation and text alignment. Code will be made publicly available.
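The abstract's central idea, learning the new concept's input embedding so that the text encoder still produces appropriate outputs for the surrounding context tokens, can be illustrated with a small sketch. The code below is a minimal, hypothetical rendering of such a context-regularization loss using Hugging Face's CLIP text encoder; the placeholder token <new>, the super-category word "dog", the prompt template, the MSE penalty, and the loss weight are illustrative assumptions, not the authors' exact formulation.

```python
# Hedged sketch of a CoRe-style context-regularization loss (not the authors' code).
# Assumes a Textual Inversion-style setup: a placeholder token "<new>" with a
# trainable embedding row, and a single-token super-category word as reference.
import torch
import torch.nn.functional as F
from transformers import CLIPTokenizer, CLIPTextModel

MODEL = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(MODEL)
text_encoder = CLIPTextModel.from_pretrained(MODEL)

# Register the placeholder token and give it an embedding row (assumed setup).
tokenizer.add_tokens("<new>")
text_encoder.resize_token_embeddings(len(tokenizer))
new_token_id = tokenizer.convert_tokens_to_ids("<new>")

def context_regularization_loss(prompt_template: str, superclass: str = "dog") -> torch.Tensor:
    """Penalize how much the new concept's embedding distorts the text-encoder
    outputs at the context-token positions, relative to a reference prompt in
    which the concept is replaced by its super-category word."""
    prompt_new = prompt_template.format("<new>")   # e.g. "a photo of <new> on the beach"
    prompt_ref = prompt_template.format(superclass)

    # Positions align only if "<new>" and the superclass word each map to one token.
    ids_new = tokenizer(prompt_new, padding="max_length", truncation=True,
                        return_tensors="pt").input_ids
    ids_ref = tokenizer(prompt_ref, padding="max_length", truncation=True,
                        return_tensors="pt").input_ids

    out_new = text_encoder(ids_new).last_hidden_state   # [1, 77, 768]
    with torch.no_grad():                                # reference branch carries no gradient
        out_ref = text_encoder(ids_ref).last_hidden_state

    # Regularize only the context positions; the concept slot itself is free to learn.
    context_mask = (ids_new != new_token_id).squeeze(0)
    return F.mse_loss(out_new[0, context_mask], out_ref[0, context_mask])

# Example usage (illustrative weight): add the regularizer to the usual
# diffusion reconstruction loss during embedding optimization.
# loss = diffusion_loss + 0.1 * context_regularization_loss("a photo of {} on the beach")
```

Because the regularizer only needs text-encoder outputs, such a loss can be evaluated on arbitrary prompts without generating images, which matches the abstract's claim about improved generalization of the learned embedding.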
Related papers
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z) - CoAPT: Context Attribute words for Prompt Tuning [5.811993982861212]
We propose a novel prompt tuning method called CoAPT for few/zero-shot image classification.
The core motivation is that attributes are descriptive words with rich information about a given concept.
CoAPT integrates words as additional prompts within learnable prompt tuning and can be easily incorporated into various existing prompt tuning methods.
arXiv Detail & Related papers (2024-07-18T08:58:01Z) - Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace [52.24866347353916]
We propose an efficient method to explore the target embedding in a textual subspace.
We also propose an efficient selection strategy for determining the basis of the textual subspace.
Our method opens the door to more efficient representation learning for personalized text-to-image generation.
arXiv Detail & Related papers (2024-06-30T06:41:21Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation [59.44301617306483]
We propose a learning-based encoder for fast and accurate customized text-to-image generation.
Our method enables high-fidelity inversion and more robust editability with a significantly faster encoding process.
arXiv Detail & Related papers (2023-02-27T14:49:53Z) - Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy.
We propose a novel text-to-image method that addresses these gaps by enabling a simple control mechanism complementary to text in the form of a scene.
Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high-fidelity images at a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS uses vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text-to-image generation (T2I) model aims to generate photo-realistic images that are semantically consistent with their text descriptions.
We propose a novel framework, Semantic-Spatial Aware GAN, which is trained end-to-end so that the text encoder can better exploit the text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.