Controllable Textual Inversion for Personalized Text-to-Image Generation
- URL: http://arxiv.org/abs/2304.05265v3
- Date: Sun, 24 Sep 2023 16:42:50 GMT
- Title: Controllable Textual Inversion for Personalized Text-to-Image Generation
- Authors: Jianan Yang, Haobo Wang, Yanming Zhang, Ruixuan Xiao, Sai Wu, Gang
Chen, Junbo Zhao
- Abstract summary: Textual inversion (TI) is proposed as an effective technique for personalizing generation when prompts contain user-defined, unseen, or long-tail concept tokens.
In this work, we propose a much-enhanced version of TI, dubbed Controllable Textual Inversion (COTI), which resolves all the aforementioned problems and in turn delivers a robust, data-efficient, and easy-to-use framework.
- Score: 24.18758951295929
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent large-scale generative modeling has attained unprecedented
performance especially in producing high-fidelity images driven by text
prompts. Textual inversion (TI), alongside the text-to-image model backbones, is
proposed as an effective technique for personalizing generation when the
prompts contain user-defined, unseen, or long-tail concept tokens. Despite that,
we find and show that the deployment of TI remains full of "dark-magics" -- to
name a few, the harsh requirement of additional datasets, arduous human efforts
in the loop, and a lack of robustness. In this work, we propose a much-enhanced
version of TI, dubbed Controllable Textual Inversion (COTI), which resolves all
the aforementioned problems and in turn delivers a robust, data-efficient, and
easy-to-use framework. The core of COTI is a theoretically-guided loss
objective instantiated with a comprehensive and novel weighted scoring
mechanism, encapsulated by an active-learning paradigm. The extensive results
show that COTI significantly outperforms the prior TI-related approaches with a
26.05 decrease in the FID score and a 23.00% boost in the R-precision.
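The abstract describes COTI only at a high level, so the following is a minimal, hypothetical sketch of the general recipe it names: an active-learning loop that ranks a candidate pool with a weighted score before optimizing the concept-token embedding. The scorers, the placeholder loss, and all names below are illustrative stand-ins, not the paper's actual objective.

```python
import torch

def weighted_score(aesthetic, relevance, w=0.5):
    """Combine per-image quality and concept-relevance scores.
    This simple weighting is a stand-in for COTI's scoring mechanism."""
    return w * aesthetic + (1.0 - w) * relevance

def select_candidates(pool, embed, scorer_a, scorer_r, k=8):
    """Active-learning step: rank the unlabeled pool and keep the top-k images."""
    scores = torch.stack([
        weighted_score(scorer_a(x), scorer_r(x, embed)) for x in pool
    ])
    top = torch.topk(scores, k).indices
    return [pool[i] for i in top]

# --- toy stand-ins so the sketch runs end to end -------------------------
torch.manual_seed(0)
embed = torch.randn(768, requires_grad=True)            # learnable concept-token embedding
pool = [torch.randn(3, 64, 64) for _ in range(32)]      # candidate images
scorer_a = lambda x: x.mean()                            # fake aesthetic scorer
scorer_r = lambda x, e: -torch.norm(x.mean() - e.mean()) # fake relevance scorer

opt = torch.optim.Adam([embed], lr=1e-2)
for round_ in range(3):                                  # a few active-learning rounds
    batch = select_candidates(pool, embed, scorer_a, scorer_r)
    for img in batch:
        opt.zero_grad()
        # placeholder reconstruction-style loss; in the real setting this would be
        # the diffusion model's denoising objective conditioned on the learned token
        loss = (img.mean() - embed.mean()) ** 2
        loss.backward()
        opt.step()
    print(f"round {round_}: loss={loss.item():.4f}")
```

In the actual method, the scoring terms and the loss come from the paper's weighted scoring mechanism and theoretically-guided objective; the loop above only illustrates how scoring, selection, and embedding updates fit together.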
Related papers
- RAPID: Efficient Retrieval-Augmented Long Text Generation with Writing Planning and Information Discovery [69.41989381702858]
Existing methods, such as direct generation and multi-agent discussion, often struggle with issues like hallucinations, topic incoherence, and significant latency.
We propose RAPID, an efficient retrieval-augmented long text generation framework.
Our work provides a robust and efficient solution to the challenges of automated long-text generation.
arXiv Detail & Related papers (2025-03-02T06:11:29Z) - TIPO: Text to Image with Text Presampling for Prompt Optimization [17.312386194139652]
TIPO (Text-to-Image Prompt Optimization) introduces an efficient approach for automatic prompt refinement in text-to-image (T2I) generation.
Starting from simple user prompts, TIPO leverages a lightweight pre-trained model to expand these prompts into richer, detailed versions.
arXiv Detail & Related papers (2024-11-12T19:09:45Z) - Adversarial Robustification via Text-to-Image Diffusion Models [56.37291240867549]
Adversarial robustness has conventionally been believed to be a challenging property to encode into neural networks.
We develop a scalable and model-agnostic solution to achieve adversarial robustness without using any data.
arXiv Detail & Related papers (2024-07-26T10:49:14Z) - Refining Joint Text and Source Code Embeddings for Retrieval Task with Parameter-Efficient Fine-Tuning [0.0]
We propose a fine-tuning framework that leverages Parameter-Efficient Fine-Tuning (PEFT) techniques.
We demonstrate that the proposed fine-tuning framework has the potential to improve code-text retrieval performance by tuning at most 0.4% of the parameters.
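The entry does not say which PEFT technique yields the roughly 0.4% trainable-parameter budget; a LoRA-style adapter is one common way to reach that regime, sketched below on a toy linear layer purely for illustration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a small trainable low-rank update,
    a typical parameter-efficient fine-tuning (PEFT) building block."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pretrained weights
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")   # ~1% for this toy layer
```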
arXiv Detail & Related papers (2024-05-07T08:50:25Z) - When Parameter-efficient Tuning Meets General-purpose Vision-language
Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z) - Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion
Models [58.46926334842161]
This work illuminates the fundamental reasons for such misalignment, pinpointing issues related to low attention activation scores and mask overlaps.
We propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores.
Our method diverges from conventional test-time-adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability.
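Only the intent of the two objectives is stated here (shrink the overlap between object attention masks, raise attention activation), so the toy formulation below is an assumption about what such losses could look like; the paper's exact definitions may differ, and the attention maps are random stand-ins.

```python
import torch

def separate_loss(attn_a, attn_b):
    """Penalize overlap between two objects' cross-attention maps."""
    return (attn_a * attn_b).sum() / attn_a.numel()

def enhance_loss(attn):
    """Encourage each object's peak attention activation to be high."""
    return 1.0 - attn.max()

# random stand-ins for per-token cross-attention maps (H x W, values in [0, 1])
attn_cat = torch.rand(16, 16)
attn_dog = torch.rand(16, 16)

loss = separate_loss(attn_cat, attn_dog) + 0.5 * (enhance_loss(attn_cat) + enhance_loss(attn_dog))
print(loss.item())
```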
arXiv Detail & Related papers (2023-12-10T22:07:42Z) - Contrastive Transformer Learning with Proximity Data Generation for
Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - DONUT-hole: DONUT Sparsification by Harnessing Knowledge and Optimizing
Learning Efficiency [5.006064616335817]
This paper introduces DONUT-hole, a sparse OCR-free visual document understanding (VDU) model that addresses the limitations of its predecessor model, dubbed DONUT.
Our paradigm for producing DONUT-hole reduces the model density by 54% while preserving performance.
arXiv Detail & Related papers (2023-11-09T22:49:05Z) - CoT-BERT: Enhancing Unsupervised Sentence Representation through Chain-of-Thought [3.0566617373924325]
This paper presents CoT-BERT, an innovative method that harnesses the progressive thinking of Chain-of-Thought reasoning.
We develop an advanced contrastive learning loss function and propose a novel template denoising strategy.
arXiv Detail & Related papers (2023-09-20T08:42:06Z) - Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z) - StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale
Text-to-Image Synthesis [54.39789900854696]
StyleGAN-T addresses the specific requirements of large-scale text-to-image synthesis.
It significantly improves over previous GANs and outperforms distilled diffusion models in terms of sample quality and speed.
arXiv Detail & Related papers (2023-01-23T16:05:45Z) - Hierarchical and Efficient Learning for Person Re-Identification [19.172946887940874]
We propose a novel Hierarchical and Efficient Network (HENet) that learns hierarchical global, partial, and recovery features ensemble under the supervision of multiple loss combinations.
We also propose a new dataset augmentation approach, dubbed Random Polygon Erasing (RPE), which randomly erases an irregular area of the input image to imitate missing body parts.
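The summary gives enough to sketch RPE's idea: cut a randomly placed irregular polygon out of the image to mimic missing body parts. The polygon sampling below is a hypothetical reading of "irregular area"; the paper's exact procedure may differ.

```python
import numpy as np
from PIL import Image, ImageDraw

def random_polygon_erase(img: np.ndarray, n_vertices: int = 6, max_frac: float = 0.3) -> np.ndarray:
    """Erase (zero out) a random irregular polygon region to imitate missing body parts."""
    h, w = img.shape[:2]
    # pick a random center and scatter vertices around it at varying radii
    cx, cy = np.random.randint(0, w), np.random.randint(0, h)
    radius = max_frac * min(h, w) / 2
    angles = np.sort(np.random.uniform(0, 2 * np.pi, n_vertices))
    pts = [(int(cx + radius * np.random.uniform(0.3, 1.0) * np.cos(a)),
            int(cy + radius * np.random.uniform(0.3, 1.0) * np.sin(a))) for a in angles]
    mask = Image.new("L", (w, h), 0)
    ImageDraw.Draw(mask).polygon(pts, fill=255)
    mask = np.array(mask, dtype=bool)
    out = img.copy()
    out[mask] = 0
    return out

# toy usage on a random "person crop"
img = (np.random.rand(128, 64, 3) * 255).astype(np.uint8)
erased = random_polygon_erase(img)
print(erased.shape, (erased == 0).mean())
```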
arXiv Detail & Related papers (2020-05-18T15:45:25Z)