YoChameleon: Personalized Vision and Language Generation
- URL: http://arxiv.org/abs/2504.20998v1
- Date: Tue, 29 Apr 2025 17:59:57 GMT
- Title: YoChameleon: Personalized Vision and Language Generation
- Authors: Thao Nguyen, Krishna Kumar Singh, Jing Shi, Trung Bui, Yong Jae Lee, Yuheng Li
- Abstract summary: Yo'Chameleon is the first attempt to study personalization for large multimodal models. It embeds subject-specific information to answer questions about the subject and recreates pixel-level details to produce images of the subject in new contexts. It is trained with (i) a self-prompting optimization mechanism to balance performance across multiple modalities, and (ii) a "soft-positive" image generation approach to enhance image quality in a few-shot setting.
- Score: 54.11098551685136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Multimodal Models (e.g., GPT-4, Gemini, Chameleon) have evolved into powerful tools with millions of users. However, they remain generic models and lack personalized knowledge of specific user concepts. Previous work has explored personalization for text generation, yet it remains unclear how these methods can be adapted to new modalities, such as image generation. In this paper, we introduce Yo'Chameleon, the first attempt to study personalization for large multimodal models. Given 3-5 images of a particular concept, Yo'Chameleon leverages soft-prompt tuning to embed subject-specific information to (i) answer questions about the subject and (ii) recreate pixel-level details to produce images of the subject in new contexts. Yo'Chameleon is trained with (i) a self-prompting optimization mechanism to balance performance across multiple modalities, and (ii) a "soft-positive" image generation approach to enhance image quality in a few-shot setting.
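For intuition, here is a minimal sketch of the soft-prompt tuning mechanism the abstract describes: a frozen multimodal backbone augmented with a small set of learnable prompt embeddings that encode the subject. The class, hyperparameters, and the Hugging Face-style `inputs_embeds` call are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of soft-prompt tuning for personalization: all weights of the
# base model stay frozen; only a handful of prompt embeddings are trained
# on the 3-5 subject images. Names and sizes here are assumptions.
import torch
import torch.nn as nn

class SoftPromptedModel(nn.Module):
    def __init__(self, base_model: nn.Module, embed_dim: int, n_prompt_tokens: int = 16):
        super().__init__()
        self.base_model = base_model
        # Freeze the backbone; only the soft prompt will receive gradients.
        for p in self.base_model.parameters():
            p.requires_grad = False
        # Learnable embeddings that stand in for the personalized concept.
        self.soft_prompt = nn.Parameter(torch.randn(n_prompt_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor):
        # Prepend the learned prompt to every sequence in the batch, so the
        # frozen model conditions on the subject for both question answering
        # and image generation.
        batch = input_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.base_model(inputs_embeds=torch.cat([prompt, input_embeds], dim=1))

# Usage sketch: optimize only the prompt embeddings.
# model = SoftPromptedModel(frozen_lmm, embed_dim=4096)
# opt = torch.optim.AdamW([model.soft_prompt], lr=1e-3)
```

Because gradients flow only into `soft_prompt`, a few subject images can suffice to specialize the model while its pretrained weights remain untouched.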
Related papers
- Bringing Characters to New Stories: Training-Free Theme-Specific Image Generation via Dynamic Visual Prompting [71.29100512700064]
We present T-Prompter, a training-free method for theme-specific image generation. T-Prompter integrates reference images into generative models, allowing users to seamlessly specify the target theme. Our approach enables consistent story generation, character design, realistic character generation, and style-guided image generation.
arXiv Detail & Related papers (2025-01-26T19:01:19Z) - Personalized Image Generation with Large Multimodal Models [47.289887243367055]
We propose a Personalized Image Generation Framework named Pigeon to capture users' visual preferences and needs from noisy user history and multimodal instructions. We apply Pigeon to personalized sticker and movie poster generation, where extensive quantitative results and human evaluation highlight its superiority over various generative baselines.
arXiv Detail & Related papers (2024-10-18T04:20:46Z) - PersonificationNet: Making customized subject act like a person [39.359589723267696]
We propose PersonificationNet, which can control a specified subject, such as a cartoon character or plush toy, to adopt the same pose as a person in a given reference image.
Specifically, first, the customized branch mimics the specified subject's appearance. Second, the pose-condition branch transfers body-structure information from the human to varied instances. Last, the structure alignment module bridges the structural gap between the human and the specified subject at inference time.
arXiv Detail & Related papers (2024-07-12T07:27:07Z) - Chameleon: Mixed-Modal Early-Fusion Foundation Models [0.0]
We present a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation.
arXiv Detail & Related papers (2024-05-16T05:23:41Z) - Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Consistently portraying the same subject across diverse prompts remains challenging for text-to-image models.
Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects.
We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
arXiv Detail & Related papers (2024-02-05T18:42:34Z) - Instruct-Imagen: Image Generation with Multi-modal Instruction [90.04481955523514]
instruct-imagen is a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks.
We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision.
Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain.
arXiv Detail & Related papers (2024-01-03T19:31:58Z) - User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning [35.211749514733846]
Traditional image captioning methods often overlook the preferences and characteristics of users.
Most existing methods emphasize the user-context fusion process via memory networks or transformers.
We propose a novel personalized image captioning framework that leverages user context to consider personality factors.
arXiv Detail & Related papers (2023-12-08T02:08:00Z) - DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation [26.748667878221568]
We present a new approach for "personalization" of text-to-image models.
Given just a few images of a subject, we fine-tune a pretrained text-to-image model to bind a unique identifier with that specific subject.
The unique identifier can then be used to synthesize fully novel photorealistic images of the subject contextualized in different scenes.
arXiv Detail & Related papers (2022-08-25T17:45:49Z) - On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal, including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)