MC-LLaVA: Multi-Concept Personalized Vision-Language Model
- URL: http://arxiv.org/abs/2503.18854v2
- Date: Tue, 25 Mar 2025 13:50:20 GMT
- Title: MC-LLaVA: Multi-Concept Personalized Vision-Language Model
- Authors: Ruichuan An, Sihan Yang, Ming Lu, Renrui Zhang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, Shanghang Zhang, Wentao Zhang
- Abstract summary: This paper proposes the first multi-concept personalization paradigm, MC-LLaVA. MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. Comprehensive qualitative and quantitative experiments demonstrate that MC-LLaVA can achieve impressive multi-concept personalized responses.
- Score: 51.645660375766575
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies investigate VLM personalization to understand user-provided concepts. However, they mainly focus on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes the first multi-concept personalization paradigm, MC-LLaVA. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the costs related to joint training, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location confidence maps for enhanced recognition and grounding capabilities. To advance multi-concept personalization research, we further contribute a high-quality instruction tuning dataset. We carefully collect images with multiple characters and objects from movies and manually generate question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive qualitative and quantitative experiments demonstrate that MC-LLaVA can achieve impressive multi-concept personalized responses, paving the way for VLMs to become better user-specific assistants. The code and dataset will be publicly available at https://github.com/arctanxarc/MC-LLaVA.
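The abstract's "personalized textual prompt" refers to initializing each concept token from visual-token information before joint tuning. A minimal, hedged sketch of how such an initialization could look is given below; the mean pooling, linear projection, and token naming are assumptions for illustration, not the authors' implementation.

```python
# Sketch only: seed learnable concept-token embeddings from visual features of each
# concept's reference images. The dimensions, projection, and pooling are assumptions;
# MC-LLaVA's actual initialization may differ.
import torch
import torch.nn as nn

class ConceptTokenInit(nn.Module):
    def __init__(self, vision_dim: int = 1024, text_dim: int = 4096):
        super().__init__()
        # Hypothetical projection from the vision-encoder space to the LLM embedding space.
        self.proj = nn.Linear(vision_dim, text_dim)

    @torch.no_grad()
    def forward(self, concept_visual_tokens: dict[str, torch.Tensor]) -> dict[str, nn.Parameter]:
        """Map each concept name (e.g. a placeholder like "<concept-1>") to an initial
        embedding, given visual tokens of shape (num_images, num_patches, vision_dim)."""
        concept_embeddings = {}
        for name, tokens in concept_visual_tokens.items():
            pooled = tokens.mean(dim=(0, 1))               # average over images and patches
            init = self.proj(pooled)                       # move into the text-embedding space
            concept_embeddings[name] = nn.Parameter(init)  # starting point for joint tuning
        return concept_embeddings
```

In such a setup, the returned parameters would seed new rows of the language model's embedding table, one per concept token, before the multi-concept instruction tuning stage; the abstract credits this visual-token initialization with reducing the cost of joint training.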
Related papers
- ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning [40.70596166863986]
Multi-Concept Video Customization (MCVC) remains a significant challenge.
We introduce ConceptMaster, an innovative framework that effectively tackles the issues of identity decoupling while maintaining concept fidelity in customized videos.
Specifically, we introduce a novel strategy of learning decoupled multi-concept embeddings that are injected into the diffusion models in a standalone manner.
arXiv Detail & Related papers (2025-01-08T18:59:01Z)
- MC-LLaVA: Multi-Concept Personalized Vision-Language Model [44.325777035345695]
Current vision-language models (VLMs) show exceptional abilities across diverse tasks, including visual question answering. We propose the first multi-concept personalization method, MC-LLaVA, along with a high-quality multi-concept personalization dataset. We conduct comprehensive qualitative and quantitative experiments to demonstrate that MC-LLaVA can achieve impressive multi-concept personalized responses.
arXiv Detail & Related papers (2024-11-18T16:33:52Z)
- FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition [49.2208591663092]
FreeCustom is a tuning-free method to generate customized images of multi-concept composition based on reference concepts.
We introduce a new multi-reference self-attention (MRSA) mechanism and a weighted mask strategy.
Our method outperforms or performs on par with other training-based methods in terms of multi-concept composition and single-concept customization.
arXiv Detail & Related papers (2024-05-22T17:53:38Z)
- MC$^2$: Multi-concept Guidance for Customized Multi-concept Generation [59.00909718832648]
We propose MC$^2$, a novel approach for multi-concept customization. By adaptively refining attention weights between visual and textual tokens, our method ensures that image regions accurately correspond to their associated concepts. Experiments demonstrate that MC$^2$ outperforms training-based methods in terms of prompt-reference alignment.
arXiv Detail & Related papers (2024-04-08T07:59:04Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We present the Draw-and-Understand framework, exploring how to integrate visual prompt understanding capabilities into Multimodal Large Language Models (MLLMs).
Visual prompts allow users to interact through multi-modal instructions, enhancing the models' interactivity and fine-grained image comprehension.
In this framework, we propose a general architecture adaptable to different pre-trained MLLMs, enabling it to recognize various types of visual prompts.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- MyVLM: Personalizing VLMs for User-Specific Queries [78.33252556805931]
We take a first step toward the personalization of vision-language models, enabling them to learn and reason over user-provided concepts.
To effectively recognize a variety of user-specific concepts, we augment the VLM with external concept heads that function as toggles for the model.
Having recognized the concept, we learn a new concept embedding in the intermediate feature space of the VLM.
This embedding is tasked with guiding the language model to naturally integrate the target concept in its generated response.
arXiv Detail & Related papers (2024-03-21T17:51:01Z)
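The concept-head idea summarized in the MyVLM entry can be pictured with a short, hypothetical sketch (not the paper's released code): a small per-concept classifier over frozen image features acts as a toggle, and a learned concept embedding is handed to the language model only when that classifier fires.

```python
# Illustrative assumption, not MyVLM's implementation: a per-concept "head" that
# detects the concept and exposes a learned embedding for the language model to use.
import torch
import torch.nn as nn

class ConceptHead(nn.Module):
    def __init__(self, feat_dim: int = 768, embed_dim: int = 4096, threshold: float = 0.5):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, 1)  # "is my concept in this image?"
        self.concept_embedding = nn.Parameter(torch.randn(embed_dim) * 0.02)
        self.threshold = threshold

    def forward(self, image_features: torch.Tensor):
        """image_features: (batch, feat_dim) pooled features from a frozen vision encoder."""
        score = torch.sigmoid(self.classifier(image_features)).squeeze(-1)
        present = score > self.threshold
        # If present, the embedding would be injected into the VLM's intermediate features,
        # guiding the language model to integrate the concept naturally in its response.
        return present, self.concept_embedding
```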