Abstract: Recent advances in vision-language pre-training have improved multimodal
discrimination (e.g., image-text semantic alignment) and image synthesis (e.g.,
text-to-image generation). Meanwhile, fine-tuning pre-trained discriminative or
generative models such as CLIP and Stable Diffusion on domain-specific datasets
has proven effective for adapting them to particular domains and tasks.
However, few studies have explored jointly learning discriminative and generative
capabilities during fine-tuning and exploiting their synergy to build a powerful,
personalized multimodal model. This paper presents UniDiff, a unified multimodal
model that integrates image-text contrastive learning (ITC), text-conditioned
image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
By applying RSC to the visual features of CLIP and the diffusion model, UniDiff
learns aligned semantics and mitigates semantic collapse during fine-tuning on
small datasets, without altering the pre-trained models' architectures. UniDiff
demonstrates versatility in both multimodal understanding and generative tasks.
Experimental results on three datasets (Fashion-man, Fashion-woman, and
E-commercial Product) show substantial improvements in vision-language retrieval
and text-to-image generation, illustrating the advantages of combining
discriminative and generative fine-tuning. The proposed UniDiff model establishes
a robust pipeline for personalized modeling and serves as a benchmark for future
comparisons in the field.
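
As an illustrative sketch (not stated explicitly in the abstract), the three
components suggest a combined fine-tuning objective of the form below; the
weighting coefficients $\lambda_{\mathrm{IS}}$ and $\lambda_{\mathrm{RSC}}$ are
hypothetical placeholders rather than values from the paper:
\[
  \mathcal{L}_{\mathrm{UniDiff}}
    = \mathcal{L}_{\mathrm{ITC}}
    + \lambda_{\mathrm{IS}}\,\mathcal{L}_{\mathrm{IS}}
    + \lambda_{\mathrm{RSC}}\,\mathcal{L}_{\mathrm{RSC}}.
\]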