UniDiff: Advancing Vision-Language Models with Generative and
Discriminative Learning
- URL: http://arxiv.org/abs/2306.00813v1
- Date: Thu, 1 Jun 2023 15:39:38 GMT
- Title: UniDiff: Advancing Vision-Language Models with Generative and
Discriminative Learning
- Authors: Xiao Dong, Runhui Huang, Xiaoyong Wei, Zequn Jie, Jianxing Yu, Jian
Yin, Xiaodan Liang
- Abstract summary: This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
- Score: 86.91893533388628
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in vision-language pre-training have enabled machines to
perform better in multimodal object discrimination (e.g., image-text semantic
alignment) and image synthesis (e.g., text-to-image generation). On the other
hand, fine-tuning pre-trained models with discriminative or generative
capabilities, such as CLIP and Stable Diffusion, on domain-specific datasets has
been shown to be effective in various tasks by adapting to specific domains.
However, few studies have explored the possibility of learning both
discriminative and generative capabilities and leveraging their synergistic
effects to create a powerful and personalized multimodal model during
fine-tuning. This paper presents UniDiff, a unified multi-modal model that
integrates image-text contrastive learning (ITC), text-conditioned image
synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff effectively learns aligned semantics and mitigates the issue of
semantic collapse during fine-tuning on small datasets by leveraging RSC on
visual features from CLIP and diffusion models, without altering the
pre-trained model's basic architecture. UniDiff demonstrates versatility in
both multi-modal understanding and generative tasks. Experimental results on
three datasets (Fashion-man, Fashion-woman, and E-commercial Product) showcase
substantial enhancements in vision-language retrieval and text-to-image
generation, illustrating the advantages of combining discriminative and
generative fine-tuning. The proposed UniDiff model establishes a robust
pipeline for personalized modeling and serves as a benchmark for future
comparisons in the field.
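To make the three objectives in the abstract concrete, the sketch below combines a CLIP-style contrastive (ITC) loss, a diffusion-style noise-prediction (IS) loss, and a reciprocal semantic consistency (RSC) term into one fine-tuning objective. This is a minimal, hypothetical PyTorch-style sketch: the cosine formulation of RSC, the loss weights, and all function names are assumptions for illustration, not the authors' implementation.
```python
# Hypothetical sketch of a combined ITC + IS + RSC objective.
# Weights, projections, and the RSC formulation are illustrative assumptions.
import torch
import torch.nn.functional as F


def itc_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric image-text contrastive (CLIP-style) loss."""
    img_feats = F.normalize(img_feats, dim=-1)
    txt_feats = F.normalize(txt_feats, dim=-1)
    logits = img_feats @ txt_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def is_loss(noise_pred, noise_true):
    """Diffusion-style image-synthesis loss: predict the added noise."""
    return F.mse_loss(noise_pred, noise_true)


def rsc_loss(clip_img_feats, diff_img_feats):
    """Reciprocal semantic consistency (assumed cosine form): pull together
    visual features from the discriminative (CLIP) and generative
    (diffusion) branches."""
    clip_img_feats = F.normalize(clip_img_feats, dim=-1)
    diff_img_feats = F.normalize(diff_img_feats, dim=-1)
    return 1.0 - (clip_img_feats * diff_img_feats).sum(dim=-1).mean()


def unidiff_style_objective(img_feats, txt_feats, noise_pred, noise_true,
                            diff_img_feats, w_is=1.0, w_rsc=0.1):
    """Weighted sum of the three terms; the weights are placeholders."""
    return (itc_loss(img_feats, txt_feats)
            + w_is * is_loss(noise_pred, noise_true)
            + w_rsc * rsc_loss(img_feats, diff_img_feats))


if __name__ == "__main__":
    b, d = 8, 512
    loss = unidiff_style_objective(
        img_feats=torch.randn(b, d),           # CLIP image features
        txt_feats=torch.randn(b, d),           # CLIP text features
        noise_pred=torch.randn(b, 4, 32, 32),  # U-Net noise prediction
        noise_true=torch.randn(b, 4, 32, 32),  # ground-truth noise
        diff_img_feats=torch.randn(b, d),      # pooled diffusion features
    )
    print(loss.item())
```
In this reading, the RSC term acts as the bridge between the frozen architectures, which is consistent with the abstract's claim that semantic collapse is mitigated without altering the pre-trained models' basic structure.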
Related papers
- Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks.
Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval.
This paper proposes a unified approach that integrates the strengths of both paradigms.
arXiv Detail & Related papers (2024-11-01T01:51:31Z)
- OneDiff: A Generalist Model for Image Difference Captioning [5.71214984158106]
Image Difference Captioning (IDC) is crucial for accurately describing variations between closely related images.
OneDiff is a novel generalist approach that utilizes a robust vision-language model architecture.
OneDiff consistently outperforms existing state-of-the-art models in accuracy and adaptability.
arXiv Detail & Related papers (2024-07-08T06:14:37Z)
- Corpus Considerations for Annotator Modeling and Scaling [9.263562546969695]
We show that the commonly used user token model consistently outperforms more complex models.
Our findings shed light on the relationship between corpus statistics and annotator modeling performance.
arXiv Detail & Related papers (2024-04-02T22:27:24Z)
- Can Generative Models Improve Self-Supervised Representation Learning? [0.7999703756441756]
We introduce a novel framework that enriches the self-supervised learning paradigm by utilizing generative models to produce semantically consistent image augmentations.
Our results show that our framework significantly enhances the quality of learned visual representations, improving Top-1 accuracy by up to 10% on downstream tasks.
arXiv Detail & Related papers (2024-03-09T17:17:07Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z)
- Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto Generative Adversarial Nets (GANs).
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
- UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes [91.24112204588353]
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks.
In contrast to previous models, UViM has the same functional form for all tasks.
We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks.
arXiv Detail & Related papers (2022-05-20T17:47:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.