Related papers: DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception

DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception

URL: http://arxiv.org/abs/2405.15232v2
Date: Wed, 3 Jul 2024 17:30:53 GMT
Title: DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception
Authors: Run Luo, Yunshui Li, Longze Chen, Wanwei He, Ting-En Lin, Ziqiang Liu, Lei Zhang, Zikai Song, Xiaobo Xia, Tongliang Liu, Min Yang, Binyuan Hui,
Abstract summary: We propose DEEM, a simple and effective approach that utilizes the generative feedback of diffusion models to align the semantic distributions of the image encoder. DEEM exhibits enhanced robustness and a superior capacity to alleviate hallucinations while utilizing fewer trainable parameters, less pre-training data, and a smaller base model size.
Score: 66.88792390480343
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The development of large language models (LLMs) has significantly advanced the emergence of large multimodal models (LMMs). While LMMs have achieved tremendous success by promoting the synergy between multimodal comprehension and creation, they often face challenges when confronted with out-of-distribution data. This is primarily due to their reliance on image encoders trained to encode images into task-relevant features, which may lead them to disregard irrelevant details. Delving into the modeling capabilities of diffusion models for images naturally prompts the question: Can diffusion models serve as the eyes of large language models for image perception? In this paper, we propose DEEM, a simple and effective approach that utilizes the generative feedback of diffusion models to align the semantic distributions of the image encoder. This addresses the drawbacks of previous methods that solely relied on image encoders like ViT, thereby enhancing the model's resilience against out-of-distribution samples and reducing visual hallucinations. Importantly, this is achieved without requiring additional training modules and with fewer training parameters. We extensively evaluated DEEM on both our newly constructed RobustVQA benchmark and another well-known benchmark, POPE, for object hallucination. Compared to the state-of-the-art interleaved content generation models, DEEM exhibits enhanced robustness and a superior capacity to alleviate model hallucinations while utilizing fewer trainable parameters, less pre-training data (10%), and a smaller base model size.

Related papers

Dual Diffusion for Unified Image Generation and Understanding [32.7554623473768]
We propose a large-scale and fully end-to-end diffusion model for multi-modal understanding and generation. We leverage a cross-modal maximum likelihood estimation framework that simultaneously trains the conditional likelihoods of both images and text jointly. Our model attained competitive performance compared to recent unified image understanding and generation models.
arXiv Detail & Related papers (2024-12-31T05:49:00Z)
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling [64.09238330331195]
We propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework. Unlike discretization line of method, MMAR takes in continuous-valued image tokens to avoid information loss. We show that MMAR demonstrates much more superior performance than other joint multi-modal models.
arXiv Detail & Related papers (2024-10-14T17:57:18Z)
Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.06970466554273]
We present Meissonic, which non-autoregressive masked image modeling (MIM) text-to-image elevates to a level comparable with state-of-the-art diffusion models like SDXL. We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution. Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z)
Simplified and Generalized Masked Diffusion for Discrete Data [47.711583631408715]
Masked (or absorbing) diffusion is actively explored as an alternative to autoregressive models for generative modeling of discrete data. In this work, we aim to provide a simple and general framework that unlocks the full potential of masked diffusion models.
arXiv Detail & Related papers (2024-06-06T17:59:10Z)
Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics [54.08757792080732]
We propose integrating deep features from pre-trained visual models with a statistical analysis model to achieve opinion-unaware BIQA (OU-BIQA) Our proposed model exhibits superior consistency with human visual perception compared to state-of-the-art BIQA models.
arXiv Detail & Related papers (2024-05-29T06:09:34Z)
Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary. We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
Large-scale Reinforcement Learning for Diffusion Models [30.164571425479824]
Text-to-image diffusion models are susceptible to implicit biases that arise from web-scale text-image training pairs. We present an effective scalable algorithm to improve diffusion models using Reinforcement Learning (RL) We show how our approach substantially outperforms existing methods for aligning diffusion models with human preferences.
arXiv Detail & Related papers (2024-01-20T08:10:43Z)
Joint Adaptive Representations for Image-Language Learning [59.40890927221377]
We propose a recipe for image-language learning, which produces effective models, outperforming bigger and more expensive ones, often trained on orders of magnitude larger datasets. Our key finding is the joint learning of a compact vision and language representation, which adaptively and iteratively fuses the multi-modal features. With only 40M training examples and with 39 GFLOPs our lightweight model outperforms many times larger state-of-the-art models of 2-20x more FLOPs and using bigger datasets some of which with close to 1B training examples.
arXiv Detail & Related papers (2023-05-31T15:02:02Z)
Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost. Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z)
Stable Optimization for Large Vision Model Based Deep Image Prior in Cone-Beam CT Reconstruction [6.558735319783205]
Large Vision Model (LVM) has recently demonstrated great potential for medical imaging tasks. Deep Image Prior (DIP) effectively guides an untrained neural network to generate high-quality CBCT images without any training data. We propose a stable optimization method for the forward-model-free DIP model for sparse-view CBCT.
arXiv Detail & Related papers (2022-03-23T15:16:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.