FairCoT: Enhancing Fairness in Text-to-Image Generation via Chain of Thought Reasoning with Multimodal Large Language Models
- URL: http://arxiv.org/abs/2406.09070v3
- Date: Sun, 16 Feb 2025 19:55:25 GMT
- Title: FairCoT: Enhancing Fairness in Text-to-Image Generation via Chain of Thought Reasoning with Multimodal Large Language Models
- Authors: Zahraa Al Sahili, Ioannis Patras, Matthew Purver
- Abstract summary: We introduce FairCoT, a novel framework that enhances fairness in text-to-image models through Chain of Thought (CoT) reasoning. We show that FairCoT significantly enhances fairness and diversity without sacrificing image quality or semantic fidelity.
- Score: 14.632649933582648
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In the domain of text-to-image generative models, biases inherent in training datasets often propagate into generated content, posing significant ethical challenges, particularly in socially sensitive contexts. We introduce FairCoT, a novel framework that enhances fairness in text-to-image models through Chain of Thought (CoT) reasoning within multimodal generative large language models. FairCoT employs iterative CoT refinement to systematically mitigate biases, and dynamically adjusts textual prompts in real time, ensuring diverse and equitable representation in generated images. By integrating iterative reasoning processes, FairCoT addresses the limitations of zero-shot CoT in sensitive scenarios, balancing creativity with ethical responsibility. Experimental evaluations across popular text-to-image systems, including DALL-E and various Stable Diffusion variants, demonstrate that FairCoT significantly enhances fairness and diversity without sacrificing image quality or semantic fidelity. By combining robust reasoning, lightweight deployment, and extensibility to multiple models, FairCoT represents a promising step toward more socially responsible and transparent AI-driven content generation.
Related papers
- EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models [31.31018600797305]
We propose a prompt inversion technique called sys for text-to-image diffusion models. Our method outperforms existing methods in terms of image similarity, textual alignment, prompt interpretability and generalizability.
arXiv Detail & Related papers (2025-06-03T16:44:15Z) - Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model [87.23753533733046]
We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder.
arXiv Detail & Related papers (2025-05-29T16:15:48Z) - Unified Multimodal Discrete Diffusion [78.48930545306654]
Multimodal generative models that can understand and generate across multiple modalities are dominated by autoregressive (AR) approaches.
We explore discrete diffusion models as a unified generative formulation in the joint text and image domain.
We present the first Unified Multimodal Discrete Diffusion (UniDisc) model which is capable of jointly understanding and generating text and images.
arXiv Detail & Related papers (2025-03-26T17:59:51Z) - Unified Autoregressive Visual Generation and Understanding with Continuous Tokens [52.21981295470491]
We present UniFluid, a unified autoregressive framework for joint visual generation and understanding.
Our unified autoregressive architecture processes multimodal image and text inputs, generating discrete tokens for text and continuous tokens for image.
We find that, although there is an inherent trade-off between the image generation and understanding tasks, a carefully tuned training recipe enables them to improve each other.
arXiv Detail & Related papers (2025-03-17T17:58:30Z) - Narrowing Information Bottleneck Theory for Multimodal Image-Text Representations Interpretability [15.155556606996994]
Narrowing Information Bottleneck Theory is a novel framework that redefines the traditional bottleneck approach.
Our approach enhances image interpretability by an average of 9%, text interpretability by an average of 58.83%, and accelerates processing speed by 63.95%.
arXiv Detail & Related papers (2025-02-16T19:01:37Z) - Weak Supervision Dynamic KL-Weighted Diffusion Models Guided by Large Language Models [0.0]
We present a novel method for improving text-to-image generation by combining Large Language Models with diffusion models.
Our approach incorporates semantic understanding from pre-trained LLMs to guide the generation process.
Our method significantly improves both the visual quality and alignment of generated images with text descriptions.
arXiv Detail & Related papers (2025-02-02T15:43:13Z) - A Comprehensive Social Bias Audit of Contrastive Vision Language Models [14.632649933582648]
We introduce FairCoT, a novel framework that enhances fairness in text-to-image models. We show that FairCoT significantly enhances fairness and diversity without sacrificing image quality or semantic fidelity.
arXiv Detail & Related papers (2025-01-22T21:08:30Z) - Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models [0.7366405857677226]
Vision-Language Aligned Diffusion (VLAD) model is a generative framework that addresses challenges through a dual-stream strategy.
VLAD decomposes textual prompts into global and local representations, ensuring precise alignment with visual features.
It incorporates a multi-stage diffusion process with hierarchical guidance to generate high-fidelity images.
arXiv Detail & Related papers (2025-01-01T18:27:13Z) - Bias Begets Bias: The Impact of Biased Embeddings on Diffusion Models [0.0]
Text-to-Image (TTI) systems have come under increased scrutiny for social biases.
We investigate embedding spaces as a source of bias for TTI models.
We find that biased multimodal embeddings like CLIP can result in lower alignment scores for representationally balanced TTI models.
arXiv Detail & Related papers (2024-09-15T01:09:55Z) - Coherent and Multi-modality Image Inpainting via Latent Space Optimization [61.99406669027195]
PILOT (inPainting via Latent OpTimization) is an optimization approach grounded on a novel semantic centralization and background preservation loss.
Our method searches latent spaces capable of generating inpainted regions that exhibit high fidelity to user-provided prompts while maintaining coherence with the background.
arXiv Detail & Related papers (2024-07-10T19:58:04Z) - Enhancing Fairness in Neural Networks Using FairVIC [0.0]
Mitigating bias in automated decision-making systems, specifically deep learning models, is a critical challenge in achieving fairness.
We introduce FairVIC, an innovative approach designed to enhance fairness in neural networks by addressing inherent biases at the training stage.
We observe a significant improvement in fairness across all metrics tested, without compromising the model's accuracy to a detrimental extent.
arXiv Detail & Related papers (2024-04-28T10:10:21Z) - Bridging Generative and Discriminative Models for Unified Visual
Perception with Diffusion Priors [56.82596340418697]
We propose a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an adapted expert providing discriminative priors.
Comprehensive investigations unveil potential characteristics of Vermouth, such as varying granularity of perception concealed in latent variables at distinct time steps and various U-net stages.
The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.
arXiv Detail & Related papers (2024-01-29T10:36:57Z) - Fair Text-to-Image Diffusion via Fair Mapping [32.02815667307623]
We propose a flexible, model-agnostic, and lightweight approach that modifies a pre-trained text-to-image diffusion model.
By effectively addressing the issue of implicit language bias, our method produces more fair and diverse image outputs.
arXiv Detail & Related papers (2023-11-29T15:02:01Z) - Steering Language Generation: Harnessing Contrastive Expert Guidance and
Negative Prompting for Coherent and Diverse Synthetic Data Generation [0.0]
Large Language Models (LLMs) hold immense potential to generate synthetic data of high quality and utility.
We introduce contrastive expert guidance, where the difference between the logit distributions of fine-tuned and base language models is emphasised.
We deem this dual-pronged approach to logit reshaping as STEER: Semantic Text Enhancement via Embedding Repositioning.
arXiv Detail & Related papers (2023-08-15T08:49:14Z) - TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition [13.087647740473205]
TF-ICON is a framework that harnesses the power of text-driven diffusion models for cross-domain image-guided composition.
TF-ICON can leverage off-the-shelf diffusion models to perform cross-domain image-guided composition without requiring additional training, finetuning, or optimization.
Our experiments show that equipping Stable Diffusion with the exceptional prompt outperforms state-of-the-art inversion methods on various datasets.
arXiv Detail & Related papers (2023-07-24T02:50:44Z) - Multi-modal Latent Diffusion [8.316365279740188]
Multi-modal Variational Autoencoders are a popular family of models that aim to learn a joint representation of the different modalities.
Existing approaches suffer from a coherence-quality tradeoff, where models with good generation quality lack generative coherence across modalities.
We propose a novel method that uses a set of independently trained, uni-modal, deterministic autoencoders.
arXiv Detail & Related papers (2023-06-07T14:16:44Z) - UniDiff: Advancing Vision-Language Models with Generative and
Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z) - Real-World Image Variation by Aligning Diffusion Inversion Chain [53.772004619296794]
A domain gap exists between generated images and real-world images, which poses a challenge in generating high-quality variations of real-world images.
We propose a novel inference pipeline called Real-world Image Variation by ALignment (RIVAL).
Our pipeline enhances the generation quality of image variations by aligning the image generation process to the source image's inversion chain.
arXiv Detail & Related papers (2023-05-30T04:09:47Z) - DualFair: Fair Representation Learning at Both Group and Individual
Levels via Contrastive Self-supervision [73.80009454050858]
This work presents a self-supervised model, called DualFair, that can debias sensitive attributes like gender and race from learned representations.
Our model jointly optimizes two fairness criteria: group fairness and counterfactual fairness.
arXiv Detail & Related papers (2023-03-15T07:13:54Z) - Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image
Diffusion Models [103.61066310897928]
Recent text-to-image generative models have demonstrated an unparalleled ability to generate diverse and creative imagery guided by a target text prompt.
While revolutionary, current state-of-the-art diffusion models may still fail in generating images that fully convey the semantics in the given text prompt.
We analyze the publicly available Stable Diffusion model and assess the existence of catastrophic neglect, where the model fails to generate one or more of the subjects from the input prompt.
We introduce the concept of Generative Semantic Nursing (GSN), where we seek to intervene in the generative process on the fly during inference time to improve the faithfulness of the generated images.
arXiv Detail & Related papers (2023-01-31T18:10:38Z) - Auto-regressive Image Synthesis with Integrated Quantization [55.51231796778219]
This paper presents a versatile framework for conditional image generation.
It incorporates the inductive bias of CNNs and powerful sequence modeling of auto-regression.
Our method achieves superior diverse image generation performance as compared with the state-of-the-art.
arXiv Detail & Related papers (2022-07-21T22:19:17Z) - Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modeling [54.94763543386523]
Current autoencoder-based disentangled representation learning methods achieve disentanglement by penalizing the (aggregate) posterior to encourage statistical independence of the latent factors.
We present a novel multi-stage modeling approach where the disentangled factors are first learned using a penalty-based disentangled representation learning method.
Then, the low-quality reconstruction is improved with another deep generative model that is trained to model the missing correlated latent variables.
arXiv Detail & Related papers (2020-10-25T18:51:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.