Product of Experts for Visual Generation
- URL: http://arxiv.org/abs/2506.08894v2
- Date: Thu, 09 Oct 2025 01:37:47 GMT
- Title: Product of Experts for Visual Generation
- Authors: Yunzhi Zhang, Carson Murtuza-Lanier, Zizhang Li, Yilun Du, Jiajun Wu,
- Abstract summary: We propose a Product of Experts (PoE) framework that performs inference-time knowledge composition from heterogeneous models.<n>Our framework shows practical benefits in image and video synthesis tasks, yielding better controllability than monolithic methods.
- Score: 60.91134809173301
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern neural models capture rich priors and have complementary knowledge over shared data domains, e.g., images and videos. Integrating diverse knowledge from multiple sources -- including visual generative models, visual language models, and sources with human-crafted knowledge such as graphics engines and physics simulators -- remains under-explored. We propose a Product of Experts (PoE) framework that performs inference-time knowledge composition from heterogeneous models. This training-free approach samples from the product distribution across experts via Annealed Importance Sampling (AIS). Our framework shows practical benefits in image and video synthesis tasks, yielding better controllability than monolithic methods and additionally providing flexible user interfaces for specifying visual generation goals.
Related papers
- Training Data Attribution for Image Generation using Ontology-Aligned Knowledge Graphs [3.686386213696443]
We introduce a framework for interpreting generative outputs through the automatic construction of knowledge graphs.<n>Our method extracts structured triples from images, aligned with a domain-specific ontology.<n>By comparing the KGs of generated and training images, we can trace potential influences, enabling copyright analysis, dataset transparency, and interpretable AI.
arXiv Detail & Related papers (2025-12-02T12:45:20Z) - Personalized Image Generation with Deep Generative Models: A Decade Survey [51.26287478042516]
We present a review of generalized personalized image generation across various generative models.<n>We first define a unified framework that standardizes the personalization process across different generative models.<n>We then provide an in-depth analysis of personalization techniques within each generative model, highlighting their unique contributions and innovations.
arXiv Detail & Related papers (2025-02-18T17:34:04Z) - Generative AI for Vision: A Comprehensive Study of Frameworks and Applications [0.0]
Generative AI is transforming image synthesis, enabling the creation of high-quality, diverse, and photorealistic visuals.<n>This work presents a structured classification of image generation techniques based on the nature of the input.<n>We highlight key frameworks including DALL-E, ControlNet, and DeepSeek Janus-Pro, and address challenges such as computational costs, data biases, and output alignment with user intent.
arXiv Detail & Related papers (2025-01-29T22:42:05Z) - A Review of Modern Recommender Systems Using Generative Models (Gen-RecSys) [57.30228361181045]
This survey connects key advancements in recommender systems using Generative Models (Gen-RecSys)
It covers: interaction-driven generative models; the use of large language models (LLM) and textual data for natural language recommendation; and the integration of multimodal models for generating and processing images/videos in RS.
Our work highlights necessary paradigms for evaluating the impact and harm of Gen-RecSys and identifies open challenges.
arXiv Detail & Related papers (2024-03-31T06:57:57Z) - Can Generative Models Improve Self-Supervised Representation Learning? [0.7999703756441756]
We introduce a framework that enriches the self-supervised learning (SSL) paradigm by utilizing generative models to produce semantically consistent image augmentations.<n>Our results show that our framework significantly enhances the quality of learned visual representations by up to 10% Top-1 accuracy in downstream tasks.
arXiv Detail & Related papers (2024-03-09T17:17:07Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized
Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - Pre-training Contextualized World Models with In-the-wild Videos for
Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z) - K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
arXiv Detail & Related papers (2022-04-20T04:47:01Z) - Learning semantic Image attributes using Image recognition and knowledge
graph embeddings [0.3222802562733786]
We propose a shared learning approach to learn semantic attributes of images by combining a knowledge graph embedding model with the recognized attributes of images.
The proposed approach is a step towards bridging the gap between frameworks which learn from large amounts of data and frameworks which use a limited set of predicates to infer new knowledge.
arXiv Detail & Related papers (2020-09-12T15:18:48Z) - Generating Annotated High-Fidelity Images Containing Multiple Coherent
Objects [10.783993190686132]
We propose a multi-object generation framework that can synthesize images with multiple objects without explicitly requiring contextual information.
We demonstrate how coherency and fidelity are preserved with our method through experiments on the Multi-MNIST and CLEVR datasets.
arXiv Detail & Related papers (2020-06-22T11:33:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.