What Drives Compositional Generalization in Visual Generative Models?
- URL: http://arxiv.org/abs/2510.03075v2
- Date: Mon, 06 Oct 2025 10:01:02 GMT
- Title: What Drives Compositional Generalization in Visual Generative Models?
- Authors: Karim Farid, Rajat Sahay, Yumna Ali Alnaggar, Simon Schrodi, Volker Fischer, Cordelia Schmid, Thomas Brox
- Abstract summary: We conduct a systematic study of how various design choices influence compositional generalization in image and video generation. We identify two key factors: (i) whether the training objective operates on a discrete or continuous distribution, and (ii) to what extent conditioning provides information about the constituent concepts during training. Building on these insights, we show that relaxing the discrete loss with an auxiliary continuous JEPA-based objective can improve compositional performance in discrete models like MaskGIT.
- Score: 56.01574461407906
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compositional generalization, the ability to generate novel combinations of known concepts, is a key ingredient for visual generative models. Yet, not all mechanisms that enable or inhibit it are fully understood. In this work, we conduct a systematic study of how various design choices influence compositional generalization in image and video generation, positively or negatively. Through controlled experiments, we identify two key factors: (i) whether the training objective operates on a discrete or continuous distribution, and (ii) to what extent conditioning provides information about the constituent concepts during training. Building on these insights, we show that relaxing the discrete loss with an auxiliary continuous JEPA-based objective can improve compositional performance in discrete models like MaskGIT.
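To make the final point concrete, below is a minimal, hedged PyTorch sketch of what combining a MaskGIT-style masked-token cross-entropy with an auxiliary continuous JEPA-style feature-prediction objective could look like. The function name, the `jepa_weight` coefficient, and the tensor layout are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target_tokens, mask,
                  pred_features, target_features, jepa_weight=0.5):
    """Hedged sketch: MaskGIT-style discrete loss relaxed with an
    auxiliary continuous JEPA-style objective (names are illustrative).

    logits:          (B, N, V) token predictions
    target_tokens:   (B, N)    ground-truth discrete token ids
    mask:            (B, N)    bool, True where tokens were masked
    pred_features:   (B, N, D) predictor output in continuous latent space
    target_features: (B, N, D) target-encoder features
    """
    # Discrete objective: cross-entropy on the masked positions only.
    ce = F.cross_entropy(logits[mask], target_tokens[mask])
    # Continuous objective: regress target-encoder features, with the
    # target branch detached as is typical in JEPA-style setups.
    jepa = F.smooth_l1_loss(pred_features[mask],
                            target_features[mask].detach())
    return ce + jepa_weight * jepa
```

The design choice mirrored here is that the discrete cross-entropy stays intact and is only relaxed by adding a continuous regression term on the masked positions.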
Related papers
- UniG2U-Bench: Do Unified Models Advance Multimodal Understanding? [50.92401586025528]
Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. We introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks.
arXiv Detail & Related papers (2026-03-03T18:36:16Z) - Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms [55.1784306456972]
Mixture-of-Experts (MoE) architectures have emerged as a promising direction, offering efficiency and scalability by activating only a subset of parameters during inference. We use an internal metric to investigate the mechanisms of the MoE architecture by explicitly incorporating routing mechanisms and analyzing expert-level behaviors. We uncover several findings: (1) neuron utilization decreases as models evolve, reflecting stronger generalization; (2) training exhibits a dynamic trajectory in which benchmark performance alone provides limited signal; (3) task completion emerges from the collaborative contributions of multiple experts, with shared experts driving concentration; and (4) activation patterns at the neuron level provide a fine-grained proxy for data diversity.
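As a hedged illustration of what an expert-level "neuron utilization" measurement might look like (the paper's exact internal metric is not given in this summary; the threshold `eps` is an assumption):

```python
import torch

def neuron_utilization(activations, eps=1e-3):
    """Hedged sketch of an expert-level utilization metric (illustrative).

    activations: (num_tokens, num_neurons) hidden activations of one expert.
    Returns the fraction of neurons whose mean |activation| exceeds eps.
    """
    mean_abs = activations.abs().mean(dim=0)       # per-neuron average
    return (mean_abs > eps).float().mean().item()  # utilized fraction
```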
arXiv Detail & Related papers (2025-09-28T15:13:38Z) - UNIFORM: Unifying Knowledge from Large-scale and Diverse Pre-trained Models [62.76435672183968]
We introduce a novel framework, namely UNIFORM, for knowledge transfer from a diverse set of off-the-shelf models into one student model. We propose a dedicated voting mechanism to capture the consensus of knowledge both at the logit level and at the feature level. Experiments demonstrate that UNIFORM effectively enhances unsupervised object recognition performance compared to strong knowledge transfer baselines.
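A minimal sketch of logit-level and feature-level consensus, assuming majority voting over teacher predictions and averaging of normalized teacher embeddings; UNIFORM's dedicated voting mechanism may differ:

```python
import torch
import torch.nn.functional as F

def logit_vote(teacher_logits):
    """Logit-level consensus as a majority vote over the teachers'
    argmax predictions (an assumption, not necessarily UNIFORM's rule).

    teacher_logits: list of (B, C) logit tensors from off-the-shelf models.
    """
    preds = torch.stack([t.argmax(dim=-1) for t in teacher_logits])  # (T, B)
    return preds.mode(dim=0).values                                  # (B,)

def feature_consensus(teacher_features):
    """Feature-level consensus as the mean of L2-normalized teacher
    embeddings; assumes teachers share a feature dimension
    (e.g., via projection heads)."""
    feats = [F.normalize(f, dim=-1) for f in teacher_features]
    return torch.stack(feats).mean(dim=0)
```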
arXiv Detail & Related papers (2025-08-27T00:56:11Z) - Does Data Scaling Lead to Visual Compositional Generalization? [21.242714408660508]
We find that compositional generalization is driven by data diversity, not mere data scale. We prove this structure is key to efficiency, enabling perfect generalization from few observed combinations.
arXiv Detail & Related papers (2025-07-09T17:59:03Z) - Unveiling Concept Attribution in Diffusion Models [12.77092262246859]
Diffusion models have shown remarkable abilities in generating realistic and high-quality images from text prompts. Recent works employ causal tracing to localize knowledge-storing layers in generative models without showing how other layers contribute to the target concept. We decompose diffusion models using component attribution, systematically unveiling the importance of each component in generating a concept.
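One common way to realize component attribution is ablation: zero out a component's weights and measure the drop in a concept score. The sketch below is a hedged, generic version of that idea, not necessarily the paper's method; `score_fn` and the `components` dictionary are assumed inputs:

```python
import torch

def component_attribution(model, components, score_fn):
    """Hedged sketch: estimate each component's importance by ablating
    (zeroing) its weights and measuring the drop in a concept score.

    components: dict mapping a name to a weight tensor of the model.
    score_fn:   callable scoring how strongly the target concept
                appears in the model's generations (assumed given).
    """
    base_score = score_fn(model)
    importance = {}
    for name, weight in components.items():
        saved = weight.data.clone()   # remember original weights
        weight.data.zero_()           # ablate the component
        importance[name] = base_score - score_fn(model)
        weight.data.copy_(saved)      # restore
    return importance
```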
arXiv Detail & Related papers (2024-12-03T16:34:49Z) - Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors [56.82596340418697]
We propose a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an adapted expert providing discriminative priors.
Comprehensive investigations unveil potential characteristics of Vermouth, such as the varying granularity of perception concealed in latent variables at distinct time steps and various U-Net stages.
The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.
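A hedged structural sketch of the described three-part design (frozen generative backbone, unified head, discriminative expert); the module interfaces and the way the heads are composed are assumptions:

```python
import torch
import torch.nn as nn

class UnifiedPerceptionSketch(nn.Module):
    """Hedged sketch of the described design (interfaces are assumptions):
    a frozen pre-trained generative backbone supplies hierarchical
    features, a unified head (U-head) integrates them, and an adapted
    expert refines the result with discriminative priors."""

    def __init__(self, sd_backbone, u_head, expert):
        super().__init__()
        self.backbone = sd_backbone
        for p in self.backbone.parameters():   # keep generative priors frozen
            p.requires_grad_(False)
        self.u_head = u_head                   # fuses multi-stage features
        self.expert = expert                   # discriminative prior module

    def forward(self, x, t):
        with torch.no_grad():
            feats = self.backbone(x, t)        # list of hierarchical features
        fused = self.u_head(feats)             # integrate across stages
        return self.expert(fused)              # task-specific prediction
```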
arXiv Detail & Related papers (2024-01-29T10:36:57Z) - Advancing Ante-Hoc Explainable Models through Generative Adversarial Networks [24.45212348373868]
This paper presents a novel concept learning framework for enhancing model interpretability and performance in visual classification tasks.
Our approach appends an unsupervised explanation generator to the primary classifier network and makes use of adversarial training.
This work presents a significant step towards building inherently interpretable deep vision models with task-aligned concept representations.
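A hedged sketch of the appended-generator setup; the module names and the classifier's (logits, features) interface are illustrative assumptions:

```python
import torch.nn as nn

class AnteHocExplainableModel(nn.Module):
    """Hedged sketch: an unsupervised explanation generator appended to
    the primary classifier; a discriminator supplies the adversarial
    training signal. Interfaces are illustrative assumptions."""

    def __init__(self, classifier, explainer, discriminator):
        super().__init__()
        self.classifier = classifier        # primary task network
        self.explainer = explainer          # generates concept explanations
        self.discriminator = discriminator  # adversary scoring explanations

    def forward(self, x):
        logits, features = self.classifier(x)   # assumed dual output
        explanation = self.explainer(features)  # unsupervised explanations
        adv_score = self.discriminator(explanation)
        return logits, explanation, adv_score
```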
arXiv Detail & Related papers (2024-01-09T16:16:16Z) - Unifying Self-Supervised Clustering and Energy-Based Models [9.3176264568834]
We establish a principled connection between self-supervised learning and generative models. We show that our solution can be integrated into a neuro-symbolic framework to tackle a simple yet non-trivial instantiation of the symbol grounding problem.
arXiv Detail & Related papers (2023-12-30T04:46:16Z) - Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task [18.99393947706941]
We study compositional generalization in conditional diffusion models in a synthetic setting. We find that the order in which the ability to generate samples emerges is governed by the structure of the underlying data-generating process. Our study lays a foundation for understanding capabilities and compositionality in generative models from a data-centric perspective.
arXiv Detail & Related papers (2023-10-13T18:00:59Z) - On Feature Diversity in Energy-based Models [98.78384185493624]
An energy-based model (EBM) is typically composed of inner models that learn a combination of different features to generate an energy mapping for each input configuration.
We extend the probably approximately correct (PAC) theory of EBMs and analyze the effect of redundancy reduction on the performance of EBMs.
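As a hedged sketch of the structure described in the first sentence (inner models whose features are combined into a scalar energy; all names illustrative):

```python
import torch
import torch.nn as nn

class SimpleEBM(nn.Module):
    """Hedged sketch of the described EBM structure: inner feature
    extractors whose outputs are combined into a scalar energy."""

    def __init__(self, feature_nets, dim):
        super().__init__()
        self.feature_nets = nn.ModuleList(feature_nets)  # inner models
        self.head = nn.Linear(dim * len(feature_nets), 1)

    def forward(self, x):
        feats = torch.cat([f(x) for f in self.feature_nets], dim=-1)
        return self.head(feats).squeeze(-1)  # scalar energy per input
```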
arXiv Detail & Related papers (2023-06-02T12:30:42Z) - Concept-Centric Transformers: Enhancing Model Interpretability through Object-Centric Concept Learning within a Shared Global Workspace [1.6574413179773757]
Concept-Centric Transformers is a simple yet effective configuration of the shared global workspace for interpretability.
We show that our model achieves better classification accuracy than all baselines across all problems.
arXiv Detail & Related papers (2023-05-25T06:37:39Z) - Robust and Controllable Object-Centric Learning through Energy-based Models [95.68748828339059]
The proposed method is a conceptually simple and general approach to learning object-centric representations through an energy-based model. We show that it can be easily integrated into existing architectures and can effectively extract high-quality object-centric representations.
arXiv Detail & Related papers (2022-10-11T15:11:15Z)