Related papers: A Mutual Information Perspective on Multiple Latent Variable Generative Models for Positive View Generation

URL: http://arxiv.org/abs/2501.13718v2
Date: Thu, 18 Sep 2025 07:48:33 GMT
Title: A Mutual Information Perspective on Multiple Latent Variable Generative Models for Positive View Generation
Authors: Dario Serez, Marco Cristani, Alessio Del Bue, Vittorio Murino, Pietro Morerio
Abstract summary: In image generation, Multiple Latent Variable Generative Models (MLVGMs) employ multiple latent variables to gradually shape the final images. We propose a novel framework that quantifies the contribution of each latent variable using Mutual Information (MI) as a metric. By leveraging the hierarchical and disentangled variables of MLVGMs, our approach produces diverse and semantically meaningful views without the need for real image data.
Score: 38.44755687012033
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In image generation, Multiple Latent Variable Generative Models (MLVGMs) employ multiple latent variables to gradually shape the final images, from global characteristics to finer and local details (e.g., StyleGAN, NVAE), emerging as powerful tools for diverse applications. Yet their generative dynamics remain only empirically observed, without a systematic understanding of each latent variable's impact. In this work, we propose a novel framework that quantifies the contribution of each latent variable using Mutual Information (MI) as a metric. Our analysis reveals that current MLVGMs often underutilize some latent variables, and provides actionable insights for their use in downstream applications. With this foundation, we introduce a method for generating synthetic data for Self-Supervised Contrastive Representation Learning (SSCRL). By leveraging the hierarchical and disentangled variables of MLVGMs, our approach produces diverse and semantically meaningful views without the need for real image data. Additionally, we introduce a Continuous Sampling (CS) strategy, where the generator dynamically creates new samples during SSCRL training, greatly increasing data variability. Our comprehensive experiments demonstrate the effectiveness of these contributions, showing that MLVGMs' generated views compete on par with or even surpass views generated from real data. This work establishes a principled approach to understanding and exploiting MLVGMs, advancing both generative modeling and self-supervised learning. Code and pre-trained models at: https://github.com/SerezD/mi_ml_gen.
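The view-generation and Continuous Sampling ideas described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that): `toy_generator`, `perturb`, `continuous_pairs`, and the per-level noise scales are hypothetical names and values chosen for illustration, and the sketch does not compute Mutual Information. It only assumes that coarse latent levels are kept fixed while finer levels are perturbed, and that fresh pairs are generated on the fly rather than read from a fixed dataset.

```python
import random


def sample_latents(num_levels, dim, rng):
    """Draw one standard-normal vector per latent level (coarse to fine)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_levels)]


def perturb(latents, scales, rng):
    """Copy `latents`, adding Gaussian noise with std `scales[i]` at level i.

    A scale of 0 leaves that level untouched (hypothetical choice: keep
    coarse, semantics-carrying levels fixed).
    """
    return [
        [v + rng.gauss(0.0, s) if s > 0 else v for v in level]
        for level, s in zip(latents, scales)
    ]


def toy_generator(latents):
    """Hypothetical stand-in for an MLVGM: flattens latents into one vector."""
    return [v for level in latents for v in level]


def continuous_pairs(num_levels, dim, scales, seed=0):
    """Yield fresh (anchor, positive) view pairs indefinitely."""
    rng = random.Random(seed)
    while True:
        anchor = sample_latents(num_levels, dim, rng)
        positive = perturb(anchor, scales, rng)
        yield toy_generator(anchor), toy_generator(positive)


# Usage: the coarsest level is fixed, finer levels get increasing noise.
scales = [0.0, 0.1, 0.5]  # illustrative values, not taken from the paper
pairs = continuous_pairs(num_levels=3, dim=4, scales=scales)
anchor_img, positive_img = next(pairs)
```

In a real setup the toy generator would be replaced by a pre-trained model such as StyleGAN or NVAE, and the choice of which levels to perturb, and by how much, would follow from the MI analysis rather than from fixed constants.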

Related papers

Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation [67.31811007549489]
We propose a Rewriting-driven AugMentation (RAM) paradigm for Vision-Language Navigation (VLN). Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners to promote generalization. Experiments on both the discrete environments (R2R, REVERIE, and R4R) and continuous environments (R2R-CE) show the superior performance and impressive generalization ability of our method.
arXiv Detail & Related papers (2025-03-23T13:18:17Z)
Incrementally Learning Multiple Diverse Data Domains via Multi-Source Dynamic Expansion Model [16.035374682124846]
Continual Learning seeks to develop a model capable of incrementally assimilating new information while retaining prior knowledge. This paper shifts focus to a more complex and realistic learning environment, characterized by data samples sourced from multiple distinct domains.
arXiv Detail & Related papers (2025-01-15T15:49:46Z)
Multimodal Latent Language Modeling with Next-Token Diffusion [111.93906046452125]
Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). We propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers.
arXiv Detail & Related papers (2024-12-11T18:57:32Z)
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [49.439311430360284]
We introduce a novel data synthesis method inspired by contrastive learning and image difference captioning. Our key idea involves challenging the model to discern both matching and distinct elements. We leverage this generated dataset to fine-tune state-of-the-art (SOTA) MLLMs.
arXiv Detail & Related papers (2024-08-08T17:10:16Z)
Can Generative Models Improve Self-Supervised Representation Learning? [0.7999703756441756]
We introduce a framework that enriches the self-supervised learning (SSL) paradigm by utilizing generative models to produce semantically consistent image augmentations. Our results show that our framework significantly enhances the quality of learned visual representations by up to 10% Top-1 accuracy in downstream tasks.
arXiv Detail & Related papers (2024-03-09T17:17:07Z)
Beyond DAGs: A Latent Partial Causal Model for Multimodal Learning [80.44084021062105]
We propose a novel latent partial causal model for multimodal data, featuring two latent coupled variables, connected by an undirected edge, to represent the transfer of knowledge across modalities. Under specific statistical assumptions, we establish an identifiability result, demonstrating that representations learned by multimodal contrastive learning correspond to the latent coupled variables up to a trivial transformation. Experiments show that a pre-trained CLIP model embodies disentangled representations, enabling few-shot learning and improving domain generalization across diverse real-world datasets.
arXiv Detail & Related papers (2024-02-09T07:18:06Z)
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models. Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
Learning Joint Latent Space EBM Prior Model for Multi-layer Generator [44.4434704520236]
We study the fundamental problem of learning multi-layer generator models. We propose an energy-based model (EBM) on the joint latent space over all layers of latent variables. Our experiments demonstrate that the learned model can be expressive in generating high-quality images.
arXiv Detail & Related papers (2023-06-10T00:27:37Z)
Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks. We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling. Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z)
Style-Hallucinated Dual Consistency Learning: A Unified Framework for Visual Domain Generalization [113.03189252044773]
We propose a unified framework, Style-HAllucinated Dual consistEncy learning (SHADE), to handle domain shift in various visual tasks. Our versatile SHADE can significantly enhance the generalization in various visual recognition tasks, including image classification, semantic segmentation and object detection.
arXiv Detail & Related papers (2022-12-18T11:42:51Z)
Local Manifold Augmentation for Multiview Semantic Consistency [40.28906509638541]
We propose to extract the underlying data variation from datasets and construct a novel augmentation operator, named local manifold augmentation (LMA). LMA shows the ability to create an infinite number of data views, preserve semantics, and simulate complicated variations in object pose, viewpoint, lighting condition, background, etc.
arXiv Detail & Related papers (2022-11-05T02:00:13Z)
The Geometry of Self-supervised Learning Models and its Impact on Transfer Learning [62.601681746034956]
Self-supervised learning (SSL) has emerged as a desirable paradigm in computer vision. We propose a data-driven geometric strategy to analyze different SSL models using local neighborhoods in the feature space induced by each.
arXiv Detail & Related papers (2022-09-18T18:15:38Z)
Generalized Hidden Parameter MDPs: Transferable Model-based RL in a Handful of Trials [13.051708608864539]
Generalized Hidden Parameter MDPs (GHP-MDPs) describe a family of MDPs where both dynamics and reward can change as a function of hidden parameters that vary across tasks. We experimentally demonstrate state-of-the-art performance and sample-efficiency on a new challenging MuJoCo task using reward and dynamics latent spaces.
arXiv Detail & Related papers (2020-02-08T02:49:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.