GMAIL: Generative Modality Alignment for generated Image Learning
- URL: http://arxiv.org/abs/2602.15368v1
- Date: Tue, 17 Feb 2026 05:40:25 GMT
- Title: GMAIL: Generative Modality Alignment for generated Image Learning
- Authors: Shentong Mo, Sukmin Yun,
- Abstract summary: We propose a novel framework for discriminative use of generated images, coined GMAIL, that explicitly treats generated images as a separate modality from real images.<n>Our framework can be easily incorporated with various vision-language models, and we demonstrate its efficacy throughout extensive experiments.
- Score: 51.071351994330605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative models have made it possible to synthesize highly realistic images, potentially providing an abundant data source for training machine learning models. Despite the advantages of these synthesizable data sources, the indiscriminate use of generated images as real images for training can even cause mode collapse due to modality discrepancies between real and synthetic domains. In this paper, we propose a novel framework for discriminative use of generated images, coined GMAIL, that explicitly treats generated images as a separate modality from real images. Instead of indiscriminately replacing real images with generated ones in the pixel space, our approach bridges the two distinct modalities in the same latent space through a multi-modal learning approach. To be specific, we first fine-tune a model exclusively on generated images using a cross-modality alignment loss and then employ this aligned model to further train various vision-language models with generated images. By aligning the two modalities, our approach effectively leverages the benefits of recent advances in generative models, thereby boosting the effectiveness of generated image learning across a range of vision-language tasks. Our framework can be easily incorporated with various vision-language models, and we demonstrate its efficacy throughout extensive experiments. For example, our framework significantly improves performance on image captioning, zero-shot image retrieval, zero-shot image classification, and long caption retrieval tasks. It also shows positive generated data scaling trends and notable enhancements in the captioning performance of the large multimodal model, LLaVA.
Related papers
- MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer [90.72238747690972]
We present Manzano, a simple and scalable unified framework for multimodal large language models.<n>A single vision encoder feeds two adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation.<n>A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels.
arXiv Detail & Related papers (2025-09-19T17:58:00Z) - Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation [54.588082888166504]
We present Mogao, a unified framework that enables interleaved multi-modal generation through a causal approach.<n>Mogoo integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance.<n>Experiments show that Mogao achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs.
arXiv Detail & Related papers (2025-05-08T17:58:57Z) - Regularized Training with Generated Datasets for Name-Only Transfer of Vision-Language Models [36.59260354292177]
Recent advancements in text-to-image generation have inspired researchers to generate datasets tailored for perception models using generative models.
We aim to fine-tune vision-language models to a specific classification model without access to any real images.
Despite the high fidelity of generated images, we observed a significant performance degradation when fine-tuning the model using the generated datasets.
arXiv Detail & Related papers (2024-06-08T10:43:49Z) - Diverse and Tailored Image Generation for Zero-shot Multi-label Classification [3.354528906571718]
zero-shot multi-label classification has garnered considerable attention for its capacity to operate predictions on unseen labels without human annotations.
prevailing approaches often use seen classes as imperfect proxies for unseen ones, resulting in suboptimal performance.
We propose an innovative solution: generating synthetic data to construct a training set explicitly tailored for proxyless training on unseen labels.
arXiv Detail & Related papers (2024-04-04T01:34:36Z) - Is Synthetic Image Useful for Transfer Learning? An Investigation into Data Generation, Volume, and Utilization [62.157627519792946]
We introduce a novel framework called bridged transfer, which initially employs synthetic images for fine-tuning a pre-trained model to improve its transferability.
We propose dataset style inversion strategy to improve the stylistic alignment between synthetic and real images.
Our proposed methods are evaluated across 10 different datasets and 5 distinct models, demonstrating consistent improvements.
arXiv Detail & Related papers (2024-03-28T22:25:05Z) - Active Generation for Image Classification [45.93535669217115]
We propose to address the efficiency of image generation by focusing on the specific needs and characteristics of the model.
With a central tenet of active learning, our method, named ActGen, takes a training-aware approach to image generation.
arXiv Detail & Related papers (2024-03-11T08:45:31Z) - Unlocking Pre-trained Image Backbones for Semantic Image Synthesis [29.688029979801577]
We propose a new class of GAN discriminators for semantic image synthesis that generates highly realistic images.
Our model, which we dub DP-SIMS, achieves state-of-the-art results in terms of image quality and consistency with the input label maps on ADE-20K, COCO-Stuff, and Cityscapes.
arXiv Detail & Related papers (2023-12-20T09:39:19Z) - UniDiff: Advancing Vision-Language Models with Generative and
Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC)
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z) - Meta Internal Learning [88.68276505511922]
Internal learning for single-image generation is a framework, where a generator is trained to produce novel images based on a single image.
We propose a meta-learning approach that enables training over a collection of images, in order to model the internal statistics of the sample image more effectively.
Our results show that the models obtained are as suitable as single-image GANs for many common image applications.
arXiv Detail & Related papers (2021-10-06T16:27:38Z) - Generating Annotated High-Fidelity Images Containing Multiple Coherent
Objects [10.783993190686132]
We propose a multi-object generation framework that can synthesize images with multiple objects without explicitly requiring contextual information.
We demonstrate how coherency and fidelity are preserved with our method through experiments on the Multi-MNIST and CLEVR datasets.
arXiv Detail & Related papers (2020-06-22T11:33:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.