Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens
- URL: http://arxiv.org/abs/2501.07730v2
- Date: Sat, 02 Aug 2025 07:11:16 GMT
- Title: Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens
- Authors: Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, Liang-Chieh Chen,
- Abstract summary: We introduce Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok)<n>TA-TiTok is an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens.<n>We also introduce a family of text-to-image Masked Generative Models (MaskGen) trained exclusively on open data.
- Score: 46.361925096761915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them challenging to replicate. In this work, we introduce Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. TA-TiTok uniquely integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance. TA-TiTok also benefits from a simplified, yet effective, one-stage training process, eliminating the need for the complex two-stage distillation used in previous 1-dimensional tokenizers. This design allows for seamless scalability to large datasets. Building on this, we introduce a family of text-to-image Masked Generative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data. We aim to release both the efficient, strong TA-TiTok tokenizers and the open-data, open-weight MaskGen models to promote broader access and democratize the field of text-to-image masked generative models.
Related papers
- UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation [15.585320469279813]
Masked Generative Transformers have gained attention as an alternative to Autoregressive Models to overcome the inherent limitations of causal attention and autoregressive decoding.<n>We propose Unmasking with Contrastive Attention Guidance (UNCAGE), a novel training-free method that improves compositional fidelity by leveraging attention maps.<n>UNCAGE consistently improves performance in both quantitative and qualitative evaluations across multiple benchmarks and metrics, with negligible inference overhead.
arXiv Detail & Related papers (2025-08-07T13:51:17Z) - Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in Transformer.
Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis.
In GenAI-benchmark, our 2.7B model achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen by 0.18 and diffusion models LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z) - Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think [38.258453761376586]
We propose Dream Engine, an efficient framework designed for arbitrary text-image interleaved control in image generation models.
Our approach utilizes a two-stage training paradigm, consisting of joint text-image alignment and multimodal interleaved instruction tuning.
Our experiments demonstrate that this training method is effective, achieving a 0.69 overall score on the GenEval benchmark.
arXiv Detail & Related papers (2025-02-27T15:08:39Z) - MoMo: A shared encoder Model for text, image and multi-Modal
representations [4.812718493682455]
We propose a self-supervised shared encoder model that achieves strong results on several visual, language and multimodal benchmarks.
We use a single transformer with all the encoder layers processing both the text and the image modalities.
arXiv Detail & Related papers (2023-04-11T22:26:10Z) - Language Quantized AutoEncoders: Towards Unsupervised Text-Image
Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z) - Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z) - LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We propose the first work to train text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z) - Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.