Sample- and Parameter-Efficient Auto-Regressive Image Models
- URL: http://arxiv.org/abs/2411.15648v1
- Date: Sat, 23 Nov 2024 20:40:46 GMT
- Title: Sample- and Parameter-Efficient Auto-Regressive Image Models
- Authors: Elad Amrani, Leonid Karlinsky, Alex Bronstein,
- Abstract summary: We introduce XTRA, a vision model pre-trained with a novel auto-regressive objective.
XTRA employs a Block Causal Mask, where each Block represents k $times$ k tokens rather than relying on a standard causal mask.
By reconstructing pixel values block by block, XTRA captures higher-level structural patterns over larger image regions.
- Score: 15.501863812794209
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce XTRA, a vision model pre-trained with a novel auto-regressive objective that significantly enhances both sample and parameter efficiency compared to previous auto-regressive image models. Unlike contrastive or masked image modeling methods, which have not been demonstrated as having consistent scaling behavior on unbalanced internet data, auto-regressive vision models exhibit scalable and promising performance as model and dataset size increase. In contrast to standard auto-regressive models, XTRA employs a Block Causal Mask, where each Block represents k $\times$ k tokens rather than relying on a standard causal mask. By reconstructing pixel values block by block, XTRA captures higher-level structural patterns over larger image regions. Predicting on blocks allows the model to learn relationships across broader areas of pixels, enabling more abstract and semantically meaningful representations than traditional next-token prediction. This simple modification yields two key results. First, XTRA is sample-efficient. Despite being trained on 152$\times$ fewer samples (13.1M vs. 2B), XTRA ViT-H/14 surpasses the top-1 average accuracy of the previous state-of-the-art auto-regressive model across 15 diverse image recognition benchmarks. Second, XTRA is parameter-efficient. Compared to auto-regressive models trained on ImageNet-1k, XTRA ViT-B/16 outperforms in linear and attentive probing tasks, using 7-16$\times$ fewer parameters (85M vs. 1.36B/0.63B).
Related papers
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in Transformer.
Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis.
In GenAI-benchmark, our 2.7B model achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen by 0.18 and diffusion models LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z) - Frequency Autoregressive Image Generation with Continuous Tokens [31.833852108014312]
We introduce the frequency progressive autoregressive (textbfFAR) paradigm and instantiate FAR with the continuous tokenizer.
We demonstrate the efficacy of FAR through comprehensive experiments on the ImageNet dataset.
arXiv Detail & Related papers (2025-03-07T10:34:04Z) - Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation [34.112157859384645]
Autoregressive (AR) modeling underpins state-of-the-art language and visual generative models.
Traditionally, a token'' is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision.
We propose xAR, a framework that extends the notion of a token to an entity X.
arXiv Detail & Related papers (2025-02-27T18:59:08Z) - M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation [39.97174784206976]
We show that this scale-wise autoregressive framework can be effectively decoupled into textitintra-scale modeling
We apply linear-complexity mechanisms like Mamba to substantially reduce computational overhead.
Experiments demonstrate that our method outperforms existing models in both image quality and generation speed.
arXiv Detail & Related papers (2024-11-15T18:54:42Z) - Randomized Autoregressive Visual Generation [26.195148077398223]
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation.
RAR sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks.
On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods.
arXiv Detail & Related papers (2024-11-01T17:59:58Z) - Fast constrained sampling in pre-trained diffusion models [77.21486516041391]
Diffusion models have dominated the field of large, generative image models.
We propose an algorithm for fast-constrained sampling in large pre-trained diffusion models.
arXiv Detail & Related papers (2024-10-24T14:52:38Z) - Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective [52.778766190479374]
Latent-based image generative models have achieved notable success in image generation tasks.
Despite sharing the same latent space, autoregressive models significantly lag behind LDMs and MIMs in image generation.
We propose a simple but effective discrete image tokenizer to stabilize the latent space for image generative modeling.
arXiv Detail & Related papers (2024-10-16T12:13:17Z) - Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding [60.188309982690335]
We propose a training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding (SJD), to accelerate auto-regressive text-to-image generation.
By introducing a probabilistic convergence criterion, our SJD accelerates the inference of auto-regressive text-to-image generation while maintaining the randomness in sampling-based token decoding.
arXiv Detail & Related papers (2024-10-02T16:05:27Z) - Autoregressive Image Generation without Vector Quantization [31.798754606008067]
Conventional wisdom holds that autoregressive models for image generation are typically accompanied by vector-quantized tokens.
We propose to model the per-token probability distribution using a diffusion procedure, which allows us to apply autoregressive models in a continuous-valued space.
arXiv Detail & Related papers (2024-06-17T17:59:58Z) - Emage: Non-Autoregressive Text-to-Image Generation [63.347052548210236]
Non-autoregressive text-to-image models efficiently generate hundreds of image tokens in parallel.
Our model with 346M parameters generates an image of 256$times$256 with about one second on one V100 GPU.
arXiv Detail & Related papers (2023-12-22T10:01:54Z) - Analog Bits: Generating Discrete Data using Diffusion Models with
Self-Conditioning [90.02873747873444]
Bit Diffusion is a generic approach for generating discrete data with continuous diffusion models.
The proposed approach can achieve strong performance in both discrete image generation and image captioning tasks.
For image captioning on MS-COCO dataset, our approach achieves competitive results compared to autoregressive models.
arXiv Detail & Related papers (2022-08-08T15:08:40Z) - Aligned Cross Entropy for Non-Autoregressive Machine Translation [120.15069387374717]
We propose aligned cross entropy (AXE) as an alternative loss function for training of non-autoregressive models.
AXE-based training of conditional masked language models (CMLMs) substantially improves performance on major WMT benchmarks.
arXiv Detail & Related papers (2020-04-03T16:24:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.