Highly Compressed Tokenizer Can Generate Without Training
- URL: http://arxiv.org/abs/2506.08257v1
- Date: Mon, 09 Jun 2025 21:45:03 GMT
- Title: Highly Compressed Tokenizer Can Generate Without Training
- Authors: L. Lao Beyer, T. Li, X. Chen, S. Karaman, K. He
- Abstract summary: 1D image tokenizers represent images as highly compressed one-dimensional sequences of as few as 32 discrete tokens. We find that the high degree of compression achieved by a 1D tokenizer with vector quantization enables image editing and generative capabilities. Our approach is demonstrated for inpainting and text-guided image editing use cases, and can generate diverse and realistic samples without requiring training of any generative model.
- Score: 0.5033155053523042
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Commonly used image tokenizers produce a 2D grid of spatially arranged tokens. In contrast, so-called 1D image tokenizers represent images as highly compressed one-dimensional sequences of as few as 32 discrete tokens. We find that the high degree of compression achieved by a 1D tokenizer with vector quantization enables image editing and generative capabilities through heuristic manipulation of tokens, demonstrating that even very crude manipulations -- such as copying and replacing tokens between latent representations of images -- enable fine-grained image editing by transferring appearance and semantic attributes. Motivated by the expressivity of the 1D tokenizer's latent space, we construct an image generation pipeline leveraging gradient-based test-time optimization of tokens with plug-and-play loss functions such as reconstruction or CLIP similarity. Our approach is demonstrated for inpainting and text-guided image editing use cases, and can generate diverse and realistic samples without requiring training of any generative model.
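The abstract describes two mechanisms concretely enough to sketch: copying tokens between the 1D latent sequences of two images, and gradient-based test-time optimization of tokens under plug-and-play losses. Below is a minimal sketch of both, assuming a hypothetical `tokenizer` interface (`encode`, `decode`, `quantize`, `decode_embeddings`, `embed_dim`) and a straight-through re-quantization step; these are illustrative stand-ins, not the authors' actual API or exact procedure.

```python
import torch

def token_swap_edit(tokenizer, src_img, ref_img, idx):
    """Crude token manipulation: copy the tokens at positions `idx` from a
    reference image's 1D latent sequence into a source image's sequence,
    then decode, transferring appearance and semantic attributes."""
    src_tokens = tokenizer.encode(src_img)    # e.g. a (32,) tensor of code ids
    ref_tokens = tokenizer.encode(ref_img)
    edited = src_tokens.clone()
    edited[idx] = ref_tokens[idx]
    return tokenizer.decode(edited)

def test_time_generate(tokenizer, loss_fn, n_tokens=32, steps=200, lr=0.1):
    """Gradient-based test-time optimization of token embeddings against a
    plug-and-play loss, with straight-through re-quantization to the VQ
    codebook (one plausible realization; details may differ from the paper)."""
    z = torch.randn(n_tokens, tokenizer.embed_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        z_q = tokenizer.quantize(z)           # snap to nearest codebook entries
        z_st = z + (z_q - z).detach()         # straight-through gradient path
        img = tokenizer.decode_embeddings(z_st)
        loss = loss_fn(img)                   # e.g. masked MSE or -CLIP score
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return tokenizer.decode_embeddings(tokenizer.quantize(z))
```

With `loss_fn` set to a masked reconstruction error against a partially observed image, this performs inpainting; with a negated CLIP image-text similarity to a prompt, it performs text-guided generation, matching the plug-and-play losses named in the abstract.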
Related papers
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens processed by the Transformer. Our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis. On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15. (A sketch of the token-merge idea follows this entry.)
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
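As referenced above, a minimal sketch of the token-merge idea, assuming the visual tokens arrive flattened from an h x w grid; the window size s=2, the pure-reshape implementation, and the omission of the paper's surrounding MLP layers are all simplifying assumptions.

```python
import torch

def token_shuffle(x, h, w, s=2):
    """Merge each s x s window of visual tokens into one token by channel
    concatenation, shrinking the sequence the Transformer processes by a
    factor of s**2. x: (B, h*w, C) tokens flattened from an h x w grid."""
    B, _, C = x.shape
    x = x.view(B, h // s, s, w // s, s, C)        # split grid into s x s windows
    x = x.permute(0, 1, 3, 2, 4, 5)               # (B, h/s, w/s, s, s, C)
    return x.reshape(B, (h // s) * (w // s), s * s * C)

def token_unshuffle(x, h, w, s=2):
    """Inverse operation: expand merged tokens back to the full h x w grid."""
    B = x.shape[0]
    C = x.shape[-1] // (s * s)
    x = x.view(B, h // s, w // s, s, s, C)
    x = x.permute(0, 1, 3, 2, 4, 5)               # (B, h/s, s, w/s, s, C)
    return x.reshape(B, h * w, C)

x = torch.randn(1, 32 * 32, 256)                  # 1024 tokens in
merged = token_shuffle(x, 32, 32)                 # -> (1, 256, 1024)
restored = token_unshuffle(merged, 32, 32)        # -> (1, 1024, 256)
assert torch.equal(restored, x)                   # lossless round trip
```

The Transformer then runs on the s**2-times shorter sequence, and the unshuffle step restores the full-resolution grid afterwards.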
- FlexTok: Resampling Images into 1D Token Sequences of Flexible Length [16.76602756308683]
We introduce FlexTok, a tokenizer that projects 2D images into variable-length, ordered 1D token sequences. We evaluate our approach in an autoregressive generation setting using a simple GPT-style Transformer.
arXiv Detail & Related papers (2025-02-19T18:59:44Z)
- Spectral Image Tokenizer [21.84385276311364]
Image tokenizers map images to sequences of discrete tokens. We propose to tokenize the image spectrum instead, obtained from a discrete wavelet transform (DWT). We evaluate the tokenizer on tasks including multiscale image generation, text-guided image upsampling, and editing. (A sketch of spectrum tokenization follows this entry.)
arXiv Detail & Related papers (2024-12-12T18:59:31Z)
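As referenced above, a minimal sketch of the spectrum-tokenization idea using `pywt`; the uniform scalar quantizer and the haar/3-level choices are illustrative stand-ins for the paper's learned codec.

```python
import numpy as np
import pywt

# Multi-level DWT of a placeholder grayscale image; coeffs[0] is the coarsest
# approximation band, coeffs[1:] are (horizontal, vertical, diagonal) detail
# bands at increasingly fine scales, a natural coarse-to-fine ordering.
img = np.random.rand(256, 256).astype(np.float32)
coeffs = pywt.wavedec2(img, wavelet="haar", level=3)

def to_tokens(band, step=0.05):
    """Uniform scalar quantization of a coefficient band to integer tokens."""
    return np.round(band / step).astype(np.int32).ravel()

tokens = [to_tokens(coeffs[0])]
for ch, cv, cd in coeffs[1:]:
    tokens += [to_tokens(ch), to_tokens(cv), to_tokens(cd)]
sequence = np.concatenate(tokens)
print(sequence.shape)   # (65536,): same coefficient count, reordered by scale
```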
- Adaptive Length Image Tokenization via Recurrent Allocation [81.10081670396956]
Current vision systems assign fixed-length representations to images, regardless of the information content.
Motivated by this, we propose an approach to learn variable-length token representations for 2D images.
arXiv Detail & Related papers (2024-11-04T18:58:01Z)
- Image-GS: Content-Adaptive Image Representation via 2D Gaussians [52.598772767324036]
We introduce Image-GS, a content-adaptive image representation based on 2D Gaussians. It supports hardware-friendly rapid access for real-time usage, requiring only 0.3K MACs to decode a pixel. We demonstrate its versatility with several applications, including texture compression, semantics-aware compression, and joint image compression and restoration. (A sketch of per-pixel Gaussian decoding follows this entry.)
arXiv Detail & Related papers (2024-07-02T00:45:21Z)
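As referenced above, a back-of-the-envelope sketch of decoding a single pixel from a handful of colored 2D Gaussians, illustrating why per-pixel cost can stay in the sub-kMAC range: only the few Gaussians overlapping the pixel contribute. The parameter layout (means, inverse covariances, colors, weights) is an assumption for illustration.

```python
import numpy as np

def decode_pixel(p, mus, inv_covs, colors, weights):
    """p: (2,) pixel coordinate. mus: (N, 2), inv_covs: (N, 2, 2),
    colors: (N, 3), weights: (N,). Returns an RGB value as the
    weight-normalized sum of Gaussian-modulated colors."""
    d = p[None, :] - mus                               # (N, 2) offsets
    mahal = np.einsum("ni,nij,nj->n", d, inv_covs, d)  # squared Mahalanobis
    alpha = weights * np.exp(-0.5 * mahal)             # per-Gaussian mass
    return (alpha[:, None] * colors).sum(0) / (alpha.sum() + 1e-8)

rng = np.random.default_rng(0)
N = 8   # a handful of Gaussians covering this pixel => few MACs per pixel
rgb = decode_pixel(
    np.array([0.5, 0.5]),
    rng.uniform(0, 1, (N, 2)),
    np.broadcast_to(np.eye(2) * 50.0, (N, 2, 2)),
    rng.uniform(0, 1, (N, 3)),
    rng.uniform(0.1, 1.0, N),
)
print(rgb)
```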
- Don't Look into the Dark: Latent Codes for Pluralistic Image Inpainting [8.572133295533643]
We present a method for large-mask pluralistic image inpainting based on the generative framework of discrete latent codes.
Our method learns latent priors, discretized as tokens, by only performing computations at the visible locations of the image. (A sketch of this visible-only computation follows this entry.)
arXiv Detail & Related papers (2024-03-27T01:28:36Z)
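As referenced above, a hedged sketch of the visible-only computation: a token prior attends only to visible positions when predicting categorical distributions for masked codes, and repeated sampling yields pluralistic completions. The single-layer architecture and [MASK]-token handling here are illustrative stand-ins, not the paper's design.

```python
import torch
import torch.nn as nn

class TokenPrior(nn.Module):
    """Predicts a categorical distribution over codebook entries for every
    position while attending only to visible-region tokens."""
    def __init__(self, vocab=1024, dim=256, heads=8):
        super().__init__()
        self.heads = heads
        self.embed = nn.Embedding(vocab + 1, dim)    # extra id = [MASK]
        self.block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, visible):
        """tokens: (B, L) ids with [MASK] at hidden positions;
        visible: (B, L) bool, True where the image is observed."""
        x = self.embed(tokens)
        # Keys/values restricted to visible positions, so masked regions
        # never influence predictions: "don't look into the dark".
        L = tokens.size(1)
        attn_mask = ~visible[:, None, :].expand(-1, L, -1)
        attn_mask = attn_mask.repeat_interleave(self.heads, dim=0)
        return self.head(self.block(x, src_mask=attn_mask))

# Pluralistic sampling: each draw fills the masked tokens differently.
# logits = prior(tokens, visible)
# filled = torch.distributions.Categorical(logits=logits).sample()
```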
- StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis [112.25071764647683]
StrokeNUWA is a pioneering work exploring a better visual representation, "stroke tokens", for vector graphics.
Equipped with stroke tokens, StrokeNUWA can significantly surpass traditional LLM-based and optimization-based methods.
StrokeNUWA achieves up to a 94x inference speedup over prior methods, with an exceptional SVG code compression ratio of 6.9%.
arXiv Detail & Related papers (2024-01-30T15:20:26Z)
- Self-supervised Character-to-Character Distillation for Text Recognition [54.12490492265583]
We propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate text representation learning.
CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution.
arXiv Detail & Related papers (2022-11-01T05:48:18Z)
- Hierarchical Text-Conditional Image Generation with CLIP Latents [20.476720970770128]
We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style.
arXiv Detail & Related papers (2022-04-13T01:10:33Z)
- Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers [51.581926074686535]
We present a new perspective on image synthesis by viewing the task as a visual token generation problem.
The proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks.
arXiv Detail & Related papers (2021-11-05T12:57:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.