Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers
- URL: http://arxiv.org/abs/2111.03481v1
- Date: Fri, 5 Nov 2021 12:57:50 GMT
- Title: Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers
- Authors: Yanhong Zeng, Huan Yang, Hongyang Chao, Jianbo Wang, Jianlong Fu
- Abstract summary: We present a new perspective on image synthesis by viewing the task as a visual token generation problem.
The proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks.
- Score: 51.581926074686535
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a new perspective on image synthesis by viewing the
task as a visual token generation problem. Different from existing paradigms
that directly synthesize a full image from a single input (e.g., a latent
code), the new formulation enables a flexible local manipulation for different
image regions, which makes it possible to learn content-aware and fine-grained
style control for image synthesis. Specifically, it takes as input a sequence
of latent tokens to predict the visual tokens for synthesizing an image. Under
this perspective, we propose a token-based generator (i.e., TokenGAN).
In particular, TokenGAN takes as input two semantically different kinds of
tokens, i.e., learned constant content tokens and style tokens from the latent
space. Given a sequence of style tokens, TokenGAN controls the image synthesis
by assigning the styles to the content tokens via an attention mechanism
with a Transformer. We conduct extensive experiments and show that
the proposed TokenGAN has achieved state-of-the-art results on several
widely-used image synthesis benchmarks, including FFHQ and LSUN CHURCH with
different resolutions. Notably, the generator synthesizes high-fidelity
images at 1024x1024 resolution while dispensing with convolutions entirely.
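The style-assignment step described above can be illustrated with a minimal cross-attention sketch. This is not the authors' implementation; the function name, dimensions, and random projection matrices are illustrative assumptions. Content tokens act as queries and style tokens as keys/values, so each content token gathers a weighted mix of styles:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_style_attention(content_tokens, style_tokens, W_q, W_k, W_v):
    """Assign styles to content tokens via cross-attention:
    content tokens are queries, style tokens are keys/values."""
    Q = content_tokens @ W_q                  # (n_content, d)
    K = style_tokens @ W_k                    # (n_style, d)
    V = style_tokens @ W_v                    # (n_style, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # scaled dot-product
    attn = softmax(scores, axis=-1)           # each content token attends over styles
    return content_tokens + attn @ V          # residual style injection

rng = np.random.default_rng(0)
d = 16
content = rng.standard_normal((64, d))        # learned constant content tokens
styles = rng.standard_normal((8, d))          # style tokens from the latent space
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = token_style_attention(content, styles, Wq, Wk, Wv)
```

Because the styles are assigned per content token, different image regions can receive different styles, which is what enables the content-aware, fine-grained control claimed in the abstract.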
Related papers
- Adaptive Length Image Tokenization via Recurrent Allocation [81.10081670396956]
Current vision systems assign fixed-length representations to images, regardless of the information content.
Inspired by this, we propose an approach to learn variable-length token representations for 2D images.
arXiv Detail & Related papers (2024-11-04T18:58:01Z)
- Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning [41.81009725976217]
We provide semantically-meaningful visual tokens to transformer encoders within a vision-language pre-training framework.
We demonstrate notable improvements over ViTs in learned representation quality across text-to-image and image-to-text retrieval tasks.
arXiv Detail & Related papers (2024-05-26T01:46:22Z)
- Don't Look into the Dark: Latent Codes for Pluralistic Image Inpainting [8.572133295533643]
We present a method for large-mask pluralistic image inpainting based on the generative framework of discrete latent codes.
Our method learns latent priors, discretized as tokens, by only performing computations at the visible locations of the image.
arXiv Detail & Related papers (2024-03-27T01:28:36Z)
- Vision Transformers with Mixed-Resolution Tokenization [34.18534105043819]
Vision Transformer models process input images by dividing them into a spatially regular grid of equal-size patches.
We introduce a novel image tokenization scheme, replacing the standard uniform grid with a mixed-resolution sequence of tokens.
Using the Quadtree algorithm and a novel saliency scorer, we construct a patch mosaic where low-saliency areas of the image are processed in low resolution.
arXiv Detail & Related papers (2023-04-01T10:39:46Z)
- Character-Centric Story Visualization via Visual Planning and Token Alignment [53.44760407148918]
Story visualization advances the traditional text-to-image generation by enabling multiple image generation based on a complete story.
A key challenge of consistent story visualization is to preserve the characters that are essential to the story.
We propose to adapt a recent work that augments Vector-Quantized Variational Autoencoders with a text-to-visual-token architecture.
arXiv Detail & Related papers (2022-10-16T06:50:39Z)
- Frido: Feature Pyramid Diffusion for Complex Scene Image Synthesis [77.23998762763078]
We present Frido, a Feature Pyramid Diffusion model performing a multi-scale coarse-to-fine denoising process for image synthesis.
Our model decomposes an input image into scale-dependent vector quantized features, followed by a coarse-to-fine gating for producing image output.
We conduct extensive experiments over various unconditioned and conditional image generation tasks, ranging from text-to-image synthesis, layout-to-image, scene-graph-to-image, to label-to-image.
arXiv Detail & Related papers (2022-08-29T17:37:29Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
- CoGS: Controllable Generation and Search from Sketch and Style [35.625940819995996]
We present CoGS, a method for the style-conditioned, sketch-driven synthesis of images.
CoGS enables exploration of diverse appearance possibilities for a given sketched object.
We show that our model, trained on the 125 object classes of our newly created Pseudosketches dataset, is capable of producing a diverse gamut of semantic content and appearance styles.
arXiv Detail & Related papers (2022-03-17T18:36:11Z)
- Ensembling with Deep Generative Views [72.70801582346344]
Generative models can synthesize "views" of artificial images that mimic real-world variations, such as changes in color or pose.
Here, we investigate whether such views can be applied to real images to benefit downstream analysis tasks such as image classification.
We use StyleGAN2 as the source of generative augmentations and investigate this setup on classification tasks involving facial attributes, cat faces, and cars.
arXiv Detail & Related papers (2021-04-29T17:58:35Z)
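The mixed-resolution tokenization entry above (Vision Transformers with Mixed-Resolution Tokenization) can be sketched with a small quadtree routine. This is a hedged illustration, not that paper's implementation: the function name is hypothetical, and local pixel variance stands in for the paper's learned saliency scorer. Low-saliency regions stay as large (low-resolution) patches; salient regions are recursively subdivided:

```python
import numpy as np

def quadtree_patches(image, min_size=4, max_size=16, threshold=0.01):
    """Build a mixed-resolution patch mosaic: salient regions get small
    (high-resolution) patches, flat regions stay as large patches.
    Saliency is approximated here by local pixel variance."""
    patches = []

    def split(y, x, size):
        block = image[y:y + size, x:x + size]
        # subdivide only if the block is salient and can still be halved
        if size > min_size and block.var() > threshold:
            half = size // 2
            for dy in (0, half):
                for dx in (0, half):
                    split(y + dy, x + dx, half)
        else:
            patches.append((y, x, size))  # one token per patch

    h, w = image.shape
    for y in range(0, h, max_size):
        for x in range(0, w, max_size):
            split(y, x, max_size)
    return patches

rng = np.random.default_rng(0)
img = np.zeros((32, 32))
img[8:16, 8:16] = rng.standard_normal((8, 8))  # one "salient" region
tokens = quadtree_patches(img)
```

The patch list always tiles the full image, but the token budget concentrates where the saliency score is high, which is the motivation for replacing the uniform grid.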
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.