StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis
- URL: http://arxiv.org/abs/2401.17093v1
- Date: Tue, 30 Jan 2024 15:20:26 GMT
- Title: StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis
- Authors: Zecheng Tang, Chenfei Wu, Zekai Zhang, Mingheng Ni, Shengming Yin, Yu
Liu, Zhengyuan Yang, Lijuan Wang, Zicheng Liu, Juntao Li, Nan Duan
- Abstract summary: StrokeNUWA is a pioneering work exploring a better visual representation, "stroke tokens", for vector graphics.
Equipped with stroke tokens, StrokeNUWA significantly surpasses traditional LLM-based and optimization-based methods.
StrokeNUWA achieves up to a 94x inference speedup over prior methods, with an exceptional SVG code compression ratio of 6.9%.
- Score: 112.25071764647683
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To leverage LLMs for visual synthesis, traditional methods convert raster
image information into discrete grid tokens through specialized visual modules,
which disrupts the model's ability to capture the true semantic representation of
visual scenes. This paper posits that an alternative representation of images,
vector graphics, can effectively surmount this limitation by enabling a more
natural and semantically coherent segmentation of the image information. Thus, we
introduce StrokeNUWA, a pioneering work exploring a better visual representation,
"stroke tokens", for vector graphics, which are inherently rich in visual
semantics, naturally compatible with LLMs, and highly compressed. Equipped with
stroke tokens, StrokeNUWA can significantly surpass traditional LLM-based and
optimization-based methods across various metrics in the vector graphic generation
task. Besides, StrokeNUWA achieves up to a 94x inference speedup over prior
methods, with an exceptional SVG code compression ratio of 6.9%.
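The abstract does not detail the tokenizer itself, so the sketch below only illustrates the general idea of compressing groups of SVG stroke commands into discrete "stroke tokens" via a learned codebook, in the spirit of vector quantization. The StrokeTokenizer module, its dimensions, the codebook size, and the four-commands-per-token grouping are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch, assuming a VQ-style stroke tokenizer: groups of SVG stroke
# commands are encoded and snapped to a discrete codebook, yielding compact
# "stroke tokens" an LLM can consume. All names and sizes are assumptions.
import torch
import torch.nn as nn


class StrokeTokenizer(nn.Module):
    def __init__(self, cmd_dim=8, cmds_per_token=4, latent_dim=64, codebook_size=4096):
        super().__init__()
        # Encode each group of consecutive stroke commands into one latent vector.
        self.encoder = nn.Sequential(
            nn.Linear(cmd_dim * cmds_per_token, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Discrete codebook: each latent is mapped to its nearest entry.
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.cmds_per_token = cmds_per_token

    def forward(self, commands):
        # commands: (batch, num_cmds, cmd_dim) numeric SVG path arguments
        b, n, _ = commands.shape
        groups = commands.reshape(b, n // self.cmds_per_token, -1)
        latents = self.encoder(groups)                    # (b, n/k, latent_dim)
        flat = latents.reshape(-1, latents.size(-1))
        dists = torch.cdist(flat, self.codebook.weight)   # distance to every code
        return dists.argmin(dim=-1).reshape(b, -1)        # integer stroke tokens


if __name__ == "__main__":
    tok = StrokeTokenizer()
    fake_strokes = torch.randn(2, 32, 8)     # 32 path commands, 8 parameters each
    print(tok(fake_strokes).shape)           # torch.Size([2, 8]): 4x fewer tokens
```

A matching decoder (not shown) would map stroke tokens back to SVG commands; the LLM itself would only see the short integer sequence.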
Related papers
- EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM [38.8308841469793]
This paper introduces EasyRef, a novel plug-and-play adaptation method that enables diffusion models to be conditioned on multiple reference images and the text prompt.
We leverage the multi-image comprehension and instruction-following capabilities of the multimodal large language model (MLLM) to exploit consistent visual elements within multiple images.
Experimental results demonstrate that EasyRef surpasses both tuning-free methods like IP-Adapter and tuning-based methods like LoRA, achieving superior aesthetic quality and robust zero-shot generalization across diverse domains.
arXiv Detail & Related papers (2024-12-12T18:59:48Z) - Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
However, the computational cost of processing high-resolution images and videos poses a barrier to their broader adoption.
Compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs.
We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z) - FLIER: Few-shot Language Image Models Embedded with Latent Representations [2.443383032451177]
- FLIER: Few-shot Language Image Models Embedded with Latent Representations [2.443383032451177]
We propose the Few-shot Language Image model Embedded with latent Representations (FLIER) for image recognition.
We first generate images and corresponding latent representations via Stable Diffusion with the textual inputs from GPT-3.
With latent representations as "models-understandable pixels", we introduce a flexible convolutional neural network with two convolutional layers as the latent encoder.
arXiv Detail & Related papers (2024-10-10T06:27:46Z) - Generalizable Entity Grounding via Assistance of Large Language Model [77.07759442298666]
- Generalizable Entity Grounding via Assistance of Large Language Model [77.07759442298666]
We propose a novel approach to densely ground visual entities from a long caption.
We leverage a large multimodal model to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and a multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask.
arXiv Detail & Related papers (2024-02-04T16:06:05Z) - Learning Vision from Models Rivals Learning Vision from Data [54.43596959598465]
- Learning Vision from Models Rivals Learning Vision from Data [54.43596959598465]
We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions.
We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption.
We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs.
arXiv Detail & Related papers (2023-12-28T18:59:55Z) - Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation [122.63617171522316]
- Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation [122.63617171522316]
Large Language Models (LLMs) are the dominant models for generative tasks in language.
We introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images.
arXiv Detail & Related papers (2023-10-09T14:10:29Z) - Hierarchical Text-Conditional Image Generation with CLIP Latents [20.476720970770128]
We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style.
arXiv Detail & Related papers (2022-04-13T01:10:33Z) - Extracting Triangular 3D Models, Materials, and Lighting From Images [59.33666140713829]
We present an efficient method for joint optimization of materials and lighting from multi-view image observations.
We leverage meshes with spatially-varying materials and environment lighting that can be deployed in any traditional graphics engine.
arXiv Detail & Related papers (2021-11-24T13:58:20Z)