Painter: Teaching Auto-regressive Language Models to Draw Sketches
- URL: http://arxiv.org/abs/2308.08520v1
- Date: Wed, 16 Aug 2023 17:18:30 GMT
- Title: Painter: Teaching Auto-regressive Language Models to Draw Sketches
- Authors: Reza Pourreza, Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Pulkit
Madan, Roland Memisevic
- Abstract summary: We present Painter, an LLM that can convert user prompts in text description format to sketches.
We create a dataset of diverse multi-object sketches paired with textual prompts.
Although this is a first, pioneering attempt at using LLMs for auto-regressive image generation, the results are very encouraging.
- Score: 5.3445140425713245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have made tremendous progress in natural
language understanding and they have also been successfully adopted in other
domains such as computer vision, robotics, reinforcement learning, etc. In this
work, we apply LLMs to image generation tasks by directly generating the
virtual brush strokes to paint an image. We present Painter, an LLM that can
convert user prompts in text description format to sketches by generating the
corresponding brush strokes in an auto-regressive way. We construct Painter
based on an off-the-shelf LLM that is pre-trained on a large text corpus, by
fine-tuning it on the new task while preserving its language understanding
capabilities. We create a dataset of diverse multi-object sketches paired with
textual prompts that covers several object types and tasks. Painter can
generate sketches from text descriptions, remove objects from canvas, and
detect and classify objects in sketches. Although this is a first, pioneering
attempt at using LLMs for auto-regressive image generation, the results are
very encouraging.
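To make the generation paradigm concrete, the following minimal Python sketch shows how a fine-tuned causal LLM could emit brush-stroke tokens auto-regressively from a text prompt. The checkpoint path, the prompt template, and the "STROKE x1 y1 x2 y2 ..." output format are illustrative assumptions for this sketch, not details taken from the paper.
```python
# Minimal sketch (not the authors' code): a fine-tuned causal LLM that "draws"
# by generating brush-stroke tokens, which are then parsed into polylines.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/painter-finetuned-llm"  # hypothetical fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def draw(prompt: str) -> list[list[tuple[float, float]]]:
    """Generate brush strokes for a text prompt and return them as polylines."""
    inputs = tokenizer(f"Draw: {prompt}\n", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=512,
                                do_sample=True, top_p=0.9)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    strokes = []
    for line in text.splitlines():
        if line.startswith("STROKE"):  # assumed per-stroke output format
            coords = [float(v) for v in line.split()[1:]]
            strokes.append(list(zip(coords[0::2], coords[1::2])))  # (x, y) points
    return strokes

# strokes = draw("a cat sitting next to a tree")
# Each returned polyline can then be rasterized onto a blank canvas to form the sketch.
```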
Related papers
- PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions [66.92809850624118]
PixWizard is an image-to-image visual assistant designed for image generation, manipulation, and translation based on free-form language instructions.
We cast a variety of vision tasks into a unified image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning dataset.
Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits promising generalization capabilities with unseen tasks and human instructions.
arXiv Detail & Related papers (2024-09-23T17:59:46Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast, high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- ClawMachine: Fetching Visual Tokens as An Entity for Referring and Grounding [67.63933036920012]
Existing methods, including proxy encoding and geometry encoding, incorporate additional syntax to encode the object's location.
This study presents ClawMachine, offering a new methodology that notates an entity directly using the visual tokens.
ClawMachine unifies visual referring and grounding into an auto-regressive format and learns with a decoder-only architecture.
arXiv Detail & Related papers (2024-06-17T08:39:16Z)
- LASER: Tuning-Free LLM-Driven Attention Control for Efficient Text-conditioned Image-to-Animation [62.232361821779335]
We introduce a tuning-free attention control framework, encapsulated by the progressive process of prompt-Aware editing, StablE animation geneRation, abbreviated as LASER.
We manipulate the model's spatial features and self-attention mechanisms to maintain animation integrity.
Our meticulous control over spatial features and self-attention ensures structural consistency in the images.
arXiv Detail & Related papers (2024-04-21T07:13:56Z)
- Beyond Text: Frozen Large Language Models in Visual Signal Comprehension [34.398976855955404]
Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, transforms an image into a "foreign language" with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model (see the illustrative sketch after this list).
We undertake rigorous experiments to validate our method, encompassing understanding tasks like image recognition, image captioning, and visual question answering, as well as image denoising tasks like inpainting, outpainting, deblurring, and shift restoration.
arXiv Detail & Related papers (2024-03-12T17:59:51Z)
- Towards Language-Driven Video Inpainting via Multimodal Large Language Models [116.22805434658567]
We introduce a new task -- language-driven video inpainting.
It uses natural language instructions to guide the inpainting process.
We present the Remove Objects from Videos by Instructions dataset.
arXiv Detail & Related papers (2024-01-18T18:59:13Z)
- LLMGA: Multimodal Large Language Model based Generation Assistant [53.150283805515926]
We introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA) to assist users in image generation and editing.
We train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts.
Extensive results show that LLMGA has promising generation and editing capabilities and can enable more flexible and expansive applications.
arXiv Detail & Related papers (2023-11-27T13:37:26Z)
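The illustrative sketch referenced above: several of these related works (for example, the V2T Tokenizer and ClawMachine) share the idea of mapping visual content into an LLM's own token space so a frozen or decoder-only language model can process it. The nearest-neighbour assignment below is a deliberately simplified stand-in for that idea, not any paper's actual method; the tensor shapes and the use of cosine similarity are assumptions.
```python
# Simplified illustration: assign image-patch features to the nearest tokens
# in a frozen LLM's vocabulary embedding table.
import torch

def visual_to_vocab_tokens(patch_features: torch.Tensor,
                           vocab_embeddings: torch.Tensor) -> torch.Tensor:
    """Map each image-patch feature to its nearest LLM vocabulary embedding.

    patch_features:   (num_patches, dim) features from any vision encoder.
    vocab_embeddings: (vocab_size, dim) the frozen LLM's input embedding table.
    Returns (num_patches,) token IDs that the frozen LLM can consume as text.
    """
    patches = torch.nn.functional.normalize(patch_features, dim=-1)
    vocab = torch.nn.functional.normalize(vocab_embeddings, dim=-1)
    similarity = patches @ vocab.T   # cosine similarity, (num_patches, vocab_size)
    return similarity.argmax(dim=-1)  # ID of the most similar token per patch

# Toy usage with random tensors standing in for a real encoder and embedding table:
ids = visual_to_vocab_tokens(torch.randn(16, 768), torch.randn(32000, 768))
print(ids.shape)  # torch.Size([16])
```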