Painter: Teaching Auto-regressive Language Models to Draw Sketches
- URL: http://arxiv.org/abs/2308.08520v1
- Date: Wed, 16 Aug 2023 17:18:30 GMT
- Title: Painter: Teaching Auto-regressive Language Models to Draw Sketches
- Authors: Reza Pourreza, Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Pulkit
Madan, Roland Memisevic
- Abstract summary: We present Painter, an LLM that can convert user prompts in text description format to sketches.
We create a dataset of diverse multi-object sketches paired with textual prompts.
Although this is a first, pioneering attempt at using LLMs for auto-regressive image generation, the results are very encouraging.
- Score: 5.3445140425713245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have made tremendous progress in natural
language understanding and they have also been successfully adopted in other
domains such as computer vision, robotics, reinforcement learning, etc. In this
work, we apply LLMs to image generation tasks by directly generating the
virtual brush strokes to paint an image. We present Painter, an LLM that can
convert user prompts in text description format to sketches by generating the
corresponding brush strokes in an auto-regressive way. We construct Painter
based on an off-the-shelf LLM that is pre-trained on a large text corpus, by
fine-tuning it on the new task while preserving its language understanding
capabilities. We create a dataset of diverse multi-object sketches paired with
textual prompts that covers several object types and tasks. Painter can
generate sketches from text descriptions, remove objects from canvas, and
detect and classify objects in sketches. Although this is a first, pioneering
attempt at using LLMs for auto-regressive image generation, the results are
very encouraging.
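To make the generation paradigm concrete, the following minimal Python sketch shows how a fine-tuned causal LLM could emit brush-stroke tokens auto-regressively from a text prompt. The checkpoint path, the prompt template, and the "STROKE x1 y1 x2 y2 ..." output format are illustrative assumptions for this sketch, not details taken from the paper.
```python
# Minimal sketch (not the authors' code): a fine-tuned causal LLM that "draws"
# by generating brush-stroke tokens, which are then parsed into polylines.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/painter-finetuned-llm"  # hypothetical fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def draw(prompt: str) -> list[list[tuple[float, float]]]:
    """Generate brush strokes for a text prompt and return them as polylines."""
    inputs = tokenizer(f"Draw: {prompt}\n", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=512,
                                do_sample=True, top_p=0.9)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    strokes = []
    for line in text.splitlines():
        if line.startswith("STROKE"):  # assumed per-stroke output format
            coords = [float(v) for v in line.split()[1:]]
            strokes.append(list(zip(coords[0::2], coords[1::2])))  # (x, y) points
    return strokes

# strokes = draw("a cat sitting next to a tree")
# Each returned polyline can then be rasterized onto a blank canvas to form the sketch.
```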
Related papers
- PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions [66.92809850624118]
PixWizard is an image-to-image visual assistant designed for image generation, manipulation, and translation based on free-form language instructions.
We cast a variety of vision tasks into a unified image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning dataset.
Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits promising generalization capabilities with unseen tasks and human instructions.
arXiv Detail & Related papers (2024-09-23T17:59:46Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast, high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- ClawMachine: Fetching Visual Tokens as An Entity for Referring and Grounding [67.63933036920012]
Existing methods, including proxy encoding and geometry encoding, incorporate additional syntax to encode the object's location.
This study presents ClawMachine, offering a new methodology that notates an entity directly using the visual tokens.
ClawMachine unifies visual referring and grounding into an auto-regressive format and learns with a decoder-only architecture.
arXiv Detail & Related papers (2024-06-17T08:39:16Z)
- LASER: Tuning-Free LLM-Driven Attention Control for Efficient Text-conditioned Image-to-Animation [62.232361821779335]
We introduce a tuning-free attention control framework, encapsulated by the progressive process of prompt-Aware editing, StablE animation geneRation, abbreviated as LASER.
We manipulate the model's spatial features and self-attention mechanisms to maintain animation integrity.
Our meticulous control over spatial features and self-attention ensures structural consistency in the images.
arXiv Detail & Related papers (2024-04-21T07:13:56Z)
- Beyond Text: Frozen Large Language Models in Visual Signal Comprehension [34.398976855955404]
Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, transforms an image into a "foreign language" with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model (see the illustrative sketch after this list).
We undertake rigorous experiments to validate our method, encompassing understanding tasks like image recognition, image captioning, and visual question answering, as well as image denoising tasks like inpainting, outpainting, deblurring, and shift restoration.
arXiv Detail & Related papers (2024-03-12T17:59:51Z)
- Towards Language-Driven Video Inpainting via Multimodal Large Language Models [116.22805434658567]
We introduce a new task -- language-driven video inpainting.
It uses natural language instructions to guide the inpainting process.
We present the Remove Objects from Videos by Instructions dataset.
arXiv Detail & Related papers (2024-01-18T18:59:13Z)
- LLMGA: Multimodal Large Language Model based Generation Assistant [53.150283805515926]
We introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA) to assist users in image generation and editing.
We train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts.
Extensive results show that LLMGA has promising generation and editing capabilities and can enable more flexible and expansive applications.
arXiv Detail & Related papers (2023-11-27T13:37:26Z)
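The illustrative sketch referenced above: several of these related works (for example, the V2T Tokenizer and ClawMachine) share the idea of mapping visual content into an LLM's own token space so a frozen or decoder-only language model can process it. The nearest-neighbour assignment below is a deliberately simplified stand-in for that idea, not any paper's actual method; the tensor shapes and the use of cosine similarity are assumptions.
```python
# Simplified illustration: assign image-patch features to the nearest tokens
# in a frozen LLM's vocabulary embedding table.
import torch

def visual_to_vocab_tokens(patch_features: torch.Tensor,
                           vocab_embeddings: torch.Tensor) -> torch.Tensor:
    """Map each image-patch feature to its nearest LLM vocabulary embedding.

    patch_features:   (num_patches, dim) features from any vision encoder.
    vocab_embeddings: (vocab_size, dim) the frozen LLM's input embedding table.
    Returns (num_patches,) token IDs that the frozen LLM can consume as text.
    """
    patches = torch.nn.functional.normalize(patch_features, dim=-1)
    vocab = torch.nn.functional.normalize(vocab_embeddings, dim=-1)
    similarity = patches @ vocab.T   # cosine similarity, (num_patches, vocab_size)
    return similarity.argmax(dim=-1)  # ID of the most similar token per patch

# Toy usage with random tensors standing in for a real encoder and embedding table:
ids = visual_to_vocab_tokens(torch.randn(16, 768), torch.randn(32000, 768))
print(ids.shape)  # torch.Size([16])
```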