Images Speak in Images: A Generalist Painter for In-Context Visual
Learning
- URL: http://arxiv.org/abs/2212.02499v2
- Date: Fri, 24 Mar 2023 07:10:47 GMT
- Title: Images Speak in Images: A Generalist Painter for In-Context Visual
Learning
- Authors: Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, Tiejun Huang
- Abstract summary: In-context learning allows the model to rapidly adapt to various tasks with only a handful of prompts and examples.
It is unclear how to define the general-purpose task prompts that the vision model can understand and transfer to out-of-domain tasks.
We present Painter, a generalist model which redefines the outputs of core vision tasks as images and specifies task prompts as images as well.
- Score: 98.78475432114595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In-context learning, as a new paradigm in NLP, allows the model to
rapidly adapt to various tasks with only a handful of prompts and examples. But
in computer vision, the difficulty of in-context learning lies in the fact that
tasks vary significantly in their output representations, so it is unclear how
to define general-purpose task prompts that a vision model can understand and
transfer to out-of-domain tasks. In this work, we present Painter, a generalist
model which addresses these obstacles with an "image"-centric solution: it
redefines the outputs of core vision tasks as images and specifies task prompts
as images as well. With this idea, the training process is extremely simple: it
performs standard masked image modeling on stitched pairs of input and output
images, which teaches the model to complete tasks conditioned on the visible
image patches. During inference, a pair of input and output images from the
same task serves as the input condition, indicating which task to perform.
Without bells and whistles, our generalist Painter achieves competitive
performance compared with well-established task-specific models on seven
representative vision tasks, ranging from high-level visual understanding to
low-level image processing. In addition, Painter significantly outperforms
recent generalist models on several challenging tasks.
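The recipe is simple enough to sketch in code. Below is a minimal, illustrative PyTorch sketch of the training and inference loop, assuming a generic shape-preserving image-to-image network `model` and a simple side-by-side/stacked canvas layout; the layout, the per-pixel masking, and the network itself are assumptions for illustration, not the paper's exact architecture or patching scheme.

```python
import torch
import torch.nn.functional as F

def stitch(img_in, img_out):
    # Place an input image and its output image side by side on one canvas.
    return torch.cat([img_in, img_out], dim=-1)            # (B, C, H, 2W)

def training_step(model, task_in, task_out, mask_ratio=0.75):
    # Masked image modeling on the stitched pair: hide random regions and
    # regress their pixels, so producing the output becomes inpainting.
    canvas = stitch(task_in, task_out)
    mask = (torch.rand_like(canvas[:, :1]) < mask_ratio).float()
    pred = model(canvas * (1.0 - mask))
    loss = (F.mse_loss(pred, canvas, reduction="none") * mask).sum()
    return loss / mask.sum().clamp(min=1.0)

@torch.no_grad()
def in_context_predict(model, prompt_in, prompt_out, query_in):
    # Stack a solved example above the query and blank the query's output
    # slot; the region the model inpaints is the task prediction.
    top = stitch(prompt_in, prompt_out)
    bottom = stitch(query_in, torch.zeros_like(query_in))
    canvas = torch.cat([top, bottom], dim=-2)              # (B, C, 2H, 2W)
    pred = model(canvas)
    H, W = query_in.shape[-2:]
    return pred[..., H:, W:]                               # predicted output
```

With a shape-preserving stand-in such as torch.nn.Conv2d(3, 3, 3, padding=1), both functions run end to end; the actual backbone and patch-wise masking granularity differ, of course.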
Related papers
- Learning A Low-Level Vision Generalist via Visual Task Prompt [43.54563263106761]
We propose a Visual task Prompt-based Image Processing (VPIP) framework to overcome the challenges of handling diverse low-level vision tasks in a single model.
VPIP employs visual task prompts to manage tasks with different input-target domains and allows flexible selection of the backbone network.
Based on the VPIP framework, we train a low-level vision generalist model, namely GenLV, on 30 diverse tasks.
arXiv Detail & Related papers (2024-08-16T08:37:56Z) - MOWA: Multiple-in-One Image Warping Model [65.73060159073644]
In this work, we propose a Multiple-in-One image warping model, named MOWA.
We mitigate the difficulty of multi-task learning by disentangling the motion estimation at both the region level and pixel level.
To our knowledge, this is the first work that solves multiple practical warping tasks in a single model.
arXiv Detail & Related papers (2024-04-16T16:50:35Z) - Unifying Image Processing as Visual Prompting Question Answering [62.84955983910612]
Image processing is a fundamental task in computer vision, which aims at enhancing image quality and extracting essential features for subsequent vision applications.
Traditionally, task-specific models are developed for individual tasks and designing such models requires distinct expertise.
We propose a universal model for general image processing that covers image restoration, image enhancement, and image feature extraction tasks.
arXiv Detail & Related papers (2023-10-16T15:32:57Z) - InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices.
Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z) - Tuning computer vision models with task rewards [88.45787930908102]
Misalignment between model predictions and intended usage can be detrimental for the deployment of computer vision models.
In natural language processing, this is often addressed using reinforcement learning techniques that align models with a task reward.
We adopt this approach and show its surprising effectiveness across multiple computer vision tasks, such as object detection, panoptic segmentation, colorization and image captioning.
arXiv Detail & Related papers (2023-02-16T11:49:48Z) - Task Bias in Vision-Language Models [18.025004053980545]
- Task Bias in Vision-Language Models [18.025004053980545]
We explore the CLIP model and show that its visual representation is often strongly biased towards solving some tasks more than others.
To resolve this task bias, we show how to learn a visual prompt that guides the representation towards features relevant to the task of interest.
arXiv Detail & Related papers (2022-12-08T17:10:31Z) - A Unified Sequence Interface for Vision Tasks [87.328893553186]
- A Unified Sequence Interface for Vision Tasks [87.328893553186]
We show that a diverse set of "core" computer vision tasks can be unified if formulated in terms of a shared pixel-to-sequence interface.
We focus on four tasks, namely, object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs.
We show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization.
arXiv Detail & Related papers (2022-06-15T17:08:53Z) - Generative Modeling for Multi-task Visual Learning [40.96212750592383]
- Generative Modeling for Multi-task Visual Learning [40.96212750592383]
We consider a novel problem of learning a shared generative model that is useful across various visual perception tasks.
We propose a general multi-task oriented generative modeling framework, by coupling a discriminative multi-task network with a generative network.
Our framework consistently outperforms state-of-the-art multi-task approaches.
arXiv Detail & Related papers (2021-06-25T03:42:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.