Unifying Image Processing as Visual Prompting Question Answering
- URL: http://arxiv.org/abs/2310.10513v2
- Date: Wed, 21 Feb 2024 03:31:39 GMT
- Title: Unifying Image Processing as Visual Prompting Question Answering
- Authors: Yihao Liu, Xiangyu Chen, Xianzheng Ma, Xintao Wang, Jiantao Zhou, Yu
Qiao, Chao Dong
- Abstract summary: Image processing is a fundamental task in computer vision, which aims at enhancing image quality and extracting essential features for subsequent vision applications.
Traditionally, task-specific models are developed for individual tasks and designing such models requires distinct expertise.
We propose a universal model for general image processing that covers image restoration, image enhancement, and image feature extraction tasks.
- Score: 62.84955983910612
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image processing is a fundamental task in computer vision, which aims at
enhancing image quality and extracting essential features for subsequent vision
applications. Traditionally, task-specific models are developed for individual
tasks and designing such models requires distinct expertise. Building upon the
success of large language models (LLMs) in natural language processing (NLP),
there is a similar trend in computer vision, which focuses on developing
large-scale models through pretraining and in-context learning. This paradigm
shift reduces the reliance on task-specific models, yielding a powerful unified
model to deal with various tasks. However, these advances have predominantly
concentrated on high-level vision tasks, with less attention paid to low-level
vision tasks. To address this issue, we propose a universal model for general
image processing that covers image restoration, image enhancement, and image
feature extraction, among other tasks. Our proposed framework, named PromptGIP, unifies
these diverse image processing tasks within a universal framework. Inspired by
NLP question answering (QA) techniques, we employ a visual prompting question
answering paradigm. Specifically, we treat the input-output image pair as a
structured question-answer sentence, thereby reprogramming the image processing
task as a prompting QA problem. PromptGIP can undertake diverse cross-domain
tasks using provided visual prompts, eliminating the need for task-specific
finetuning. Our methodology offers a universal and adaptive solution to general
image processing. While PromptGIP has demonstrated a certain degree of
out-of-domain task generalization capability, further research is needed to
fully explore its emergent generalization.
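As a concrete illustration of the visual prompting QA idea described above, the sketch below flattens the prompt input/output pair and the query image into patch tokens, concatenates them like a question sentence, and lets a transformer predict the answer tokens for the query. The tiny architecture, patch size, and all names here are illustrative assumptions, not the actual PromptGIP implementation.

```python
# Illustrative sketch (not the official PromptGIP code): frame image-to-image
# processing as visual prompting QA. The prompt input/target pair plus the
# query image form the "question"; the model predicts the "answer" image.
import torch
import torch.nn as nn

class VisualPromptQA(nn.Module):
    def __init__(self, img_size=64, patch=8, dim=256, depth=4, heads=8):
        super().__init__()
        self.patch = patch
        self.n_tokens = (img_size // patch) ** 2            # tokens per image
        self.embed = nn.Linear(3 * patch * patch, dim)       # patch -> token
        self.pos = nn.Parameter(torch.zeros(3 * self.n_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 3 * patch * patch)         # token -> patch

    def patchify(self, x):
        # (B, 3, H, W) -> (B, N, 3*patch*patch)
        p = self.patch
        return (x.unfold(2, p, p).unfold(3, p, p)
                 .permute(0, 2, 3, 1, 4, 5).flatten(3).flatten(1, 2))

    def forward(self, prompt_in, prompt_out, query):
        # Concatenate the "question": prompt pair + query, like a QA sentence.
        tokens = torch.cat([self.patchify(prompt_in),
                            self.patchify(prompt_out),
                            self.patchify(query)], dim=1)
        h = self.encoder(self.embed(tokens) + self.pos)
        # Decode only the query positions as the "answer" image patches.
        return self.head(h[:, -self.n_tokens:])   # (B, N, 3*patch*patch)

# Usage: the prompt pair defines the task (e.g., noisy -> clean); the model is
# expected to apply the same mapping to the query image.
model = VisualPromptQA()
x = torch.randn(1, 3, 64, 64)
print(model(prompt_in=x, prompt_out=x, query=x).shape)  # torch.Size([1, 64, 192])
```

In this framing, swapping the prompt pair (e.g., from a denoising example to an edge-detection example) changes the task without any finetuning, which is the behavior the abstract describes.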
Related papers
- Learning A Low-Level Vision Generalist via Visual Task Prompt [43.54563263106761]
We propose a Visual task Prompt-based Image Processing (VPIP) framework to overcome these challenges.
VPIP employs visual task prompts to manage tasks with different input-target domains and allows flexible selection of backbone network.
Based on the VPIP framework, we train a low-level vision generalist model, namely GenLV, on 30 diverse tasks.
arXiv Detail & Related papers (2024-08-16T08:37:56Z)
- Multi-Expert Adaptive Selection: Task-Balancing for All-in-One Image Restoration [20.04384107349706]
We propose a multi-expert adaptive selection mechanism for multi-task image restoration.
The scheme adaptively selects the most suitable expert from the expert library according to the content of the input image and the prompts of the current task.
Experimental results demonstrate that our proposed method is both effective and superior to existing approaches.
arXiv Detail & Related papers (2024-07-27T01:13:07Z)
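As a rough illustration of the adaptive expert selection described in this entry, the sketch below scores a small expert library from pooled image features and a task-prompt embedding, then routes each input to the highest-scoring expert. All module names and shapes are assumptions; the paper's actual selection mechanism may differ.

```python
# Illustrative sketch only: route each input to one expert restoration network
# based on image content and a task prompt embedding (names are assumptions).
import torch
import torch.nn as nn

class AdaptiveExpertSelector(nn.Module):
    def __init__(self, n_experts=4, prompt_dim=32, feat_dim=64):
        super().__init__()
        # Expert library: each "expert" here is a tiny conv net standing in
        # for a task-specific restoration branch.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 3, 3, padding=1))
            for _ in range(n_experts))
        self.content = nn.Sequential(nn.Conv2d(3, feat_dim, 3, padding=1),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.gate = nn.Linear(feat_dim + prompt_dim, n_experts)

    def forward(self, image, task_prompt):
        # Score every expert from pooled image features + the task prompt.
        scores = self.gate(torch.cat([self.content(image), task_prompt], dim=-1))
        idx = scores.argmax(dim=-1)   # hard selection; a soft mixture also works
        out = torch.stack([self.experts[i](img.unsqueeze(0)).squeeze(0)
                           for i, img in zip(idx.tolist(), image)])
        return out, idx

selector = AdaptiveExpertSelector()
restored, chosen = selector(torch.randn(2, 3, 32, 32), torch.randn(2, 32))
print(restored.shape, chosen)
```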
arXiv Detail & Related papers (2024-07-27T01:13:07Z) - PromptFix: You Prompt and We Fix the Photo [84.69812824355269]
Diffusion models equipped with language models demonstrate excellent controllability in image generation tasks.
However, the lack of diverse instruction-following data hampers the development of such models.
We propose PromptFix, a framework that enables diffusion models to follow human instructions.
arXiv Detail & Related papers (2024-05-27T03:13:28Z)
- Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
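As a simplified stand-in for representing a binary mask as a coordinate sequence, the sketch below subsamples an already-extracted mask contour, keeping more points where the boundary turns sharply. This only approximates the spirit of the gradient-aware adaptive sampling named in this entry; the function and its parameters are assumptions, not the paper's method.

```python
# Simplified stand-in: turn a binary mask's contour into a short coordinate
# sequence, keeping more points where the boundary bends (adaptive sampling).
import numpy as np

def adaptive_contour_sample(contour: np.ndarray, n_points: int = 16) -> np.ndarray:
    """contour: (N, 2) ordered boundary points of a binary mask."""
    # Turning angle at each point serves as a crude 'importance' signal.
    prev_vec = contour - np.roll(contour, 1, axis=0)
    next_vec = np.roll(contour, -1, axis=0) - contour
    cos = np.sum(prev_vec * next_vec, axis=1) / (
        np.linalg.norm(prev_vec, axis=1) * np.linalg.norm(next_vec, axis=1) + 1e-8)
    importance = 1.0 - np.clip(cos, -1.0, 1.0)   # high where the contour turns
    importance += 1e-3                            # keep some mass everywhere
    # Place samples where the cumulative importance is evenly divided.
    cdf = np.cumsum(importance) / np.sum(importance)
    targets = (np.arange(n_points) + 0.5) / n_points
    idx = np.searchsorted(cdf, targets)
    return contour[np.clip(idx, 0, len(contour) - 1)]

# Toy example: a square contour; the corners should attract the samples.
square = np.array([(x, 0) for x in range(10)] + [(9, y) for y in range(1, 10)] +
                  [(x, 9) for x in range(8, -1, -1)] + [(0, y) for y in range(8, 0, -1)])
print(adaptive_contour_sample(square, n_points=8))
```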
- InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices.
Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z)
- Tuning computer vision models with task rewards [88.45787930908102]
Misalignment between model predictions and intended usage can be detrimental for the deployment of computer vision models.
In natural language processing, this is often addressed using reinforcement learning techniques that align models with a task reward.
We adopt this approach and show its surprising effectiveness across multiple computer vision tasks, such as object detection, panoptic segmentation, colorization and image captioning.
arXiv Detail & Related papers (2023-02-16T11:49:48Z)
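The reinforcement-learning idea in this entry can be illustrated with a REINFORCE-style update that raises the log-probability of sampled predictions in proportion to a (possibly non-differentiable) task reward. The toy model and reward below are assumptions, not the paper's recipe.

```python
# Illustrative REINFORCE-style sketch of tuning a vision model with a task
# reward (toy model and reward; not the paper's exact method).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 10))  # toy "vision model"
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

def task_reward(pred, target):
    # Stand-in for a non-differentiable task metric (mAP, PQ, CIDEr, ...):
    # here simply 1.0 for a correct label, 0.0 otherwise.
    return (pred == target).float()

images = torch.randn(32, 3, 16, 16)
targets = torch.randint(0, 10, (32,))

for step in range(100):
    logits = model(images)
    dist = torch.distributions.Categorical(logits=logits)
    sample = dist.sample()                 # sample predictions to score
    reward = task_reward(sample, targets)
    baseline = reward.mean()               # simple baseline for variance reduction
    loss = -((reward - baseline) * dist.log_prob(sample)).mean()
    optim.zero_grad()
    loss.backward()
    optim.step()

print("mean reward:", task_reward(model(images).argmax(-1), targets).mean().item())
```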
- Images Speak in Images: A Generalist Painter for In-Context Visual Learning [98.78475432114595]
In-context learning allows the model to rapidly adapt to various tasks with only a handful of prompts and examples.
It is unclear how to define the general-purpose task prompts that the vision model can understand and transfer to out-of-domain tasks.
We present Painter, a generalist model that redefines the output of core vision tasks as images and specifies task prompts as images as well.
arXiv Detail & Related papers (2022-12-05T18:59:50Z)
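One plausible, simplified way to set up the images-as-prompts formulation in this entry is to stitch the prompt input/output pair and the query into a single canvas and ask a model to fill in the missing output quadrant. The layout and names below are assumptions for illustration, not necessarily Painter's exact training setup.

```python
# Illustration of an image-as-prompt canvas (the layout is an assumption, not
# necessarily Painter's exact setup): a model must complete the blank quadrant.
import torch

def build_prompt_canvas(prompt_in, prompt_out, query):
    """Each tensor is (3, H, W). Returns a (3, 2H, 2W) canvas and a boolean
    mask marking the region the model should predict (the query's output)."""
    _, h, w = query.shape
    canvas = torch.zeros(3, 2 * h, 2 * w)
    canvas[:, :h, :w] = prompt_in        # top-left: task example input
    canvas[:, :h, w:] = prompt_out       # top-right: task example output
    canvas[:, h:, :w] = query            # bottom-left: query input
    target_mask = torch.zeros(1, 2 * h, 2 * w, dtype=torch.bool)
    target_mask[:, h:, w:] = True        # bottom-right: to be predicted
    return canvas, target_mask

canvas, mask = build_prompt_canvas(torch.rand(3, 32, 32), torch.rand(3, 32, 32),
                                   torch.rand(3, 32, 32))
print(canvas.shape, mask.sum().item())   # torch.Size([3, 64, 64]) 1024
```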
- Generative Modeling for Multi-task Visual Learning [40.96212750592383]
We consider a novel problem of learning a shared generative model that is useful across various visual perception tasks.
We propose a general multi-task oriented generative modeling framework, by coupling a discriminative multi-task network with a generative network.
Our framework consistently outperforms state-of-the-art multi-task approaches.
arXiv Detail & Related papers (2021-06-25T03:42:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.