Related papers: Learning A Low-Level Vision Generalist via Visual Task Prompt

Learning A Low-Level Vision Generalist via Visual Task Prompt

URL: http://arxiv.org/abs/2408.08601v1
Date: Fri, 16 Aug 2024 08:37:56 GMT
Title: Learning A Low-Level Vision Generalist via Visual Task Prompt
Authors: Xiangyu Chen, Yihao Liu, Yuandong Pu, Wenlong Zhang, Jiantao Zhou, Yu Qiao, Chao Dong,
Abstract summary: We propose a Visual task Prompt-based Image Processing (VPIP) framework to overcome these challenges. VPIP employs visual task prompts to manage tasks with different input-target domains and allows flexible selection of backbone network. Based on the VPIP framework, we train a low-level vision generalist model, namely GenLV, on 30 diverse tasks.
Score: 43.54563263106761
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Building a unified model for general low-level vision tasks holds significant research and practical value. Current methods encounter several critical issues. Multi-task restoration approaches can address multiple degradation-to-clean restoration tasks, while their applicability to tasks with different target domains (e.g., image stylization) is limited. Methods like PromptGIP can handle multiple input-target domains but rely on the Masked Autoencoder (MAE) paradigm. Consequently, they are tied to the ViT architecture, resulting in suboptimal image reconstruction quality. In addition, these methods are sensitive to prompt image content and often struggle with low-frequency information processing. In this paper, we propose a Visual task Prompt-based Image Processing (VPIP) framework to overcome these challenges. VPIP employs visual task prompts to manage tasks with different input-target domains and allows flexible selection of backbone network suitable for general tasks. Besides, a new prompt cross-attention is introduced to facilitate interaction between the input and prompt information. Based on the VPIP framework, we train a low-level vision generalist model, namely GenLV, on 30 diverse tasks. Experimental results show that GenLV can successfully address a variety of low-level tasks, significantly outperforming existing methods both quantitatively and qualitatively. Codes are available at https://github.com/chxy95/GenLV.

Related papers

Test-Time Visual In-Context Tuning [85.62916644835902]
Visual in-context learning (VICL) allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. While effective, the existing VICL paradigm exhibits poor generalizability under distribution shifts. We propose test-time Visual In-Context Tuning (VICT), a method that can adapt VICL models on the fly with a single test sample.
arXiv Detail & Related papers (2025-03-27T17:59:52Z)
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment [58.94611347128066]
Task Preference Optimization (TPO) is a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks. By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance. Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models.
arXiv Detail & Related papers (2024-12-26T18:56:05Z)
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks. It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences. We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
Unifying Image Processing as Visual Prompting Question Answering [62.84955983910612]
Image processing is a fundamental task in computer vision, which aims at enhancing image quality and extracting essential features for subsequent vision applications. Traditionally, task-specific models are developed for individual tasks and designing such models requires distinct expertise. We propose a universal model for general image processing that covers image restoration, image enhancement, image feature extraction tasks.
arXiv Detail & Related papers (2023-10-16T15:32:57Z)
Explicit Visual Prompting for Universal Foreground Segmentations [55.51869354956533]
We present a unified framework for a number of foreground segmentation tasks without any task-specific designs. We take inspiration from the widely-used pre-training and then prompt tuning protocols in NLP. Our method freezes a pre-trained model and then learns task-specific knowledge using a few extra parameters.
arXiv Detail & Related papers (2023-05-29T11:05:01Z)
Images Speak in Images: A Generalist Painter for In-Context Visual Learning [98.78475432114595]
In-context learning allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. It is unclear how to define the general-purpose task prompts that the vision model can understand and transfer to out-of-domain tasks. We present Painter, a generalist model which redefines the output of core vision tasks as images, and specify task prompts as also images.
arXiv Detail & Related papers (2022-12-05T18:59:50Z)
Multitask Vision-Language Prompt Tuning [103.5967011236282]
We propose multitask vision-language prompt tuning (MV) MV incorporates cross-task knowledge into prompt tuning for vision-language models. Results in 20 vision tasks demonstrate that the proposed approach outperforms all single-task baseline prompt tuning methods.
arXiv Detail & Related papers (2022-11-21T18:41:44Z)
Prompt Tuning with Soft Context Sharing for Vision-Language Models [42.61889428498378]
We propose a novel method to tune pre-trained vision-language models on multiple target few-shot tasks jointly. We show that SoftCPT significantly outperforms single-task prompt tuning methods.
arXiv Detail & Related papers (2022-08-29T10:19:10Z)
Generative Modeling for Multi-task Visual Learning [40.96212750592383]
We consider a novel problem of learning a shared generative model that is useful across various visual perception tasks. We propose a general multi-task oriented generative modeling framework, by coupling a discriminative multi-task network with a generative network. Our framework consistently outperforms state-of-the-art multi-task approaches.
arXiv Detail & Related papers (2021-06-25T03:42:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.