Related papers: Tuning computer vision models with task rewards

URL: http://arxiv.org/abs/2302.08242v1
Date: Thu, 16 Feb 2023 11:49:48 GMT
Title: Tuning computer vision models with task rewards
Authors: André Susano Pinto, Alexander Kolesnikov, Yuge Shi, Lucas Beyer, Xiaohua Zhai
Abstract summary: Misalignment between model predictions and intended usage can be detrimental for the deployment of computer vision models. In natural language processing, this is often addressed using reinforcement learning techniques that align models with a task reward. We adopt this approach and show its surprising effectiveness across multiple computer vision tasks, such as object detection, panoptic segmentation, colorization and image captioning.
Score: 88.45787930908102
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Misalignment between model predictions and intended usage can be detrimental for the deployment of computer vision models. The issue is exacerbated when the task involves complex structured outputs, as it becomes harder to design procedures which address this misalignment. In natural language processing, this is often addressed using reinforcement learning techniques that align models with a task reward. We adopt this approach and show its surprising effectiveness across multiple computer vision tasks, such as object detection, panoptic segmentation, colorization and image captioning. We believe this approach has the potential to be widely useful for better aligning models with a diverse range of computer vision tasks.

Related papers

CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning [61.21923643289266]
Chain of Manipulations is a mechanism that enables Vision-Language Models to solve problems step-by-step with evidence. After training, models can solve various visual problems by eliciting intrinsic manipulations (e.g., grounding, zoom in) actively without involving external tools. Our trained model, CogCoM, achieves state-of-the-art performance across 9 benchmarks from 4 categories.
arXiv Detail & Related papers (2024-02-06T18:43:48Z)
Unifying Image Processing as Visual Prompting Question Answering [62.84955983910612]
Image processing is a fundamental task in computer vision, which aims at enhancing image quality and extracting essential features for subsequent vision applications. Traditionally, task-specific models are developed for individual tasks, and designing such models requires distinct expertise. We propose a universal model for general image processing that covers image restoration, image enhancement, and image feature extraction tasks.
arXiv Detail & Related papers (2023-10-16T15:32:57Z)
InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices. Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z)
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks [52.981128371910266]
We present InstructDiffusion, a framework for aligning computer vision tasks with human instructions. InstructDiffusion can handle a variety of vision tasks, including understanding tasks and generative tasks. It even exhibits the ability to handle unseen tasks and outperforms prior methods on novel datasets.
arXiv Detail & Related papers (2023-09-07T17:56:57Z)
Images Speak in Images: A Generalist Painter for In-Context Visual Learning [98.78475432114595]
In-context learning allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. It is unclear how to define the general-purpose task prompts that the vision model can understand and transfer to out-of-domain tasks. We present Painter, a generalist model which redefines the output of core vision tasks as images, and specifies task prompts as images as well.
arXiv Detail & Related papers (2022-12-05T18:59:50Z)
UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes [91.24112204588353]
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks. In contrast to previous models, UViM has the same functional form for all tasks. We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks.
arXiv Detail & Related papers (2022-05-20T17:47:59Z)
Generative Modeling for Multi-task Visual Learning [40.96212750592383]
We consider a novel problem of learning a shared generative model that is useful across various visual perception tasks. We propose a general multi-task oriented generative modeling framework, by coupling a discriminative multi-task network with a generative network. Our framework consistently outperforms state-of-the-art multi-task approaches.
arXiv Detail & Related papers (2021-06-25T03:42:59Z)
Cross-View Image Synthesis with Deformable Convolution and Attention Mechanism [29.528402825356398]
We propose to use Generative Adversarial Networks (GANs) based on a deformable convolution and attention mechanism to solve the problem of cross-view image synthesis. It is difficult to understand and transform scene appearance and semantic information from another view, so we use deformable convolution in the U-net network to improve the network's ability to extract features of objects at different scales.
arXiv Detail & Related papers (2020-07-20T03:08:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.