A Unified Sequence Interface for Vision Tasks
- URL: http://arxiv.org/abs/2206.07669v1
- Date: Wed, 15 Jun 2022 17:08:53 GMT
- Title: A Unified Sequence Interface for Vision Tasks
- Authors: Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J. Fleet,
Geoffrey Hinton
- Abstract summary: We show that a diverse set of "core" computer vision tasks can be unified if formulated in terms of a shared pixel-to-sequence interface.
We focus on four tasks, namely, object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs.
We show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization.
- Score: 87.328893553186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While language tasks are naturally expressed in a single, unified
modeling framework, i.e., generating sequences of tokens, this has not been the case in
computer vision. As a result, there is a proliferation of distinct
architectures and loss functions for different vision tasks. In this work we
show that a diverse set of "core" computer vision tasks can also be unified if
formulated in terms of a shared pixel-to-sequence interface. We focus on four
tasks, namely, object detection, instance segmentation, keypoint detection, and
image captioning, all with diverse types of outputs, e.g., bounding boxes or
dense masks. Nevertheless, by formulating the output of each task as a sequence
of discrete tokens with a unified interface, we show that one can train a
neural network with a single model architecture and loss function on all these
tasks, with no task-specific customization. To solve a specific task, we use a
short prompt as task description, and the sequence output adapts to the prompt
so it can produce task-specific output. We show that such a model can achieve
competitive performance compared to well-established task-specific models.
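As a rough illustration of this pixel-to-sequence interface, the sketch below serializes detection targets as discrete tokens behind a task-prompt token. The bin count, vocabulary layout, and prompt ids are illustrative assumptions for demonstration, not the paper's exact configuration.

```python
# Minimal sketch of a pixel-to-sequence interface: task outputs become one
# flat sequence of discrete tokens, prefixed by a task-prompt token.
# NOTE: the bin count, token offsets, and prompt ids below are assumed
# for illustration; they are not the paper's actual values.

NUM_BINS = 1000            # coordinate quantization bins (assumed)
CLASS_OFFSET = NUM_BINS    # class tokens sit after the coordinate tokens
PROMPT = {"detect": 2000, "segment": 2001, "keypoints": 2002, "caption": 2003}

def quantize(value, extent):
    """Map a pixel coordinate in [0, extent] to one of NUM_BINS tokens."""
    idx = int(value / extent * (NUM_BINS - 1))
    return min(max(idx, 0), NUM_BINS - 1)

def box_to_tokens(box, class_id, image_hw):
    """Serialize one object as [ymin, xmin, ymax, xmax, class] tokens."""
    h, w = image_hw
    ymin, xmin, ymax, xmax = box
    return [
        quantize(ymin, h), quantize(xmin, w),
        quantize(ymax, h), quantize(xmax, w),
        CLASS_OFFSET + class_id,
    ]

def build_target_sequence(task, boxes, class_ids, image_hw):
    """Prefix a task-prompt token; the output format follows the prompt."""
    tokens = [PROMPT[task]]
    for box, cid in zip(boxes, class_ids):
        tokens += box_to_tokens(box, cid, image_hw)
    return tokens

# Example: two boxes on a 480x640 image become one flat token sequence
# that a single autoregressive model can be trained on with one loss.
seq = build_target_sequence(
    "detect",
    boxes=[(50, 60, 200, 220), (100, 300, 400, 620)],
    class_ids=[3, 17],
    image_hw=(480, 640),
)
print(seq)  # [2000, 104, 93, 416, 343, 1003, 208, 468, 832, 967, 1017]
```

Because every task reduces to next-token prediction over this shared vocabulary, a single architecture and a single cross-entropy loss suffice; only the prompt token and the serialization scheme vary per task.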
Related papers
- Masked AutoDecoder is Effective Multi-Task Vision Generalist [64.43215311406195]
Masked AutoDecoder (MAD) is an effective multi-task vision generalist.
First, we develop a parallel decoding framework that introduces bi-directional attention to capture contextual dependencies.
Second, we design a masked sequence modeling approach that learns rich task contexts by masking and reconstructing task sequences.
arXiv Detail & Related papers (2024-03-12T14:36:52Z)
- InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices.
Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z)
- Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving [85.62076860189116]
Video Task Decathlon (VTD) includes ten representative image and video tasks spanning classification, segmentation, localization, and association of objects and pixels.
We develop our unified network, VTDNet, that uses a single structure and a single set of weights for all ten tasks.
arXiv Detail & Related papers (2023-09-08T16:33:27Z)
- A Dynamic Feature Interaction Framework for Multi-task Visual Perception [100.98434079696268]
We devise an efficient unified framework to solve multiple common perception tasks.
These tasks include instance segmentation, semantic segmentation, monocular 3D detection, and depth estimation.
Our proposed framework, termed D2BNet, demonstrates a unique approach to parameter-efficient predictions for multi-task perception.
arXiv Detail & Related papers (2023-06-08T09:24:46Z)
- All in Tokens: Unifying Output Space of Visual Tasks via Soft Token [30.6086480249568]
We present a single unified model that simultaneously handles two typical visual tasks: instance segmentation and depth estimation.
We propose several new techniques that take into account the particularities of visual tasks.
We achieve 0.279 RMSE on the specific task of NYUv2 depth estimation, setting a new record on this benchmark.
arXiv Detail & Related papers (2023-01-05T18:55:20Z)
- Images Speak in Images: A Generalist Painter for In-Context Visual Learning [98.78475432114595]
In-context learning allows the model to rapidly adapt to various tasks with only a handful of prompts and examples.
However, it is unclear how to define general-purpose task prompts that the vision model can understand and transfer to out-of-domain tasks.
We present Painter, a generalist model which redefines the output of core vision tasks as images and specifies task prompts as images as well.
arXiv Detail & Related papers (2022-12-05T18:59:50Z)
- Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks [39.12025963907317]
Unified-IO is a model that performs a large variety of AI tasks, spanning classical computer vision, vision-and-language, and natural language processing tasks.
We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens.
Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark.
arXiv Detail & Related papers (2022-06-17T17:53:47Z)
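A minimal sketch of this kind of token homogenization follows, assuming an illustrative vocabulary partition (text subwords, VQ image codes, binned continuous values). The partition sizes and helper encoders are assumptions for demonstration; the real Unified-IO uses a learned text tokenizer and a VQ-GAN codec for images.

```python
# Sketch: homogenize heterogeneous inputs/outputs into one discrete
# vocabulary, in the spirit of Unified-IO. All sizes below are assumed
# for illustration, not the model's actual configuration.

TEXT_VOCAB = 32000    # assumed subword vocabulary size
IMAGE_CODES = 16384   # assumed VQ codebook size for image patches
NUM_BINS = 1000       # assumed bins for continuous values (coords, depths)

def text_tokens(subword_ids):
    """Text subword ids occupy the first slice of the shared vocabulary."""
    return list(subword_ids)

def image_tokens(vq_codes):
    """Image patch codes are shifted past the text slice."""
    return [TEXT_VOCAB + c for c in vq_codes]

def value_tokens(values, lo=0.0, hi=1.0):
    """Continuous values are binned and shifted past text and image slices."""
    base = TEXT_VOCAB + IMAGE_CODES
    out = []
    for v in values:
        b = int((v - lo) / (hi - lo) * (NUM_BINS - 1))
        out.append(base + min(max(b, 0), NUM_BINS - 1))
    return out

# One flat target sequence can then mix modalities freely, e.g. a short
# caption followed by a normalized bounding box:
target = text_tokens([17, 923, 4058]) + value_tokens([0.10, 0.12, 0.55, 0.60])
print(target)  # [17, 923, 4058, 48483, 48503, 48933, 48983]
```

With every input and output living in one shared token space, a single encoder-decoder can be trained across tasks without per-task heads, which is what enables a model like this to cover all seven GRIT tasks at once.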
This list is automatically generated from the titles and abstracts of the papers on this site.