A Unified Sequence Interface for Vision Tasks
- URL: http://arxiv.org/abs/2206.07669v1
- Date: Wed, 15 Jun 2022 17:08:53 GMT
- Title: A Unified Sequence Interface for Vision Tasks
- Authors: Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J. Fleet,
Geoffrey Hinton
- Abstract summary: We show that a diverse set of "core" computer vision tasks can be unified if formulated in terms of a shared pixel-to-sequence interface.
We focus on four tasks, namely, object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs.
We show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization.
- Score: 87.328893553186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While language tasks are naturally expressed in a single, unified
modeling framework, i.e., generating sequences of tokens, this has not been the case in
computer vision. As a result, there is a proliferation of distinct
architectures and loss functions for different vision tasks. In this work we
show that a diverse set of "core" computer vision tasks can also be unified if
formulated in terms of a shared pixel-to-sequence interface. We focus on four
tasks, namely, object detection, instance segmentation, keypoint detection, and
image captioning, all with diverse types of outputs, e.g., bounding boxes or
dense masks. Nevertheless, by formulating the output of each task as a sequence
of discrete tokens with a unified interface, we show that one can train a
neural network with a single model architecture and loss function on all these
tasks, with no task-specific customization. To solve a specific task, we use a
short prompt as task description, and the sequence output adapts to the prompt
so it can produce task-specific output. We show that such a model can achieve
competitive performance compared to well-established task-specific models.
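As a rough illustration of this pixel-to-sequence interface, the sketch below serializes detection targets as discrete tokens behind a task-prompt token. The bin count, vocabulary layout, and prompt ids are illustrative assumptions for demonstration, not the paper's exact configuration.

```python
# Minimal sketch of a pixel-to-sequence interface: task outputs become one
# flat sequence of discrete tokens, prefixed by a task-prompt token.
# NOTE: the bin count, token offsets, and prompt ids below are assumed
# for illustration; they are not the paper's actual values.

NUM_BINS = 1000            # coordinate quantization bins (assumed)
CLASS_OFFSET = NUM_BINS    # class tokens sit after the coordinate tokens
PROMPT = {"detect": 2000, "segment": 2001, "keypoints": 2002, "caption": 2003}

def quantize(value, extent):
    """Map a pixel coordinate in [0, extent] to one of NUM_BINS tokens."""
    idx = int(value / extent * (NUM_BINS - 1))
    return min(max(idx, 0), NUM_BINS - 1)

def box_to_tokens(box, class_id, image_hw):
    """Serialize one object as [ymin, xmin, ymax, xmax, class] tokens."""
    h, w = image_hw
    ymin, xmin, ymax, xmax = box
    return [
        quantize(ymin, h), quantize(xmin, w),
        quantize(ymax, h), quantize(xmax, w),
        CLASS_OFFSET + class_id,
    ]

def build_target_sequence(task, boxes, class_ids, image_hw):
    """Prefix a task-prompt token; the output format follows the prompt."""
    tokens = [PROMPT[task]]
    for box, cid in zip(boxes, class_ids):
        tokens += box_to_tokens(box, cid, image_hw)
    return tokens

# Example: two boxes on a 480x640 image become one flat token sequence
# that a single autoregressive model can be trained on with one loss.
seq = build_target_sequence(
    "detect",
    boxes=[(50, 60, 200, 220), (100, 300, 400, 620)],
    class_ids=[3, 17],
    image_hw=(480, 640),
)
print(seq)  # [2000, 104, 93, 416, 343, 1003, 208, 468, 832, 967, 1017]
```

Because every task reduces to next-token prediction over this shared vocabulary, a single architecture and a single cross-entropy loss suffice; only the prompt token and the serialization scheme vary per task.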
Related papers
- Masked AutoDecoder is Effective Multi-Task Vision Generalist [64.43215311406195]
Masked AutoDecoder (MAD) is an effective multi-task vision generalist.
First, we develop a parallel decoding framework that introduces bi-directional attention to capture contextual dependencies.
Second, we design a masked sequence modeling approach that learns rich task contexts by masking and reconstructing task sequences.
arXiv Detail & Related papers (2024-03-12T14:36:52Z)
- InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices.
Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z)
- Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving [85.62076860189116]
Video Task Decathlon (VTD) includes ten representative image and video tasks spanning classification, segmentation, localization, and association of objects and pixels.
We develop our unified network, VTDNet, that uses a single structure and a single set of weights for all ten tasks.
arXiv Detail & Related papers (2023-09-08T16:33:27Z)
- A Dynamic Feature Interaction Framework for Multi-task Visual Perception [100.98434079696268]
We devise an efficient unified framework to solve multiple common perception tasks.
These tasks include instance segmentation, semantic segmentation, monocular 3D detection, and depth estimation.
Our proposed framework, termed D2BNet, demonstrates a unique approach to parameter-efficient predictions for multi-task perception.
arXiv Detail & Related papers (2023-06-08T09:24:46Z)
- All in Tokens: Unifying Output Space of Visual Tasks via Soft Token [30.6086480249568]
We present a single unified model that simultaneously handles two typical visual tasks: instance segmentation and depth estimation.
We propose several new techniques that take into account the particularities of visual tasks.
We achieve 0.279 RMSE on the specific task of NYUv2 depth estimation, setting a new record on this benchmark.
arXiv Detail & Related papers (2023-01-05T18:55:20Z)
- Images Speak in Images: A Generalist Painter for In-Context Visual Learning [98.78475432114595]
In-context learning allows the model to rapidly adapt to various tasks with only a handful of prompts and examples.
However, it is unclear how to define general-purpose task prompts that the vision model can understand and transfer to out-of-domain tasks.
We present Painter, a generalist model which redefines the output of core vision tasks as images and specifies task prompts as images as well.
arXiv Detail & Related papers (2022-12-05T18:59:50Z)
- Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks [39.12025963907317]
Unified-IO is a model that performs a large variety of AI tasks, spanning classical computer vision, vision-and-language, and natural language processing tasks.
We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens.
Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark.
arXiv Detail & Related papers (2022-06-17T17:53:47Z)
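A minimal sketch of this kind of token homogenization follows, assuming an illustrative vocabulary partition (text subwords, VQ image codes, binned continuous values). The partition sizes and helper encoders are assumptions for demonstration; the real Unified-IO uses a learned text tokenizer and a VQ-GAN codec for images.

```python
# Sketch: homogenize heterogeneous inputs/outputs into one discrete
# vocabulary, in the spirit of Unified-IO. All sizes below are assumed
# for illustration, not the model's actual configuration.

TEXT_VOCAB = 32000    # assumed subword vocabulary size
IMAGE_CODES = 16384   # assumed VQ codebook size for image patches
NUM_BINS = 1000       # assumed bins for continuous values (coords, depths)

def text_tokens(subword_ids):
    """Text subword ids occupy the first slice of the shared vocabulary."""
    return list(subword_ids)

def image_tokens(vq_codes):
    """Image patch codes are shifted past the text slice."""
    return [TEXT_VOCAB + c for c in vq_codes]

def value_tokens(values, lo=0.0, hi=1.0):
    """Continuous values are binned and shifted past text and image slices."""
    base = TEXT_VOCAB + IMAGE_CODES
    out = []
    for v in values:
        b = int((v - lo) / (hi - lo) * (NUM_BINS - 1))
        out.append(base + min(max(b, 0), NUM_BINS - 1))
    return out

# One flat target sequence can then mix modalities freely, e.g. a short
# caption followed by a normalized bounding box:
target = text_tokens([17, 923, 4058]) + value_tokens([0.10, 0.12, 0.55, 0.60])
print(target)  # [17, 923, 4058, 48483, 48503, 48933, 48983]
```

With every input and output living in one shared token space, a single encoder-decoder can be trained across tasks without per-task heads, which is what enables a model like this to cover all seven GRIT tasks at once.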
This list is automatically generated from the titles and abstracts of the papers on this site.