Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
- URL: http://arxiv.org/abs/2206.08916v1
- Date: Fri, 17 Jun 2022 17:53:47 GMT
- Title: Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
- Authors: Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi,
Aniruddha Kembhavi
- Abstract summary: Unified-IO is a single model that performs a large variety of AI tasks spanning classical computer vision, vision-and-language, and natural language processing.
We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens.
Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose Unified-IO, a model that performs a large variety of AI tasks
spanning classical computer vision tasks (pose estimation, object detection,
depth estimation, and image generation), vision-and-language tasks (region
captioning and referring expression comprehension), and natural language
processing tasks (question answering and paraphrasing).
Developing a single unified model for such a large variety of tasks poses
unique challenges due to the heterogeneous inputs and outputs pertaining to
each task, including RGB images, per-pixel maps, binary masks, bounding boxes,
and language. We achieve this unification by homogenizing every supported input
and output into a sequence of discrete vocabulary tokens. This common
representation across all tasks allows us to train a single transformer-based
architecture, jointly on over 80 diverse datasets in the vision and language
fields. Unified-IO is the first model capable of performing all 7 tasks on the
GRIT benchmark and produces strong results across 16 diverse benchmarks like
NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, SWiG, VizWizGround, BoolQ, and SciTail,
with no task or benchmark specific fine-tuning. Demos for Unified-IO are
available at https://unified-io.allenai.org.
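The unification described above can be made concrete with a short sketch. The paper quantizes continuous coordinates into discrete location tokens and encodes images with a learned codebook (VQ-GAN), so every input and output becomes a token sequence over one shared vocabulary. The helper names, vocabulary sizes, and id layout below are illustrative assumptions, not the paper's actual configuration:

```python
# Illustrative sketch: homogenizing heterogeneous outputs into one discrete
# token vocabulary. Names, sizes, and id layout are assumptions for
# illustration, not the paper's actual configuration.

TEXT_VOCAB_SIZE = 32_000   # assume a SentencePiece-style text vocabulary
NUM_LOCATION_BINS = 1_000  # assume this many coordinate bins for boxes/points
# Location tokens occupy ids [TEXT_VOCAB_SIZE, TEXT_VOCAB_SIZE + NUM_LOCATION_BINS).

def location_token(coord: float, image_extent: float) -> int:
    """Quantize a continuous coordinate into a discrete location-token id."""
    bin_idx = min(int(coord / image_extent * NUM_LOCATION_BINS),
                  NUM_LOCATION_BINS - 1)
    return TEXT_VOCAB_SIZE + bin_idx

def box_to_tokens(box, width, height):
    """Encode an (x1, y1, x2, y2) bounding box as four location tokens."""
    x1, y1, x2, y2 = box
    return [location_token(x1, width), location_token(y1, height),
            location_token(x2, width), location_token(y2, height)]

# A detection target and a text answer are now both plain token sequences
# that the same transformer decoder can produce:
box_tokens = box_to_tokens((48.0, 32.0, 320.0, 240.0), width=640, height=480)
print(box_tokens)  # four ids in the location-token range: [32075, 32066, 32500, 32500]
```

Once a bounding box, an image patch, or an answer string all reduce to token ids in the same range-partitioned vocabulary, one sequence-to-sequence transformer and one cross-entropy loss can cover every task, which is what allows joint training on over 80 datasets without task-specific heads.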
Related papers
- GiT: Towards Generalist Vision Transformer through Universal Language Interface [94.33443158125186]
This paper proposes a simple yet effective framework, GiT, that is simultaneously applicable to various vision tasks using only a vanilla ViT.
GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning.
arXiv Detail & Related papers (2024-03-14T13:47:41Z) - VioLA: Unified Codec Language Models for Speech Recognition, Synthesis,
and Translation [91.39949385661379]
VioLA is a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text.
We first convert all the speech utterances to discrete tokens using an offline neural encoder.
We further integrate task IDs (TID) and language IDs (LID) into the model to improve its ability to handle different languages and tasks (a minimal sketch of this conditioning appears after this list).
arXiv Detail & Related papers (2023-05-25T14:39:47Z) - A Unified Sequence Interface for Vision Tasks [87.328893553186]
We show that a diverse set of "core" computer vision tasks can be unified if formulated in terms of a shared pixel-to-sequence interface.
We focus on four tasks, namely, object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs.
We show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization.
arXiv Detail & Related papers (2022-06-15T17:08:53Z) - Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular
Vision-Language Pre-training [120.91411454661741]
We present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception and generation.
Uni-EDEN is a two-stream Transformer-based structure consisting of three modules: object and sentence encoders that separately learn the representations of each modality, and a sentence decoder that enables multi-modal reasoning and sentence generation.
arXiv Detail & Related papers (2022-01-11T16:15:07Z) - Unifying Vision-and-Language Tasks via Text Generation [81.3910771082967]
We propose a unified framework that learns different tasks in a single architecture.
Our models learn to generate labels as text conditioned on the visual and textual inputs.
Our generative approach generalizes better when answering questions that have rare answers.
arXiv Detail & Related papers (2021-02-04T17:59:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.