Crossing the Format Boundary of Text and Boxes: Towards Unified
Vision-Language Modeling
- URL: http://arxiv.org/abs/2111.12085v1
- Date: Tue, 23 Nov 2021 18:59:14 GMT
- Title: Crossing the Format Boundary of Text and Boxes: Towards Unified
Vision-Language Modeling
- Authors: Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed,
Zicheng Liu, Yumao Lu, Lijuan Wang
- Abstract summary: UNICORN is a vision-language model that unifies text generation and bounding box prediction into a single architecture.
We formulate all vision-language problems as a generation task, where the target sequence consists of the integrated text and box tokens.
With such a unified framework and input-output format, UNICORN achieves comparable performance to task-specific state of the art on 7 VL benchmarks.
- Score: 50.370767959977506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose UNICORN, a vision-language (VL) model that unifies
text generation and bounding box prediction into a single architecture.
Specifically, we quantize each box into four discrete box tokens and serialize
them as a sequence, which can be integrated with text tokens. We formulate all
VL problems as a generation task, where the target sequence consists of the
integrated text and box tokens. We then train a transformer encoder-decoder to
predict the target in an auto-regressive manner. With such a unified framework
and input-output format, UNICORN achieves comparable performance to
task-specific state of the art on 7 VL benchmarks, covering the visual
grounding, grounded captioning, visual question answering, and image captioning
tasks. When trained with multi-task finetuning, UNICORN can approach different
VL tasks with a single set of parameters, thus crossing the downstream task
boundary. We show that having a single model not only saves parameters, but
also further boosts the model performance on certain tasks. Finally, UNICORN
shows the capability of generalizing to new tasks such as ImageNet object
localization.
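As a concrete illustration of the box-to-token scheme above, here is a minimal sketch of quantizing a box into four discrete tokens and decoding them back; the bin count and the <box>/</box> delimiters are illustrative assumptions, not details from the paper:

```python
# Sketch: quantize a bounding box into four discrete tokens and back.
# Coordinates are [x1, y1, x2, y2] in pixels; num_bins is illustrative.

def box_to_tokens(box, img_w, img_h, num_bins=1000):
    """Map a box to four integer token ids in [0, num_bins - 1]."""
    x1, y1, x2, y2 = box
    norm = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    return [min(int(v * num_bins), num_bins - 1) for v in norm]

def tokens_to_box(tokens, img_w, img_h, num_bins=1000):
    """Invert the quantization, up to bin resolution."""
    x1, y1, x2, y2 = [(t + 0.5) / num_bins for t in tokens]
    return [x1 * img_w, y1 * img_h, x2 * img_w, y2 * img_h]

# A grounded target sequence can then interleave text and box tokens,
# e.g. ["a", "dog", "<box>", 412, 210, 744, 880, "</box>"].
tokens = box_to_tokens([103.0, 52.5, 186.2, 220.0], img_w=250, img_h=250)
print(tokens)                           # [412, 210, 744, 880]
print(tokens_to_box(tokens, 250, 250))  # close to the original box
```

Because the box tokens live in the same target sequence as text tokens, grounding and captioning outputs can share a single decoder head, which is what lets the abstract formulate all VL problems as one generation task.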
Related papers
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language
Understanding and Generation [79.02357561313785]
We introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data.
VL-GPT adopts a unified pre-training approach for the image and text modalities, employing a straightforward auto-regressive objective.
arXiv Detail & Related papers (2023-12-14T18:59:43Z)
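Under one reading, the "straightforward auto-regressive objective" above amounts to next-token cross-entropy over a single interleaved sequence of visual and text tokens; a minimal PyTorch sketch, where `model` and all shapes are illustrative stand-ins:

```python
import torch.nn.functional as F

def autoregressive_loss(model, token_ids):
    """Next-token cross-entropy over a (batch, seq_len) LongTensor whose
    visual and text tokens share one vocabulary. `model` is any decoder
    mapping ids to (batch, seq, vocab) logits -- an illustrative stand-in."""
    logits = model(token_ids[:, :-1])   # predict token t from tokens < t
    targets = token_ids[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```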
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond [72.41822115096741]
We introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs).
We endow the model with visual capacity through a meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline (sketched after this entry), and (iv) multilingual multimodal cleaned corpus.
The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales.
arXiv Detail & Related papers (2023-08-24T17:59:17Z)
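One way to picture the 3-stage pipeline named above is as a schedule over which components are trainable at each stage; which modules train at each stage below is an assumption for illustration only, not a detail from the abstract:

```python
# Sketch of a staged VL training schedule (stage contents are assumed).
STAGES = [
    {"name": "pretraining",           "trainable": {"visual_receptor", "adapter"}},
    {"name": "multitask_pretraining", "trainable": {"visual_receptor", "adapter", "llm"}},
    {"name": "instruction_tuning",    "trainable": {"adapter", "llm"}},
]

def run_pipeline(modules, train_one_stage):
    """modules: dict of name -> nn.Module; train_one_stage: user callback."""
    for stage in STAGES:
        for name, module in modules.items():
            module.requires_grad_(name in stage["trainable"])
        train_one_stage(stage["name"], modules)
```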
- VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation [91.39949385661379]
VioLA is a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text.
We first convert all the speech utterances to discrete tokens using an offline neural encoder.
We further integrate task IDs (TID) and language IDs (LID) into the proposed model to enhance the modeling capability of handling different languages and tasks.
arXiv Detail & Related papers (2023-05-25T14:39:47Z)
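Integrating task IDs and language IDs, as VioLA describes, can be sketched as prepending two special tokens to the decoder input so that a single network can condition on task and language; the token names and ids here are illustrative, and real ids would be reserved in the shared vocabulary:

```python
# Sketch: prepend task-ID (TID) and language-ID (LID) tokens.
SPECIAL = {"<asr>": 0, "<tts>": 1, "<s2tt>": 2, "<en>": 3, "<zh>": 4}

def build_decoder_input(task, lang, content_tokens):
    """Return [TID, LID, content...] so one decoder can route by prefix."""
    return [SPECIAL[task], SPECIAL[lang]] + list(content_tokens)

print(build_decoder_input("<asr>", "<en>", [101, 102, 103]))
# -> [0, 3, 101, 102, 103]
```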
- Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks [39.12025963907317]
Unified-IO is a model that performs a large variety of AI tasks, spanning classical computer vision and beyond.
We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens.
Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark.
arXiv Detail & Related papers (2022-06-17T17:53:47Z)
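Homogenizing every input and output into one discrete vocabulary, as Unified-IO describes, can be pictured as giving each modality its own id range inside a shared token space; the sizes, offsets, and tokenizers below are assumptions for illustration:

```python
# Sketch: one shared discrete vocabulary covering text tokens, image
# codes, and quantized box coordinates. Sizes and offsets are illustrative.
TEXT_VOCAB, IMAGE_CODES, LOCATION_BINS = 32000, 16384, 1000

def encode_text(token_ids):
    return list(token_ids)                          # ids in [0, 32000)

def encode_image(image_codes):
    """`image_codes` come from a learned image tokenizer (e.g. a VQ model)."""
    return [TEXT_VOCAB + c for c in image_codes]    # ids in [32000, 48384)

def encode_box(bin_ids):
    assert all(0 <= b < LOCATION_BINS for b in bin_ids)
    base = TEXT_VOCAB + IMAGE_CODES
    return [base + b for b in bin_ids]              # ids in [48384, 49384)

# A detection-style target and a caption now live in one token space:
seq = encode_text([5, 17]) + encode_box([412, 210, 744, 880])
print(seq)  # [5, 17, 48796, 48594, 49128, 49264]
```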
- A Unified Sequence Interface for Vision Tasks [87.328893553186]
We show that a diverse set of "core" computer vision tasks can be unified if formulated in terms of a shared pixel-to-sequence interface.
We focus on four tasks, namely, object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs.
We show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization.
arXiv Detail & Related papers (2022-06-15T17:08:53Z)
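Training one architecture and one loss on all these tasks, as claimed above, reduces to sampling task-formatted (prompt, target) sequences and applying the same objective to each; a minimal sketch, where the prompt format and callbacks are illustrative:

```python
import random

def training_step(model, optimizer, sample_fn_by_task, loss_fn):
    """One model, one loss, many tasks: tasks differ only in how their
    examples are serialized into sequences."""
    task = random.choice(list(sample_fn_by_task))
    prompt, target = sample_fn_by_task[task]()  # e.g. "[detect]" -> box tokens
    loss = loss_fn(model, prompt, target)       # identical objective per task
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return task, loss
```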
- UFO: A UniFied TransfOrmer for Vision-Language Representation Learning [54.82482779792115]
We propose a single UniFied transfOrmer (UFO) capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question) for vision-language (VL) representation learning.
Existing approaches typically design an individual network for each modality and/or a specific fusion network for multimodal tasks.
arXiv Detail & Related papers (2021-11-19T03:23:10Z)
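A single transformer that accepts either unimodal or multimodal input, as UFO proposes, can be sketched by embedding each modality separately and concatenating whatever is present; the embedding and layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SharedVLEncoder(nn.Module):
    """One transformer over image and/or text embeddings; sizes are
    illustrative, and patch features are assumed precomputed."""
    def __init__(self, dim=768, vocab_size=30522, patch_dim=768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.patch_proj = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)

    def forward(self, text_ids=None, patch_feats=None):
        parts = []                              # collect whatever is present
        if patch_feats is not None:
            parts.append(self.patch_proj(patch_feats))
        if text_ids is not None:
            parts.append(self.text_embed(text_ids))
        return self.encoder(torch.cat(parts, dim=1))  # same net, any mix
```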