A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
- URL: http://arxiv.org/abs/2303.17376v1
- Date: Thu, 30 Mar 2023 13:42:58 GMT
- Title: A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
- Authors: Lucas Beyer, Bo Wan, Gagan Madan, Filip Pavetic, Andreas Steiner,
Alexander Kolesnikov, André Susano Pinto, Emanuele Bugliarello, Xiao Wang,
Qihang Yu, Liang-Chieh Chen, Xiaohua Zhai
- Abstract summary: We take a close look at autoregressive decoders for multi-task learning in multimodal computer vision.
A key finding is that a small decoder learned on top of a frozen pretrained encoder works surprisingly well.
It can be seen as teaching a decoder to interact with a pretrained vision model via natural language.
- Score: 93.90545426665999
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There has been a recent explosion of computer vision models which perform
many tasks and are composed of an image encoder (usually a ViT) and an
autoregressive decoder (usually a Transformer). However, most of this work
simply presents one system and its results, leaving many questions regarding
design decisions and trade-offs of such systems unanswered. In this work, we
aim to provide such answers. We take a close look at autoregressive decoders
for multi-task learning in multimodal computer vision, including
classification, captioning, visual question answering, and optical character
recognition. Through extensive systematic experiments, we study the effects of
task and data mixture, training and regularization hyperparameters,
conditioning type and specificity, modality combination, and more. Importantly,
we compare these to well-tuned single-task baselines to highlight the cost
incurred by multi-tasking. A key finding is that a small decoder learned on top
of a frozen pretrained encoder works surprisingly well. We call this setup
locked-image tuning with decoder (LiT-decoder). It can be seen as teaching a
decoder to interact with a pretrained vision model via natural language.
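To make the LiT-decoder setup concrete, below is a minimal PyTorch sketch of the general idea: a small autoregressive Transformer decoder that cross-attends to a frozen, pretrained image encoder and predicts text tokens. All module sizes, the dummy encoder, and the vocabulary size are illustrative assumptions and do not reproduce the paper's actual configuration.

```python
import torch
import torch.nn as nn


class LiTDecoder(nn.Module):
    """Frozen ("locked") image encoder + small autoregressive text decoder."""

    def __init__(self, image_encoder, enc_dim, vocab_size=32_000,
                 d_model=384, num_layers=2, num_heads=6, max_len=64):
        super().__init__()
        self.encoder = image_encoder
        for p in self.encoder.parameters():        # locked-image tuning: encoder stays frozen
            p.requires_grad = False
        self.proj = nn.Linear(enc_dim, d_model)    # map patch tokens to decoder width
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, num_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, text_tokens):
        # The encoder is assumed to return patch embeddings of shape (B, N, enc_dim);
        # the decoder cross-attends to them and predicts the next text token.
        with torch.no_grad():
            memory = self.encoder(images)
        memory = self.proj(memory)
        seq_len = text_tokens.size(1)
        x = self.token_emb(text_tokens) + self.pos_emb[:, :seq_len]
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                       device=x.device), diagonal=1)
        h = self.decoder(x, memory, tgt_mask=causal)
        return self.lm_head(h)                     # (B, seq_len, vocab_size) logits


class DummyEncoder(nn.Module):
    """Stand-in for a pretrained ViT; returns fake patch embeddings."""

    def forward(self, images):
        return torch.randn(images.size(0), 196, 768)


if __name__ == "__main__":
    model = LiTDecoder(DummyEncoder(), enc_dim=768)
    images = torch.randn(2, 3, 224, 224)
    tokens = torch.randint(0, 32_000, (2, 16))
    print(model(images, tokens).shape)             # torch.Size([2, 16, 32000])
```

In practice the DummyEncoder would be replaced by an actual pretrained ViT, and only the decoder, projection, and embedding parameters would be updated during training.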
Related papers
- Unveiling Encoder-Free Vision-Language Models [62.52803514667452]
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks.
We bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs.
We launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently.
arXiv Detail & Related papers (2024-06-17T17:59:44Z) - Task-Aware Encoder Control for Deep Video Compression [26.778793247958053]
We introduce an innovative encoder controller for deep video compression for machines.
This controller features a mode prediction and a Group of Pictures (GoP) selection module.
Our approach centralizes control at the encoding stage, allowing adjustments across different tasks.
arXiv Detail & Related papers (2024-04-07T07:42:04Z) - Zero-shot Prompt-based Video Encoder for Surgical Gesture Recognition [9.426097444566704]
We adapt a pre-trained vision-text model (CLIP) for gesture recognition in surgical videos.
This can utilize extensive outside data such as text, while also making use of label meta-data and weakly supervised contrastive losses.
Experiments show that the prompt-based video encoder outperforms standard encoders in surgical gesture recognition tasks.
arXiv Detail & Related papers (2024-03-28T19:10:54Z) - MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z) - Clover: Towards A Unified Video-Language Alignment and Fusion Model [154.1070559563592]
We introduce Clover, a Correlated Video-Language pre-training method.
It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task.
Clover establishes new state-of-the-art results on multiple downstream tasks.
arXiv Detail & Related papers (2022-07-16T09:38:52Z) - Auto-Encoder based Co-Training Multi-View Representation Learning [10.120166898507328]
We propose a novel algorithm called Auto-encoder based Co-training Multi-View Learning (ACMVL).
The algorithm has two stages: the first trains an auto-encoder for each view, and the second trains a supervised network.
Experimental results show that the algorithm learns a well-performing latent feature representation, and that the auto-encoder of each view has more powerful reconstruction ability than a traditional auto-encoder.
arXiv Detail & Related papers (2022-01-09T10:20:16Z) - Distilled Dual-Encoder Model for Vision-Language Understanding [50.42062182895373]
We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks.
We show that applying the cross-modal attention distillation for both pre-training and fine-tuning stages achieves further improvements.
arXiv Detail & Related papers (2021-12-16T09:21:18Z) - Video Exploration via Video-Specific Autoencoders [60.256055890647595]
We present video-specific autoencoders that enable human-controllable video exploration.
We observe that a simple autoencoder trained on multiple frames of a specific video enables one to perform a large variety of video processing and editing tasks.
arXiv Detail & Related papers (2021-03-31T17:56:13Z) - Human-Machine Collaborative Video Coding Through Cuboidal Partitioning [26.70051123157869]
We propose a video coding framework that leverages the commonality between human vision and machine vision applications using cuboids.
Cuboids, estimated rectangular regions over a video frame, are computationally efficient, have a compact representation, and are object-centric.
Here, cuboidal feature descriptors are extracted from the current frame and then employed to accomplish a machine vision task in the form of object detection.
arXiv Detail & Related papers (2021-02-02T04:44:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.