A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
- URL: http://arxiv.org/abs/2303.17376v1
- Date: Thu, 30 Mar 2023 13:42:58 GMT
- Title: A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
- Authors: Lucas Beyer, Bo Wan, Gagan Madan, Filip Pavetic, Andreas Steiner,
Alexander Kolesnikov, André Susano Pinto, Emanuele Bugliarello, Xiao Wang,
Qihang Yu, Liang-Chieh Chen, Xiaohua Zhai
- Abstract summary: We take a close look at autoregressive decoders for multi-task learning in multimodal computer vision.
A key finding is that a small decoder learned on top of a frozen pretrained encoder works surprisingly well.
It can be seen as teaching a decoder to interact with a pretrained vision model via natural language.
- Score: 93.90545426665999
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There has been a recent explosion of computer vision models which perform
many tasks and are composed of an image encoder (usually a ViT) and an
autoregressive decoder (usually a Transformer). However, most of this work
simply presents one system and its results, leaving many questions regarding
design decisions and trade-offs of such systems unanswered. In this work, we
aim to provide such answers. We take a close look at autoregressive decoders
for multi-task learning in multimodal computer vision, including
classification, captioning, visual question answering, and optical character
recognition. Through extensive systematic experiments, we study the effects of
task and data mixture, training and regularization hyperparameters,
conditioning type and specificity, modality combination, and more. Importantly,
we compare these to well-tuned single-task baselines to highlight the cost
incurred by multi-tasking. A key finding is that a small decoder learned on top
of a frozen pretrained encoder works surprisingly well. We call this setup
locked-image tuning with decoder (LiT-decoder). It can be seen as teaching a
decoder to interact with a pretrained vision model via natural language.
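The LiT-decoder setup is simple enough to sketch concretely. Below is a minimal, hypothetical PyTorch rendition (the class names, sizes, and the `TinyViTStub` stand-in are illustrative assumptions, not the authors' code): a frozen pretrained image encoder supplies patch features, and a small autoregressive Transformer decoder cross-attends to them to emit text, so classification, captioning, VQA, and OCR can all be cast as text generation.

```python
# Minimal sketch of the LiT-decoder idea: a small autoregressive text decoder
# trained on top of a *frozen* pretrained image encoder. All names and sizes
# are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class TinyViTStub(nn.Module):
    """Stand-in for a pretrained ViT: maps images to patch features (B, N, D)."""
    def __init__(self, d_model=512, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)

    def forward(self, images):                            # (B, 3, H, W)
        return self.proj(images).flatten(2).transpose(1, 2)  # (B, N, D)


class LiTDecoder(nn.Module):
    def __init__(self, encoder, vocab_size=32000, d_model=512,
                 n_heads=8, n_layers=2, max_len=64):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False       # "locked image": the encoder stays frozen

        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, text_tokens):
        with torch.no_grad():                             # no gradients into the encoder
            memory = self.encoder(images)                 # (B, N, D) patch features
        x = self.embed(text_tokens) + self.pos[: text_tokens.size(1)]
        L = text_tokens.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf"),
                                       device=x.device), diagonal=1)
        h = self.decoder(x, memory, tgt_mask=causal)      # causal self-attn + cross-attn
        return self.lm_head(h)                            # next-token logits per position


model = LiTDecoder(TinyViTStub())
images = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 32000, (2, 16))                 # e.g. a task prompt + target text
logits = model(images, tokens)                            # (2, 16, 32000)
```

Because only the decoder receives gradients, the extra cost of multi-tasking is confined to a small text model; different tasks can share this decoder and be distinguished purely by the prompt, which is one reading of "teaching a decoder to interact with a pretrained vision model via natural language".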
Related papers
- Multimodal Autoregressive Pre-training of Large Vision Encoders [85.39154488397931]
We present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process.
Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification.
arXiv Detail & Related papers (2024-11-21T18:31:25Z)
- Task-Aware Encoder Control for Deep Video Compression [26.778793247958053]
We introduce an innovative encoder controller for deep video compression for machines.
This controller features a mode-prediction module and a Group of Pictures (GoP) selection module.
Our approach centralizes control at the encoding stage, allowing adjustments across different tasks.
arXiv Detail & Related papers (2024-04-07T07:42:04Z)
- Zero-shot Prompt-based Video Encoder for Surgical Gesture Recognition [9.426097444566704]
We adapt a pre-trained vision-text model (CLIP) for gesture recognition in surgical videos.
This can utilize extensive outside data such as text, while also making use of label meta-data and weakly supervised contrastive losses.
Experiments show that the prompt-based video encoder outperforms standard encoders in surgical gesture recognition tasks.
arXiv Detail & Related papers (2024-03-28T19:10:54Z)
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering, and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z)
- Clover: Towards A Unified Video-Language Alignment and Fusion Model [154.1070559563592]
We introduce Clover, a Correlated Video-Language pre-training method.
It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task.
Clover establishes new state-of-the-art results on multiple downstream tasks.
arXiv Detail & Related papers (2022-07-16T09:38:52Z)
- Auto-Encoder based Co-Training Multi-View Representation Learning [10.120166898507328]
We propose a novel algorithm called Auto-encoder based Co-training Multi-View Learning (ACMVL).
The algorithm has two stages: the first trains an auto-encoder for each view, and the second trains a supervised network.
Experiments show that the algorithm learns a well-performing latent feature representation, and that the auto-encoder of each view has stronger reconstruction ability than a traditional auto-encoder.
arXiv Detail & Related papers (2022-01-09T10:20:16Z)
- Distilled Dual-Encoder Model for Vision-Language Understanding [50.42062182895373]
We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks.
We show that applying cross-modal attention distillation in both the pre-training and fine-tuning stages yields further improvements.
arXiv Detail & Related papers (2021-12-16T09:21:18Z)
- Video Exploration via Video-Specific Autoencoders [60.256055890647595]
We present video-specific autoencoders that enable human-controllable video exploration.
We observe that a simple autoencoder trained on multiple frames of a specific video enables one to perform a large variety of video processing and editing tasks.
arXiv Detail & Related papers (2021-03-31T17:56:13Z)
- Human-Machine Collaborative Video Coding Through Cuboidal Partitioning [26.70051123157869]
We propose a video coding framework that leverages the commonality between human vision and machine vision applications using cuboids.
Cuboids, estimated rectangular regions over a video frame, are computationally efficient, have a compact representation, and are object-centric.
Cuboidal feature descriptors are extracted from the current frame and then employed to accomplish a machine vision task in the form of object detection.
arXiv Detail & Related papers (2021-02-02T04:44:45Z)