GLID: Pre-training a Generalist Encoder-Decoder Vision Model
- URL: http://arxiv.org/abs/2404.07603v1
- Date: Thu, 11 Apr 2024 09:43:07 GMT
- Title: GLID: Pre-training a Generalist Encoder-Decoder Vision Model
- Authors: Jihao Liu, Jinliang Zheng, Yu Liu, Hongsheng Li
- Abstract summary: We propose a GeneraLIst encoder-Decoder (GLID) pre-training method for better handling various downstream computer vision tasks.
GLID allows the pre-trained generalist encoder-decoder to be fine-tuned on various vision tasks with minimal task-specific architecture modifications.
GLID achieves competitive performance on various vision tasks, including object detection, image segmentation, pose estimation, and depth estimation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes a GeneraLIst encoder-Decoder (GLID) pre-training method for better handling various downstream computer vision tasks. While self-supervised pre-training approaches, e.g., Masked Autoencoder, have shown success in transfer learning, task-specific sub-architectures still need to be appended for different downstream tasks, and these sub-architectures cannot enjoy the benefits of large-scale pre-training. GLID overcomes this challenge by allowing the pre-trained generalist encoder-decoder to be fine-tuned on various vision tasks with minimal task-specific architecture modifications. In the GLID training scheme, the pre-training pretext task and the downstream tasks are all modeled as "query-to-answer" problems. We pre-train a task-agnostic encoder-decoder with query-mask pairs. During fine-tuning, GLID keeps the pre-trained encoder-decoder and queries, replacing only the topmost linear transformation layer with task-specific linear heads. This minimizes the pretrain-finetune architecture inconsistency and enables the pre-trained model to better adapt to downstream tasks. GLID achieves competitive performance on various vision tasks, including object detection, image segmentation, pose estimation, and depth estimation, outperforming or matching specialist models such as Mask2Former, DETR, ViTPose, and BinsFormer.
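As a rough illustration of this scheme, the sketch below shows a task-agnostic encoder-decoder with learnable queries whose topmost linear layer is the only task-specific part. All class names, dimensions, and the pre-training head are assumptions for illustration, not the released GLID code.

```python
import torch
import torch.nn as nn

class GeneralistEncoderDecoder(nn.Module):
    """Illustrative sketch of the "query-to-answer" scheme: a shared
    encoder-decoder and learnable queries are pre-trained once, and
    fine-tuning swaps only the topmost linear head. Sizes and names
    are assumptions, not the official GLID implementation."""

    def __init__(self, dim=256, num_queries=100, depth=6, heads=8):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True), depth)
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        # Topmost linear layer: the only module replaced per downstream task.
        self.head = nn.Linear(dim, dim)  # assumed pre-training (pixel) head

    def forward(self, patch_tokens):                 # (B, N, dim) image tokens
        memory = self.encoder(patch_tokens)
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        answers = self.decoder(q, memory)            # queries attend to the image
        return self.head(answers)                    # one "answer" per query

    def replace_head(self, out_dim):
        """Fine-tuning: keep encoder, decoder, and queries; swap the head."""
        self.head = nn.Linear(self.queries.size(-1), out_dim)

model = GeneralistEncoderDecoder()
model.replace_head(out_dim=80)    # e.g., adapt to an 80-class detection head
```

Because only `head` changes at fine-tuning time, the pretrain-finetune architecture gap described in the abstract stays minimal.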
Related papers
- How Effective is Pre-training of Large Masked Autoencoders for Downstream Earth Observation Tasks?
Self-supervised pre-training has proven highly effective for many computer vision tasks.
It remains unclear under which conditions pre-trained models offer significant advantages over training from scratch.
arXiv Detail & Related papers (2024-09-27T08:15:14Z)
- Benchmarking Robust Self-Supervised Learning Across Diverse Downstream Tasks
We present a comprehensive empirical evaluation of the adversarial robustness of self-supervised vision encoders across multiple downstream tasks.
Our attacks operate in the encoder embedding space and at the downstream task output level.
Since the purpose of a foundation model is to cater to multiple applications at once, our findings reveal the need to enhance encoder robustness more broadly.
arXiv Detail & Related papers (2024-07-17T14:12:34Z)
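The exact attack formulation isn't given in this summary; below is a minimal sketch of a generic embedding-space attack of the kind described, a PGD that pushes the frozen encoder's embedding away from the clean one. The step size, budget, and L2 objective are assumptions.

```python
import torch

def embedding_space_pgd(encoder, x, eps=8/255, alpha=2/255, steps=10):
    """Sketch of an embedding-space attack: perturb the input so the
    frozen encoder's embedding drifts as far as possible from the clean
    embedding. Encoder is assumed frozen and in eval mode; all
    hyperparameters are illustrative, not from the paper."""
    with torch.no_grad():
        clean_emb = encoder(x)                    # reference to move away from
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        adv_emb = encoder(x + delta)
        # Maximize L2 distance between adversarial and clean embeddings.
        loss = (adv_emb - clean_emb).pow(2).sum(dim=-1).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()    # gradient-ascent step
            delta.clamp_(-eps, eps)               # stay inside the eps-ball
            delta.grad = None
    return (x + delta).detach()                   # (optionally clamp to [0, 1])
```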
- Masked AutoDecoder is Effective Multi-Task Vision Generalist
Masked AutoDecoder (MAD) is an effective multi-task vision generalist.
First, we develop a parallel decoding framework that introduces bi-directional attention to capture contextual dependencies.
Second, we design a masked sequence modeling approach that learns rich task contexts by masking and reconstructing task sequences.
arXiv Detail & Related papers (2024-03-12T14:36:52Z)
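A toy sketch of the masked task-sequence idea described above: task tokens are randomly masked and reconstructed in parallel with bi-directional (non-causal) attention. Vocabulary, sizes, and the masking scheme are assumptions, and cross-attention to image features is omitted for brevity; this is not MAD's actual implementation.

```python
import torch
import torch.nn as nn

class MaskedTaskDecoder(nn.Module):
    """Toy masked sequence modeling: replace a random subset of task
    tokens with a [MASK] id, decode all positions in parallel with
    bi-directional attention, and reconstruct the masked tokens."""

    def __init__(self, vocab=1000, dim=256, heads=8, depth=4, mask_id=0):
        super().__init__()
        self.mask_id = mask_id
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, depth)  # no causal mask
        self.out = nn.Linear(dim, vocab)

    def forward(self, task_tokens, mask_ratio=0.5):
        corrupted = task_tokens.clone()
        masked = torch.rand_like(task_tokens, dtype=torch.float) < mask_ratio
        corrupted[masked] = self.mask_id
        hidden = self.decoder(self.embed(corrupted))  # every position sees all others
        logits = self.out(hidden)
        # Loss only on masked positions, as in masked sequence modeling.
        return nn.functional.cross_entropy(logits[masked], task_tokens[masked])

tokens = torch.randint(1, 1000, (2, 32))   # a batch of toy task sequences
loss = MaskedTaskDecoder()(tokens)
```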
- Pre-train, Adapt and Detect: Multi-Task Adapter Tuning for Camouflaged Object Detection
We propose a novel 'pre-train, adapt and detect' paradigm to detect camouflaged objects.
By introducing a large pre-trained model, abundant knowledge learned from massive multi-modal data can be directly transferred to COD.
Our method outperforms existing state-of-the-art COD models by large margins.
arXiv Detail & Related papers (2023-07-20T08:25:38Z)
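The summary does not spell out the adapter design; below is a generic bottleneck-adapter sketch of the kind commonly inserted into a frozen pre-trained backbone, with dimensions and placement assumed for illustration.

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic adapter sketch: a small bottleneck MLP with a residual
    connection, inserted into a frozen backbone so only the adapter
    (a tiny fraction of parameters) is trained for the new task."""

    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project down
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)     # project back up
        nn.init.zeros_(self.up.weight)           # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        # Residual keeps the pre-trained behavior at initialization.
        return x + self.up(self.act(self.down(x)))
```

During the 'adapt' stage of such a paradigm, the backbone stays frozen and only the adapters (plus the detection head) receive gradients.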
- Task Residual for Tuning Vision-Language Models
We propose a new efficient tuning approach for vision-language models (VLMs) named Task Residual Tuning (TaskRes).
TaskRes explicitly decouples the prior knowledge of the pre-trained models and new knowledge regarding a target task.
The proposed TaskRes is simple yet effective, which significantly outperforms previous methods on 11 benchmark datasets.
arXiv Detail & Related papers (2022-11-18T15:09:03Z)
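The decoupling TaskRes describes can be sketched as a frozen classifier (e.g., built from CLIP text embeddings of the class names) plus a learnable task-specific residual. The code below is an illustrative reading of that idea, not the authors' implementation; the scaling factor and the omitted temperature are assumptions.

```python
import torch
import torch.nn as nn

class TaskResidualClassifier(nn.Module):
    """Sketch of the decoupling: prior knowledge lives in frozen base
    weights; task knowledge is a small learnable residual on top."""

    def __init__(self, base_weights, alpha=0.5):
        super().__init__()
        self.register_buffer("base", base_weights)        # frozen prior, (C, D)
        self.residual = nn.Parameter(torch.zeros_like(base_weights))
        self.alpha = alpha

    def forward(self, image_features):                    # (B, D), assumed normalized
        w = self.base + self.alpha * self.residual        # prior + task residual
        w = w / w.norm(dim=-1, keepdim=True)
        return image_features @ w.t()                     # cosine-similarity logits

text_weights = torch.randn(10, 512)   # stand-in for CLIP class embeddings
clf = TaskResidualClassifier(text_weights)
```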
- Effective Adaptation in Multi-Task Co-Training for Unified Autonomous Driving
In this paper, we investigate the transfer performance of various types of self-supervised methods, including MoCo and SimCLR, on three downstream tasks.
We find that their performances are sub-optimal or even lag far behind the single-task baseline.
We propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training.
arXiv Detail & Related papers (2022-09-19T12:15:31Z)
- Pro-tuning: Unified Prompt Tuning for Vision Tasks
Fine-tuning is the de facto approach to leveraging pre-trained vision models for downstream tasks.
In this work, we propose parameter-efficient Prompt tuning (Pro-tuning) to adapt frozen vision models to various downstream vision tasks.
arXiv Detail & Related papers (2022-07-28T21:09:31Z)
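Pro-tuning's exact prompt design isn't described in this summary; the sketch below shows the generic visual-prompt-tuning pattern it builds on, with learnable prompt tokens prepended to a frozen transformer's input. The `vit` module, sizes, and head are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VisualPromptedViT(nn.Module):
    """Generic prompt-tuning sketch: the backbone is frozen, and only a
    few learnable prompt tokens (plus a linear head) are trained per
    task. `vit` is assumed to map token sequences to token outputs."""

    def __init__(self, vit, dim=768, num_prompts=10, num_classes=100):
        super().__init__()
        self.vit = vit
        for p in self.vit.parameters():
            p.requires_grad = False               # keep pre-trained weights frozen
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens):              # (B, N, dim) embedded patches
        b = patch_tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        tokens = torch.cat([prompts, patch_tokens], dim=1)  # prepend prompts
        out = self.vit(tokens)                    # frozen transformer blocks
        return self.head(out[:, 0])               # classify from the first token
```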
- Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
We propose OFA, a unified multimodal pretrained model that unifies modalities (i.e., cross-modality, vision, language) and tasks (e.g., image generation, visual grounding, image captioning, image classification, text generation, etc.).
OFA achieves new state-of-the-art results on a series of multimodal tasks, including image captioning (COCO test CIDEr: 149.6), text-to-image generation (COCO test FID: 10.5), VQA (test-std acc.: 80.02), and SNLI-VE (test acc.: 90.).
arXiv Detail & Related papers (2022-02-07T10:38:21Z)
- Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
We present a generic perception architecture named Uni-Perceiver.
It processes a variety of modalities and tasks with unified modeling and shared parameters.
Results show that our pre-trained model without any tuning can achieve reasonable performance even on novel tasks.
arXiv Detail & Related papers (2021-12-02T18:59:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.