PolyViT: Co-training Vision Transformers on Images, Videos and Audio
- URL: http://arxiv.org/abs/2111.12993v1
- Date: Thu, 25 Nov 2021 10:01:05 GMT
- Title: PolyViT: Co-training Vision Transformers on Images, Videos and Audio
- Authors: Valerii Likhosherstov, Anurag Arnab, Krzysztof Choromanski, Mario
Lucic, Yi Tay, Adrian Weller, Mostafa Dehghani
- Abstract summary: We present PolyViT, a model trained on image, audio and video.
By co-training different tasks on a single modality, we are able to improve the accuracy of each individual task.
We show that co-training is simple and practical to implement.
- Score: 80.0913507142036
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Can we train a single transformer model capable of processing multiple
modalities and datasets, whilst sharing almost all of its learnable parameters?
We present PolyViT, a model trained on image, audio and video which answers
this question. By co-training different tasks on a single modality, we are able
to improve the accuracy of each individual task and achieve state-of-the-art
results on 5 standard video- and audio-classification datasets. Co-training
PolyViT on multiple modalities and tasks leads to a model that is even more
parameter-efficient, and learns representations that generalize across multiple
domains. Moreover, we show that co-training is simple and practical to
implement, as we do not need to tune hyperparameters for each combination of
datasets, but can simply adapt those from standard, single-task training.
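As a rough illustration of the co-training setup the abstract describes, the sketch below shares one transformer encoder across modality-specific tokenizers and task-specific heads, and alternates between tasks across training steps. All module names, dimensions, class counts, and the loss are illustrative stand-ins, not the authors' implementation.

```python
# Minimal co-training sketch: one shared transformer encoder, per-modality
# tokenizers, per-task classification heads. Hypothetical stand-in code,
# not the PolyViT implementation.
import random
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """A single transformer encoder whose parameters every task reuses."""
    def __init__(self, dim=256, depth=4, num_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, tokens):                      # tokens: (B, N, dim)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        out = self.encoder(torch.cat([cls, tokens], dim=1))
        return out[:, 0]                            # class-token representation

# Modality-specific tokenizers (image patches, video tubelets, audio-spectrogram
# patches in the paper); here just linear projections of flattened patches.
tokenizers = nn.ModuleDict({
    "image": nn.Linear(3 * 16 * 16, 256),
    "video": nn.Linear(3 * 2 * 16 * 16, 256),
    "audio": nn.Linear(16 * 16, 256),
})
encoder = SharedEncoder()
heads = nn.ModuleDict({                             # one small head per task
    "image_cls": nn.Linear(256, 1000),
    "video_cls": nn.Linear(256, 400),
    "audio_cls": nn.Linear(256, 527),
})
task_to_modality = {"image_cls": "image", "video_cls": "video", "audio_cls": "audio"}

params = (list(tokenizers.parameters()) + list(encoder.parameters())
          + list(heads.parameters()))
opt = torch.optim.AdamW(params, lr=3e-4)
loss_fn = nn.CrossEntropyLoss()                     # losses simplified for the sketch

def co_training_step(batches):
    """Sample one task, run its batch through the shared encoder, and update
    with that task's loss only (one of several possible sampling schedules)."""
    task = random.choice(list(batches))
    patches, labels = batches[task]
    tokens = tokenizers[task_to_modality[task]](patches)
    logits = heads[task](encoder(tokens))
    loss = loss_fn(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return task, loss.item()

# Toy usage with random tensors standing in for real image/video/audio batches.
batches = {
    "image_cls": (torch.randn(8, 196, 3 * 16 * 16), torch.randint(0, 1000, (8,))),
    "video_cls": (torch.randn(8, 128, 3 * 2 * 16 * 16), torch.randint(0, 400, (8,))),
    "audio_cls": (torch.randn(8, 64, 16 * 16), torch.randint(0, 527, (8,))),
}
print(co_training_step(batches))
```

In this sketch each step updates the shared encoder with a single task's loss; the paper studies several such task-sampling schedules, and the point of co-training is that the encoder parameters are reused by every task and modality.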
Related papers
- 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities [17.374241865041856]
We show the possibility of training one model to solve at least 3x more tasks/modalities than existing ones and doing so without a loss in performance.
We successfully scale the training to a three billion parameter model using tens of modalities and different datasets.
The resulting models and training code are open sourced at 4m.epfl.ch.
arXiv Detail & Related papers (2024-06-13T17:59:42Z)
- VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization [115.64739269488965]
VimTS enhances the generalization ability of the model by achieving better synergy among different tasks.
We propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm.
For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method on ICDAR2015 video and DSText v2.
arXiv Detail & Related papers (2024-04-30T15:49:03Z)
- An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models to vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning (a minimal sketch of this recipe follows the list below).
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL).
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)
- Polyhistor: Parameter-Efficient Multi-Task Adaptation for Dense Vision Tasks [36.34331439747556]
We propose Polyhistor and Polyhistor-Lite to share information across different tasks with a few trainable parameters.
Specifically, Polyhistor achieves competitive accuracy compared to the state-of-the-art while only using 10% of their trainable parameters.
arXiv Detail & Related papers (2022-10-07T00:25:02Z)
- MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
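For the eP-ALM entry above, the following is a minimal sketch of the recipe it describes: freeze the pretrained backbones and train only a linear projection plus one prepended soft token. The modules are generic stand-ins, not the authors' code or any specific pretrained-model API.

```python
# Minimal sketch of the parameter-efficient recipe summarized above: freeze the
# pretrained backbones, train only a linear projection and one prepended soft
# token. Generic stand-in modules, not eP-ALM's actual code.
import torch
import torch.nn as nn

class FrozenBackbone(nn.Module):
    """Stand-in for any pretrained encoder or language model kept frozen."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Linear(in_dim, out_dim)
        for p in self.parameters():
            p.requires_grad = False                 # frozen: >99% of parameters

    def forward(self, x):
        return self.net(x)

vision_encoder = FrozenBackbone(768, 768)           # frozen visual feature extractor
language_model = FrozenBackbone(512, 512)           # frozen LM over embeddings

# The only trainable pieces: a linear projection into the LM's embedding space
# and a single soft token prepended to the input sequence.
projection = nn.Linear(768, 512)
soft_token = nn.Parameter(torch.zeros(1, 1, 512))
trainable = list(projection.parameters()) + [soft_token]
opt = torch.optim.AdamW(trainable, lr=1e-4)

def forward_pass(image_feats, text_embeds):
    """image_feats: (B, 768) frozen visual features; text_embeds: (B, T, 512)."""
    visual = projection(vision_encoder(image_feats)).unsqueeze(1)   # (B, 1, 512)
    prompt = soft_token.expand(text_embeds.size(0), -1, -1)         # (B, 1, 512)
    sequence = torch.cat([prompt, visual, text_embeds], dim=1)      # (B, T+2, 512)
    return language_model(sequence)                                  # LM stays frozen

out = forward_pass(torch.randn(4, 768), torch.randn(4, 7, 512))
print(out.shape, sum(p.numel() for p in trainable), "trainable parameters")
```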