eP-ALM: Efficient Perceptual Augmentation of Language Models
- URL: http://arxiv.org/abs/2303.11403v4
- Date: Fri, 27 Oct 2023 16:38:40 GMT
- Title: eP-ALM: Efficient Perceptual Augmentation of Language Models
- Authors: Mustafa Shukor, Corentin Dancette, Matthieu Cord
- Abstract summary: We propose to direct effort toward efficient adaptation of existing models, and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
- Score: 70.47962271121389
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have so far impressed the world, with
unprecedented capabilities that emerge in models at large scales. On the vision
side, transformer models (i.e., ViT) are following the same trend, achieving
the best performance on challenging benchmarks. With the abundance of such
unimodal models, a natural question arises: do we also need to follow this
trend to tackle multimodal tasks? In this work, we propose instead to direct
effort toward efficient adaptation of existing models, and to augment Language
Models with perception. Existing approaches for adapting pretrained
models for vision-language tasks still rely on several key components that
hinder their efficiency. In particular, they still train a large number of
parameters, rely on large multimodal pretraining, use encoders (e.g., CLIP)
trained on huge image-text datasets, and add significant inference overhead. In
addition, most of these approaches have focused on Zero-Shot and In Context
Learning, with little to no effort on direct finetuning. We investigate the
minimal computational effort needed to adapt unimodal models for multimodal
tasks and propose a new challenging setup, alongside different approaches, that
efficiently adapts unimodal pretrained models. We show that by freezing more
than 99% of total parameters, training only one linear projection layer, and
prepending only one trainable token, our approach (dubbed eP-ALM) significantly
outperforms other baselines on VQA and Captioning across Image, Video, and
Audio modalities, following the proposed setup. The code is available here:
https://github.com/mshukor/eP-ALM.
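To make the recipe concrete, here is a minimal PyTorch-style sketch of this kind of adaptation (an illustration built on assumed placeholder modules, not the released eP-ALM code): both pretrained backbones are frozen, and only a single linear projection plus a single prepended soft token are trained.

```python
import torch
import torch.nn as nn

class PerceptualLMWrapper(nn.Module):
    """Illustrative sketch (not the released eP-ALM code): freeze the vision
    encoder and the language model, train only one linear projection and one
    prepended soft token."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vis_dim: int, lm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.language_model = language_model
        # Freeze >99% of the parameters: both pretrained backbones.
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.language_model.parameters():
            p.requires_grad = False
        # The only trainable pieces: one projection and one soft token.
        self.proj = nn.Linear(vis_dim, lm_dim)
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        # text_embeds: (batch, seq_len, lm_dim) embeddings of the text prompt.
        with torch.no_grad():
            vis_feat = self.vision_encoder(images)        # (batch, vis_dim)
        vis_token = self.proj(vis_feat).unsqueeze(1)      # (batch, 1, lm_dim)
        soft = self.soft_token.expand(images.size(0), -1, -1)
        # Prepend the soft token and the projected visual token to the text.
        inputs = torch.cat([soft, vis_token, text_embeds], dim=1)
        return self.language_model(inputs)
```

The released implementation differs in its details (e.g., where and how the perceptual tokens are injected into the LM); the sketch only shows how small the trainable part can be.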
Related papers
- ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods by 0.77% average accuracy on the ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z)
- POA: Pre-training Once for Models of All Sizes [33.72644336390202]
We propose a novel tri-branch self-supervised training framework, termed as POA (Pre-training Once for All)
Our approach introduces an innovative elastic student branch into a modern self-distillation paradigm.
It achieves state-of-the-art performance using ViT, Swin Transformer and ResNet backbones.
arXiv Detail & Related papers (2024-08-02T06:13:29Z)
- Super Tiny Language Models [3.8353434814956517]
This paper introduces a series of research efforts focused on Super Tiny Language Models (STLMs)
We explore innovative techniques such as byte-level tokenization with a pooling mechanism, weight tying, and efficient training strategies.
Our ultimate goal is to make high-performance language models more accessible and practical for a wide range of applications.
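Of these techniques, weight tying is a standard trick that is easy to illustrate; the snippet below is a generic sketch (not this paper's code) in which the input embedding and the output projection share a single matrix, removing an entire vocab-by-dimension parameter block.

```python
import torch
import torch.nn as nn

class TiedEmbedding(nn.Module):
    """Generic weight-tying sketch: the token embedding and the output
    projection share a single (vocab_size, d_model) matrix."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(vocab_size, d_model) * 0.02)

    def embed(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.weight[token_ids]            # (batch, seq, d_model)

    def logits(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden @ self.weight.t()          # (batch, seq, vocab_size)
```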
arXiv Detail & Related papers (2024-05-23T04:12:49Z)
- Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA)
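Low-rank adaptation is itself a well-known technique; the following is a generic LoRA layer sketch (not the MMLoRA implementation) showing how a frozen pretrained weight is adapted through two small trainable factors.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA sketch: the pretrained weight stays frozen and only the
    rank-r factors A and B are trained; y = base(x) + (alpha/r) * B(A(x))."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # frozen pretrained projection
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)       # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```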
arXiv Detail & Related papers (2023-10-08T15:01:54Z)
- Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts [14.610244867640471]
Recent vision-language models are driven by large-scale pretrained models.
We introduce a parameter-efficient method to address challenges such as overfitting, catastrophic forgetting, and the cross-modal gap between vision and language.
Our experiments on several video question answering benchmarks demonstrate the superiority of our approach in terms of performance and parameter efficiency.
arXiv Detail & Related papers (2023-09-27T18:00:09Z)
- UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight interpolation.
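The general idea of weight-space merging can be sketched, assuming two finetuned checkpoints with identical architectures, as a parameter-wise linear interpolation (an illustration of the concept, not UnIVAL's released code).

```python
import torch

def interpolate_checkpoints(sd_a: dict, sd_b: dict, lam: float = 0.5) -> dict:
    """Parameter-wise interpolation of two state dicts with identical keys
    (generic weight-merging sketch, assuming floating-point parameters)."""
    assert sd_a.keys() == sd_b.keys()
    return {k: (1.0 - lam) * sd_a[k] + lam * sd_b[k] for k in sd_a}

# Hypothetical usage with two finetuned copies of the same architecture:
# merged = interpolate_checkpoints(vqa_model.state_dict(),
#                                  caption_model.state_dict(), lam=0.5)
# base_model.load_state_dict(merged)
```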
arXiv Detail & Related papers (2023-07-30T09:48:36Z)
- Effective Adaptation in Multi-Task Co-Training for Unified Autonomous Driving [103.745551954983]
In this paper, we investigate the transfer performance of various types of self-supervised methods, including MoCo and SimCLR, on three downstream tasks.
We find that their performance is sub-optimal or even lags far behind the single-task baseline.
We propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training.
arXiv Detail & Related papers (2022-09-19T12:15:31Z)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
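As background for the Mixture-of-Experts structure, here is a generic top-1 routed expert feed-forward layer (an illustrative sketch, not MoEBERT's importance-guided adaptation): capacity grows with the number of experts while each token still passes through only one of them.

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Generic top-1 mixture-of-experts feed-forward layer (sketch only)."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.size(-1))              # (n_tokens, d_model)
        gates = self.router(tokens).softmax(dim=-1)     # (n_tokens, n_experts)
        top_gate, top_idx = gates.max(dim=-1)           # top-1 routing
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                out[mask] = top_gate[mask].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)
```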
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese [33.83704598544326]
Mengzi stands for a family of discriminative, generative, domain-specific, and multimodal pre-trained model variants.
Compared with public Chinese PLMs, Mengzi is simple but more powerful.
Our lightweight model has achieved new state-of-the-art results on the widely-used CLUE benchmark.
arXiv Detail & Related papers (2021-10-13T13:14:32Z)