MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models
- URL: http://arxiv.org/abs/2306.01311v1
- Date: Fri, 2 Jun 2023 07:21:03 GMT
- Title: MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models
- Authors: Masoud Monajatipoor, Liunian Harold Li, Mozhdeh Rouhsedaghat, Lin F. Yang, Kai-Wei Chang
- Abstract summary: In the vision-language domain, most large-scale pre-trained vision-language models do not possess the ability to conduct in-context learning.
In this paper, we study an interesting hypothesis: can we transfer the in-context learning ability from the language domain to the vision domain?
- Score: 74.89629463600978
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale language models have shown the ability to adapt to a new task via conditioning on a few demonstrations (i.e., in-context learning). However, in the vision-language domain, most large-scale pre-trained vision-language (VL) models do not possess the ability to conduct in-context learning. How can we enable in-context learning for VL models? In this paper, we study an interesting hypothesis: can we transfer the in-context learning ability from the language domain to the VL domain? Specifically, we first meta-train a language model to perform in-context learning on NLP tasks (as in MetaICL); then we transfer this model to perform VL tasks by attaching a visual encoder. Our experiments suggest that in-context learning ability can indeed be transferred across modalities: our model considerably improves in-context learning capability on VL tasks and can even significantly compensate for model size. On VQA, OK-VQA, and GQA, our method can outperform the baseline model while having 20 times fewer parameters.
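The abstract describes the recipe only at a high level: meta-train a language model for in-context learning (as in MetaICL), then attach a visual encoder so VL tasks can be given as in-context demonstrations. Below is a minimal, hypothetical sketch of that idea in Python/PyTorch. The class and function names are illustrative assumptions, not the authors' code, and the exact way MetaVL couples the visual encoder to the language model may differ (here image features are simply projected into the language model's embedding space and the textual demonstrations are concatenated into a prompt).

```python
# Minimal sketch of the MetaVL-style setup (hypothetical names, not the authors' code):
# a meta-trained language model would consume visual "prefix" tokens produced by a
# projection of image features, followed by k textual VQA demonstrations and a query.

import torch
import torch.nn as nn


class VisualPrefix(nn.Module):
    """Projects image features into the language model's token-embedding space."""

    def __init__(self, image_dim: int, lm_dim: int, prefix_len: int = 4):
        super().__init__()
        self.prefix_len = prefix_len
        self.proj = nn.Linear(image_dim, lm_dim * prefix_len)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, image_dim) -> (batch, prefix_len, lm_dim)
        out = self.proj(image_feats)
        return out.view(image_feats.size(0), self.prefix_len, -1)


def build_icl_prompt(demos, query_question):
    """Formats k demonstration (question, answer) pairs plus the query, VQA-style."""
    parts = [f"Question: {q} Answer: {a}" for q, a in demos]
    parts.append(f"Question: {query_question} Answer:")
    return " ".join(parts)


if __name__ == "__main__":
    lm_dim, image_dim = 768, 512
    prefix = VisualPrefix(image_dim, lm_dim)
    fake_image_feats = torch.randn(1, image_dim)   # stand-in for visual-encoder output
    visual_tokens = prefix(fake_image_feats)       # would be prepended to text embeddings
    prompt = build_icl_prompt([("What color is the cat?", "black")],
                              "What is the cat sitting on?")
    print(visual_tokens.shape)  # torch.Size([1, 4, 768])
    print(prompt)
```

In this sketch the language model itself is not shown; the point is only the interface, in which a small projection supplies visual tokens while the meta-trained language model provides the in-context learning behavior.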
Related papers
- TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation [32.406783380729024]
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes.
Current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data.
We introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models.
arXiv Detail & Related papers (2024-09-19T07:10:18Z)
- SINC: Self-Supervised In-Context Learning for Vision-Language Tasks [64.44336003123102]
We propose a framework to enable in-context learning in large language models.
A meta-model can learn on self-supervised prompts consisting of tailored demonstrations.
Experiments show that SINC outperforms gradient-based methods in various vision-language tasks.
arXiv Detail & Related papers (2023-07-15T08:33:08Z)
- In-context Learning Distillation: Transferring Few-shot Learning Ability of Pre-trained Language Models [55.78264509270503]
We introduce in-context learning distillation to transfer in-context few-shot learning ability from large models to smaller models.
We perform in-context learning distillation under two different few-shot learning paradigms: Meta In-context Tuning (Meta-ICT) and Multitask In-context Tuning (Multitask-ICT).
Our experiments and analysis reveal that in-context learning objectives and language modeling objectives are complementary under the Multitask-ICT paradigm.
arXiv Detail & Related papers (2022-12-20T22:11:35Z)
- Teaching Structured Vision&Language Concepts to Vision&Language Models [46.344585368641006]
We introduce the collective notion of Structured Vision&Language Concepts (SVLC).
SVLC includes object attributes, relations, and states which are present in the text and visible in the image.
We propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs.
arXiv Detail & Related papers (2022-11-21T18:54:10Z)
- How to Adapt Pre-trained Vision-and-Language Models to a Text-only Input? [0.13706331473063876]
We focus on pre-trained multimodal vision-and-language (VL) models for which there already are some results on their language understanding capabilities.
An unresolved issue in evaluating the linguistic skills of these models is that there is no established method for adapting them to text-only input without introducing out-of-distribution uncertainty.
Our evaluations on both GLUE and Visual Property Norms (VPN) show that care should be put into adapting VL models to zero-shot text-only tasks, while the models are less sensitive to how we adapt them to non-zero-shot tasks.
arXiv Detail & Related papers (2022-09-19T13:00:12Z)
- GLIPv2: Unifying Localization and Vision-Language Understanding [161.1770269829139]
We present GLIPv2, a grounded VL understanding model that serves both localization tasks and Vision-Language (VL) understanding tasks.
GLIPv2 unifies localization pre-training and Vision-Language Pre-training with three pre-training tasks.
We show that a single GLIPv2 model achieves near SoTA performance on various localization and understanding tasks.
arXiv Detail & Related papers (2022-06-12T20:31:28Z)
- PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models [127.17675443137064]
We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling.
PEVL reformulates discretized object positions and language in a unified language modeling framework.
We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding.
arXiv Detail & Related papers (2022-05-23T10:17:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.