SINC: Self-Supervised In-Context Learning for Vision-Language Tasks
- URL: http://arxiv.org/abs/2307.07742v2
- Date: Sat, 19 Aug 2023 08:27:16 GMT
- Title: SINC: Self-Supervised In-Context Learning for Vision-Language Tasks
- Authors: Yi-Syuan Chen, Yun-Zhu Song, Cheng Yu Yeo, Bei Liu, Jianlong Fu,
Hong-Han Shuai
- Abstract summary: We propose a framework that enables in-context learning without relying on the intrinsic in-context ability of large language models.
A meta-model learns from self-supervised prompts consisting of tailored demonstrations.
Experiments show that SINC outperforms gradient-based methods in various vision-language tasks.
- Score: 64.44336003123102
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Pre-trained Transformers exhibit an intriguing capacity for in-context
learning. Without gradient updates, these models can rapidly construct new
predictors from demonstrations presented in the inputs. Recent works promote
this ability in the vision-language domain by incorporating visual information
into large language models that can already make in-context predictions.
However, these methods could inherit issues in the language domain, such as
template sensitivity and hallucination. Also, the scale of these language
models raises a significant demand for computations, making learning and
operating these models resource-intensive. To this end, we raise a question:
"How can we enable in-context learning without relying on the intrinsic
in-context ability of large language models?". To answer it, we propose a
succinct and general framework, Self-supervised IN-Context learning (SINC),
that introduces a meta-model to learn on self-supervised prompts consisting of
tailored demonstrations. The learned models can be transferred to downstream
tasks for making in-context predictions on-the-fly. Extensive experiments show
that SINC outperforms gradient-based methods in various vision-language tasks
under few-shot settings. Furthermore, the designs of SINC help us investigate
the benefits of in-context learning across different tasks, and the analysis
further reveals the essential components for the emergence of in-context
learning in the vision-language domain.
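Below is a minimal, self-contained sketch of the general idea described in the abstract above: a small meta-model reads a prompt built from demonstration embeddings plus a query embedding and predicts the query's label. All module choices, dimensions, and names are illustrative assumptions, not the SINC architecture itself.
```python
import torch
import torch.nn as nn

class MetaModel(nn.Module):
    """Toy meta-model: a small transformer that reads a prompt of
    (demonstration, label) embeddings followed by a query embedding
    and predicts the label at the query position."""
    def __init__(self, dim=64, n_labels=10, n_layers=2, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(dim, n_labels)

    def forward(self, prompt):              # prompt: (batch, seq_len, dim)
        h = self.encoder(prompt)
        return self.head(h[:, -1])          # read the prediction off the query slot

def build_prompt(demo_feats, demo_labels, query_feat, label_emb):
    """Interleave demonstration features with their label embeddings and
    append the query feature (a stand-in for a self-supervised prompt of
    tailored demonstrations; the real construction is task-specific)."""
    pieces = []
    for x, y in zip(demo_feats, demo_labels):
        pieces.extend([x, label_emb(y)])
    pieces.append(query_feat)
    return torch.stack(pieces).unsqueeze(0)  # (1, 2k + 1, dim)

dim, n_labels = 64, 10
label_emb = nn.Embedding(n_labels, dim)
demos = [torch.randn(dim) for _ in range(4)]            # stand-ins for pre-extracted features
labels = [torch.tensor(i % n_labels) for i in range(4)]
query = torch.randn(dim)

model = MetaModel(dim, n_labels)
logits = model(build_prompt(demos, labels, query, label_emb))
print(logits.shape)                                      # torch.Size([1, 10])
```
In this toy setup the meta-model is trained on many such prompts and then applied to new demonstration sets at inference time without gradient updates, which is the in-context behavior the abstract describes.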
Related papers
- Can Large Language Models Understand Context? [17.196362853457412]
This paper introduces a context understanding benchmark by adapting existing datasets to suit the evaluation of generative models.
Experimental results indicate that pre-trained dense models struggle with understanding more nuanced contextual features when compared to state-of-the-art fine-tuned models.
As LLM compression holds growing significance in both research and real-world applications, we assess the context understanding of quantized models under in-context-learning settings.
arXiv Detail & Related papers (2024-02-01T18:55:29Z)
- Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts [83.03471704115786]
We introduce improved Prompt Diffusion (iPromptDiff) in this study.
iPromptDiff integrates an end-to-end trained vision encoder that converts visual context into an embedding vector.
We show that a diffusion-based vision foundation model, when equipped with this visual context-modulated text guidance and a standard ControlNet structure, exhibits versatility and robustness across a variety of training tasks.
arXiv Detail & Related papers (2023-12-03T14:15:52Z)
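The toy sketch below only illustrates the idea from the entry above of converting a visual context image into a single embedding token that is concatenated with the text conditioning; the encoder, dimensions, and injection point are assumptions, not the iPromptDiff or ControlNet implementation.
```python
import torch
import torch.nn as nn

class VisualContextEncoder(nn.Module):
    """Tiny stand-in for an end-to-end trained vision encoder that maps a
    visual context image to one embedding vector."""
    def __init__(self, dim=768):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim),
        )

    def forward(self, img):                  # img: (B, 3, H, W)
        return self.backbone(img)            # (B, dim)

encoder = VisualContextEncoder()
context_img = torch.randn(1, 3, 256, 256)
text_emb = torch.randn(1, 77, 768)                        # stand-in for text-encoder outputs
vis_token = encoder(context_img).unsqueeze(1)             # (1, 1, 768)
conditioning = torch.cat([vis_token, text_emb], dim=1)    # visual token modulates text guidance
print(conditioning.shape)                                 # torch.Size([1, 78, 768])
```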
- RAVEN: In-Context Learning with Retrieval-Augmented Encoder-Decoder Language Models [57.12888828853409]
RAVEN is a model that combines retrieval-augmented masked language modeling and prefix language modeling.
Fusion-in-Context Learning enables the model to leverage more in-context examples without requiring additional training.
Our work underscores the potential of retrieval-augmented encoder-decoder language models for in-context learning.
arXiv Detail & Related papers (2023-08-15T17:59:18Z)
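A rough sketch of the Fusion-in-Context Learning idea summarized in the entry above: groups of in-context examples are encoded separately, their encodings are concatenated, and a single decoder pass attends over all of them, so more examples can be used without retraining. The modules and shapes are toy stand-ins, not the RAVEN model.
```python
import torch
import torch.nn as nn

dim, n_heads = 64, 4
enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=2)

# three groups of in-context examples, each already embedded to (1, seq, dim)
groups = [torch.randn(1, 8, dim) for _ in range(3)]
fused_memory = torch.cat([encoder(g) for g in groups], dim=1)   # (1, 24, dim)

query = torch.randn(1, 5, dim)                  # embedded target input
out = decoder(tgt=query, memory=fused_memory)   # decoder attends over all fused encodings
print(out.shape)                                # torch.Size([1, 5, 64])
```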
- Fine-Tune Language Models as Multi-Modal Differential Equation Solvers [14.181842691371935]
We present a transformation of in-context operator learning into a multi-modal paradigm.
In particular, we take inspiration from the recent success of large language models, and propose using "captions" to integrate human knowledge about the operator.
arXiv Detail & Related papers (2023-08-09T16:44:25Z)
- Pre-Training to Learn in Context [138.0745138788142]
The in-context learning ability of language models is not fully exploited because they are not explicitly trained to learn in context.
We propose PICL (Pre-training for In-Context Learning), a framework to enhance the language models' in-context learning ability.
Our experiments show that PICL is more effective and task-generalizable than a range of baselines, outperforming larger language models with nearly 4x as many parameters.
arXiv Detail & Related papers (2023-05-16T03:38:06Z)
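A very small sketch of the PICL-style idea described in the entry above, building pre-training instances by concatenating plain-text examples that share the same intrinsic task; the sampling, example texts, and formatting are illustrative assumptions, not the paper's retrieval pipeline.
```python
import random

def build_pretraining_instance(task_examples, k=3, sep="\n"):
    """Sample k demonstrations plus one target example from the same
    intrinsic task and concatenate them into one training sequence."""
    sampled = random.sample(task_examples, k + 1)
    *context, target = sampled
    return sep.join(context) + sep + target

paraphrase_like = [
    "A: the movie was great. B: the film was excellent.",
    "A: he arrived late. B: he was not on time.",
    "A: the road is closed. B: traffic cannot pass.",
    "A: she enjoys tea. B: she likes drinking tea.",
    "A: prices went up. B: prices increased.",
]
print(build_pretraining_instance(paraphrase_like))
```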
- The Learnability of In-Context Learning [16.182561312622315]
We propose a first-of-its-kind PAC-based framework for in-context learnability.
Our framework includes an initial pretraining phase, which fits a function to the pretraining distribution.
We show that in-context learning is more about identifying the task than about learning it.
arXiv Detail & Related papers (2023-03-14T13:28:39Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- AttViz: Online exploration of self-attention for transparent neural language modeling [7.574392147428978]
We propose AttViz, an online toolkit for exploring self-attention, i.e., the real values associated with individual text tokens.
We show how existing deep learning pipelines can produce outputs suitable for AttViz, offering novel online visualizations of the attention heads and their aggregations with minimal effort.
arXiv Detail & Related papers (2020-05-12T12:21:40Z)
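To make the per-token self-attention values mentioned in the entry above concrete, here is a small sketch of extracting and aggregating attention weights of the kind AttViz visualizes; it assumes the Hugging Face transformers library and bert-base-uncased, and does not reproduce the toolkit's own export format.
```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("Self-attention assigns a weight to every token.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions holds one tensor per layer with shape (batch, heads, seq, seq)
last_layer = outputs.attentions[-1][0]           # (heads, seq, seq)
per_token = last_layer.mean(dim=0).sum(dim=0)    # attention mass received by each token
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for tok, score in zip(tokens, per_token.tolist()):
    print(f"{tok:>15s}  {score:.3f}")
```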
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.