Transformer verbatim in-context retrieval across time and scale
- URL: http://arxiv.org/abs/2411.07075v1
- Date: Mon, 11 Nov 2024 15:50:01 GMT
- Title: Transformer verbatim in-context retrieval across time and scale
- Authors: Kristijan Armeni, Marko Pranjić, Senja Pollak
- Abstract summary: In some cases, language models must retrieve in-context information verbatim.
We show that verbatim in-context retrieval developed in a sudden transition early in the training process.
We find that the development of verbatim in-context retrieval is positively correlated with the learning of zero-shot benchmarks.
- Score: 2.7941582470640784
- Abstract: To predict upcoming text, language models must in some cases retrieve in-context information verbatim. In this report, we investigated how the ability of language models to retrieve arbitrary in-context nouns developed during training (across time) and as language models trained on the same dataset increase in size (across scale). We then asked whether learning of in-context retrieval correlates with learning of more challenging zero-shot benchmarks. Furthermore, inspired by semantic effects in human short-term memory, we evaluated retrieval with respect to a major semantic component of the target nouns, namely whether they denote a concrete or an abstract entity, as rated by humans. We show that verbatim in-context retrieval developed in a sudden transition early in the training process, after about 1% of the training tokens. This was observed across model sizes (from 14M up to 12B parameters), with the transition occurring slightly later for the two smallest models. We further found that the development of verbatim in-context retrieval is positively correlated with the learning of zero-shot benchmarks. Around the transition point, all models showed an advantage for retrieving concrete nouns over abstract nouns. In all but the two smallest models, this advantage dissipated toward the end of training.
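The abstract describes the core measurement: give the model a prompt containing a list of nouns, cue it to reproduce one of them verbatim, and track how strongly the model predicts the correct noun across training checkpoints, model sizes, and concrete vs. abstract targets. The sketch below is a minimal illustration of such a probe using Hugging Face `transformers`; the checkpoint name, prompt wording, and scoring choice (log-probability of the target's first token) are assumptions for illustration, not the authors' exact setup.

```python
# Minimal sketch, not the authors' exact protocol: probe verbatim in-context
# retrieval by scoring how much probability a causal LM assigns to a noun
# that already appeared earlier in the prompt. The checkpoint name, prompt
# wording, and noun list are illustrative assumptions; Pythia is used only
# as an example of a public suite spanning ~14M to 12B parameters trained
# on the same data, with intermediate training checkpoints exposed via the
# `revision` argument (e.g. revision="step1000") for the "across time" axis.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def target_logprob(prompt: str, target: str) -> float:
    """Log-probability of the first token of `target` given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_id = tokenizer(" " + target, add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        next_token_logits = model(prompt_ids).logits[0, -1]
    return torch.log_softmax(next_token_logits, dim=-1)[target_id].item()

# A noun list followed by a cue that should trigger verbatim retrieval of the
# final list item; comparing concrete vs. abstract targets (e.g. "ocean" vs.
# "justice") gives one way to look at the semantic effect described above.
nouns = ["table", "ocean", "justice"]
prompt = (
    "Mary wrote down a short list of words: " + ", ".join(nouns) + ". "
    "A moment later she read the list back: " + ", ".join(nouns[:-1]) + ","
)
print(target_logprob(prompt, nouns[-1]))  # higher = better verbatim retrieval
```

Running the same probe over a suite's size variants and intermediate checkpoints would correspond to the "across scale" and "across time" axes described in the abstract.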
Related papers
- Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting [15.69952375347308]
Language models have the ability to perform in-context learning (ICL), allowing them to flexibly adapt their behavior based on context.
We study structural in-context algorithms in a simple part-of-speech setting using both practical and toy models.
We find that active forgetting, a technique that was recently introduced to help models generalize to new languages, forces models to adopt structural in-context learning solutions.
arXiv Detail & Related papers (2024-05-28T21:38:20Z) - Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models [10.482557806309174]
We introduce LexBench, a comprehensive evaluation suite designed to test language models (LMs) on semantic phrase processing tasks.
Using this benchmark, we assess the performance of 15 LMs across model architectures and parameter scales on classification, extraction, and interpretation tasks.
Our benchmarking findings can inform future research aiming to improve the generic capability of LMs on semantic phrase comprehension.
arXiv Detail & Related papers (2024-05-05T09:20:38Z) - Vocabulary-Defined Semantics: Latent Space Clustering for Improving In-Context Learning [32.178931149612644]
In-context learning enables language models to adapt to downstream data or new tasks using a few samples as demonstrations within the prompt.
However, the performance of in-context learning can be unstable depending on the quality, format, or order of demonstrations.
We propose a novel approach, "vocabulary-defined semantics".
arXiv Detail & Related papers (2024-01-29T14:29:48Z) - Making Retrieval-Augmented Language Models Robust to Irrelevant Context [55.564789967211844]
An important desideratum of retrieval-augmented language models (RALMs) is that retrieved information should improve model performance when it is relevant.
Recent work has shown that retrieval augmentation can sometimes have a negative effect on performance.
arXiv Detail & Related papers (2023-10-02T18:52:35Z) - EXnet: Efficient In-context Learning for Data-less Text classification [0.0]
We present EXnet, a model specifically designed to perform in-context learning without limitations on the number of examples.
We argue that in-context learning is an effective method to increase task accuracy, and providing examples facilitates cross-task generalization.
With extensive experiments, we show that even our smallest model (15M parameters) generalizes to several unseen classification tasks and domains.
arXiv Detail & Related papers (2023-05-24T01:40:57Z) - Pre-Training to Learn in Context [138.0745138788142]
The in-context learning ability of language models is not fully exploited because they are not explicitly trained to learn in context.
We propose PICL (Pre-training for In-Context Learning), a framework to enhance the language models' in-context learning ability.
Our experiments show that PICL is more effective and more task-generalizable than a range of baselines, outperforming larger language models with nearly 4x more parameters.
arXiv Detail & Related papers (2023-05-16T03:38:06Z) - On the Effect of Pretraining Corpora on In-context Learning by a Large-scale Language Model [56.82120834538467]
We investigate the effects of the source and size of the pretraining corpus on in-context learning in a Korean-centric GPT-3 model.
We find that in-context learning performance heavily depends on the corpus domain source, and the size of the pretraining corpus does not necessarily determine the emergence of in-context learning.
arXiv Detail & Related papers (2022-04-28T13:59:54Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce IGLUE, the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z) - An Explanation of In-context Learning as Implicit Bayesian Inference [117.19809377740188]
We study the role of the pretraining distribution on the emergence of in-context learning.
We prove that in-context learning occurs implicitly via Bayesian inference of the latent concept.
We empirically find that scaling model size improves in-context accuracy even when the pretraining loss is the same.
arXiv Detail & Related papers (2021-11-03T09:12:33Z) - Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have received significant attention as a rich source of information.
Their inherent characteristics, such as their informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)