Genomic Next-Token Predictors are In-Context Learners
- URL: http://arxiv.org/abs/2511.12797v2
- Date: Fri, 21 Nov 2025 02:11:05 GMT
- Title: Genomic Next-Token Predictors are In-Context Learners
- Authors: Nathan Breslow, Aayush Mishra, Mahler Revsine, Michael C. Schatz, Anqi Liu, Daniel Khashabi
- Abstract summary: In-context learning (ICL) has been extensively studied in large language models trained for next-token prediction on human text. This raises a fundamental question: can ICL arise organically in other sequence domains purely through large-scale predictive training? We show that genomic models exhibit log-linear gains in pattern induction as the number of in-context demonstrations increases.
- Score: 34.25770424888426
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In-context learning (ICL) -- the capacity of a model to infer and apply abstract patterns from examples provided within its input -- has been extensively studied in large language models trained for next-token prediction on human text. In fact, prior work often attributes this emergent behavior to distinctive statistical properties in human language. This raises a fundamental question: can ICL arise organically in other sequence domains purely through large-scale predictive training? To explore this, we turn to genomic sequences, an alternative symbolic domain rich in statistical structure. Specifically, we study the Evo2 genomic model, trained predominantly on next-nucleotide (A/T/C/G) prediction, at a scale comparable to mid-sized LLMs. We develop a controlled experimental framework comprising symbolic reasoning tasks instantiated in both linguistic and genomic forms, enabling direct comparison of ICL across genomic and linguistic models. Our results show that genomic models, like their linguistic counterparts, exhibit log-linear gains in pattern induction as the number of in-context demonstrations increases. To the best of our knowledge, this is the first evidence of organically emergent ICL in genomic sequences, supporting the hypothesis that ICL arises as a consequence of large-scale predictive modeling over rich data. These findings extend emergent meta-learning beyond language, pointing toward a unified, modality-agnostic view of in-context learning.
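The abstract describes the experimental framework only at a high level; the toy probe below illustrates one way such a pattern-induction test could be set up. It is a minimal sketch under stated assumptions, not the authors' released framework: the `>`/`|` separators, the complement-style rule, and the `next_token_logprobs` callback are hypothetical placeholders that a real run would replace with the actual task design and model scoring interface (e.g. Evo2 for the genomic form, an LLM for the linguistic one).

```python
# Minimal, illustrative sketch (not the authors' code) of an in-context
# pattern-induction probe: build k demonstrations of a fixed symbol-mapping
# rule over nucleotides, query a next-token model, and track accuracy vs. k.
# The model interface `next_token_logprobs` is an assumed placeholder.

import math
import random
from typing import Callable, Dict, Tuple

ALPHABET = ["A", "T", "C", "G"]  # genomic instantiation; use words/letters for the linguistic form


def make_demonstration(rule: Dict[str, str], length: int = 4) -> str:
    """One 'source>target' demonstration under a fixed per-symbol mapping rule."""
    src = "".join(random.choice(ALPHABET) for _ in range(length))
    tgt = "".join(rule[c] for c in src)
    return f"{src}>{tgt}"


def build_prompt(rule: Dict[str, str], k: int, length: int = 4) -> Tuple[str, str]:
    """Concatenate k demonstrations plus one unanswered query; return (prompt, answer)."""
    demos = [make_demonstration(rule, length) for _ in range(k)]
    query = "".join(random.choice(ALPHABET) for _ in range(length))
    answer = "".join(rule[c] for c in query)
    return "|".join(demos + [f"{query}>"]), answer


def accuracy_at_k(next_token_logprobs: Callable[[str], Dict[str, float]],
                  k: int, trials: int = 100) -> float:
    """Fraction of queries whose first answer symbol is the model's top continuation."""
    rule = dict(zip(ALPHABET, ["T", "A", "G", "C"]))  # toy complement-style rule
    correct = 0
    for _ in range(trials):
        prompt, answer = build_prompt(rule, k)
        scores = next_token_logprobs(prompt)
        correct += max(scores, key=scores.get) == answer[0]
    return correct / trials


if __name__ == "__main__":
    # Stub scorer so the sketch runs end to end; replace with real model calls.
    def random_scorer(prompt: str) -> Dict[str, float]:
        return {c: random.random() for c in ALPHABET}

    for k in (1, 2, 4, 8, 16, 32):
        acc = accuracy_at_k(random_scorer, k)
        # The paper's headline finding is roughly log-linear growth, acc ~ a + b*log(k).
        print(f"k={k:2d}  log2(k)={math.log2(k):.1f}  accuracy={acc:.2f}")
```

With a real scorer plugged in, plotting accuracy against log k is what would surface the log-linear trend the abstract reports; the stub scorer here only demonstrates the plumbing.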
Related papers
- Context-level Language Modeling by Learning Predictive Context Embeddings [79.00607069677393]
We introduce ContextLM, a framework that augments standard pretraining with an inherent next-context prediction objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Experiments on the GPT2 and Pythia model families, scaled up to 1.5B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance.
arXiv Detail & Related papers (2025-10-23T07:09:45Z) - Modeling cognitive processes of natural reading with transformer-based Language Models [2.048226951354646]
Previous research has shown that models such as N-grams and LSTM networks can partially account for predictability effects in explaining eye movement behaviors. In this study, we extend these findings by evaluating transformer-based models (GPT2, LLaMA-7B, and LLaMA2-7B) to further investigate this relationship. Our results indicate that these architectures outperform earlier models in explaining the variance in Gaze Durations recorded from Rioplatense Spanish readers.
arXiv Detail & Related papers (2025-05-16T17:47:58Z) - Towards Auto-Regressive Next-Token Prediction: In-Context Learning Emerges from Generalization [26.9153121765435]
Large language models (LLMs) have demonstrated remarkable in-context learning abilities. This paper investigates how ICL emerges and the impact of the pre-training phase on ICL. Our theory is supported by experiments on numerical linear dynamic systems, synthetic GINC and real-world language datasets.
arXiv Detail & Related papers (2025-02-24T10:26:29Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - Long-range gene expression prediction with token alignment of large language model [37.10820914895689]
We introduce Genetic sequence Token Alignment (GTA), which aligns genetic sequence features with natural language tokens.
GTA learns the regulatory grammar and allows us to further incorporate gene-specific human annotations as prompts.
GTA represents a powerful and novel cross-modal approach to gene expression prediction by utilizing a pretrained language model.
arXiv Detail & Related papers (2024-10-02T02:42:29Z) - VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z) - SINC: Self-Supervised In-Context Learning for Vision-Language Tasks [64.44336003123102]
We propose a framework to enable in-context learning in large language models.
A meta-model can learn on self-supervised prompts consisting of tailored demonstrations.
Experiments show that SINC outperforms gradient-based methods in various vision-language tasks.
arXiv Detail & Related papers (2023-07-15T08:33:08Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - Modeling structure-building in the brain with CCG parsing and large language models [9.17816011606258]
Combinatory Categorial Grammars (CCGs) are sufficiently expressive, directly compositional models of grammar.
We evaluate whether a more expressive CCG provides a better model than a context-free grammar for human neural signals collected with fMRI.
arXiv Detail & Related papers (2022-10-28T14:21:29Z) - A comprehensive comparative evaluation and analysis of Distributional Semantic Models [61.41800660636555]
We perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
The results show that the alleged superiority of predict-based models is more apparent than real, and surely not ubiquitous.
We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
arXiv Detail & Related papers (2021-05-20T15:18:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.