In-Context Learning can distort the relationship between sequence likelihoods and biological fitness
- URL: http://arxiv.org/abs/2504.17068v1
- Date: Wed, 23 Apr 2025 19:30:01 GMT
- Title: In-Context Learning can distort the relationship between sequence likelihoods and biological fitness
- Authors: Pranav Kantroo, Günter P. Wagner, Benjamin B. Machta
- Abstract summary: We show that in-context learning can distort the relationship between fitness and likelihood scores of sequences. This phenomenon manifests as anomalously high likelihood scores for sequences that contain repeated motifs.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models have emerged as powerful predictors of the viability of biological sequences. During training these models learn the rules of the grammar obeyed by sequences of amino acids or nucleotides. Once trained, these models can take a sequence as input and produce a likelihood score as an output; a higher likelihood implies adherence to the learned grammar and correlates with experimental fitness measurements. Here we show that in-context learning can distort the relationship between fitness and likelihood scores of sequences. This phenomenon most prominently manifests as anomalously high likelihood scores for sequences that contain repeated motifs. We use protein language models with different architectures trained on the masked language modeling objective for our experiments, and find transformer-based models to be particularly vulnerable to this effect. This behavior is mediated by a look-up operation where the model seeks the identity of the masked position by using the other copy of the repeated motif as a reference. This retrieval behavior can override the model's learned priors. This phenomenon persists for imperfectly repeated sequences, and extends to other kinds of biologically relevant features such as reversed complement motifs in RNA sequences that fold into hairpin structures.
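The masked-language-model scoring described in the abstract can be made concrete with a short sketch. The example below is an illustration under stated assumptions, not the paper's actual pipeline: it uses the small ESM-2 checkpoint facebook/esm2_t6_8M_UR50D via the Hugging Face transformers library (the paper evaluates several masked protein language models), computes a masked-marginal pseudo-log-likelihood, and compares a hypothetical tandem-repeat sequence against a reversed-copy control.

```python
# Minimal sketch (not the paper's code): pseudo-log-likelihood scoring with a
# masked protein language model, assuming the Hugging Face `transformers`
# library and the small ESM-2 checkpoint facebook/esm2_t6_8M_UR50D.
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmForMaskedLM.from_pretrained("facebook/esm2_t6_8M_UR50D")
model.eval()

def pseudo_log_likelihood(seq: str) -> float:
    """Mask each residue in turn and sum the log-probability the model
    assigns to the true residue at the masked position."""
    ids = tokenizer(seq, return_tensors="pt")["input_ids"]
    total = 0.0
    for i in range(1, ids.shape[1] - 1):  # skip the BOS/EOS special tokens
        masked = ids.clone()
        masked[0, i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked).logits
        total += torch.log_softmax(logits[0, i], dim=-1)[ids[0, i]].item()
    return total

# Hypothetical motif: per the abstract, the tandem repeat should score
# anomalously high, because the model can look up each masked residue in the
# other copy of the motif instead of relying on its learned priors.
motif = "MKTAYIAKQR"
print("repeat :", pseudo_log_likelihood(motif + motif))
print("control:", pseudo_log_likelihood(motif + motif[::-1]))
```

If the look-up effect described above holds, the repeat should score markedly higher than the control; the size of the gap will depend on the checkpoint and the motif chosen.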
Related papers
- On the importance of structural identifiability for machine learning with partially observed dynamical systems [0.7864304771129751]
We use structural identifiability analysis to explicitly relate parameter configurations that are associated with identical system outputs. Our results demonstrate the importance of accounting for structural identifiability, a topic that has received relatively little attention from the machine learning community.
arXiv Detail & Related papers (2025-02-06T15:06:52Z)
- From Loops to Oops: Fallback Behaviors of Language Models Under Uncertainty [67.81977289444677]
Large language models (LLMs) often exhibit undesirable behaviors, such as hallucinations and sequence repetitions. We categorize fallback behaviors - sequence repetitions, degenerate text, and hallucinations - and extensively analyze them. Our experiments reveal a clear and consistent ordering of fallback behaviors across all these axes.
arXiv Detail & Related papers (2024-07-08T16:13:42Z)
- Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon [22.271015657198927]
We break memorization down into a taxonomy: recitation of highly duplicated sequences, reconstruction of inherently predictable sequences, and recollection of sequences that are neither.
By analyzing dependencies and inspecting the weights of a predictive model, we find that different factors influence the likelihood of memorization differently depending on the taxonomic category.
arXiv Detail & Related papers (2024-06-25T17:32:16Z)
- SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking [60.109453252858806]
A maximum-likelihood estimation (MLE) objective does not match the downstream use case of autoregressively generating high-quality sequences.
We formulate sequence generation as an imitation learning (IL) problem.
This allows us to minimize a variety of divergences between the distribution of sequences generated by an autoregressive model and sequences from a dataset.
Our resulting method, SequenceMatch, can be implemented without adversarial training or architectural changes.
arXiv Detail & Related papers (2023-06-08T17:59:58Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Learning to Reason With Relational Abstractions [65.89553417442049]
We study how to build stronger reasoning capability in language models using the idea of relational abstractions.
We find that models supplied with such sequences as prompts solve tasks with significantly higher accuracy.
arXiv Detail & Related papers (2022-10-06T00:27:50Z)
- Reprogramming Pretrained Language Models for Antibody Sequence Infilling [72.13295049594585]
Computational design of antibodies involves generating novel and diverse sequences, while maintaining structural consistency.
Recent deep learning models have shown impressive results; however, the limited number of known antibody sequence/structure pairs frequently leads to degraded performance.
In our work we address this challenge by leveraging Model Reprogramming (MR), which repurposes a model pretrained on a source language for tasks in a different language where data are scarce.
arXiv Detail & Related papers (2022-10-05T20:44:55Z)
- Can a Transformer Pass the Wug Test? Tuning Copying Bias in Neural Morphological Inflection Models [9.95909045828344]
We show that, to be more effective, the hallucination process needs to pay attention to syllable-like length rather than individual characters or stems.
We report a significant performance improvement with our hallucination model over previous data hallucination methods when training and test data do not overlap in their lemmata.
arXiv Detail & Related papers (2021-04-13T19:51:21Z)
- Interpretable Structured Learning with Sparse Gated Sequence Encoder for Protein-Protein Interaction Prediction [2.9488233765621295]
Predicting protein-protein interactions (PPIs) by learning informative representations from amino acid sequences is a challenging yet important problem in biology.
We present a novel deep framework to model and predict PPIs from sequence alone.
Our model incorporates a bidirectional gated recurrent unit to learn sequence representations, leveraging both contextual and sequential information.
arXiv Detail & Related papers (2020-10-16T17:13:32Z)
- Do Neural Models Learn Systematicity of Monotonicity Inference in Natural Language? [41.649440404203595]
We introduce a method for evaluating whether neural models can learn systematicity of monotonicity inference in natural language.
We consider four aspects of monotonicity inferences and test whether the models can systematically interpret lexical and logical phenomena on different training/test splits.
arXiv Detail & Related papers (2020-04-30T14:48:39Z)
- Consistency of a Recurrent Language Model With Respect to Incomplete Decoding [67.54760086239514]
We study the issue of recurrent language models producing infinite-length sequences under incomplete decoding.
We propose two remedies which address inconsistency: consistent variants of top-k and nucleus sampling, and a self-terminating recurrent language model.
arXiv Detail & Related papers (2020-02-06T19:56:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.