Reprogramming Pretrained Language Models for Antibody Sequence Infilling
- URL: http://arxiv.org/abs/2210.07144v2
- Date: Mon, 19 Jun 2023 21:42:43 GMT
- Title: Reprogramming Pretrained Language Models for Antibody Sequence Infilling
- Authors: Igor Melnyk, Vijil Chenthamarakshan, Pin-Yu Chen, Payel Das, Amit
Dhurandhar, Inkit Padhi, Devleena Das
- Abstract summary: Computational design of antibodies involves generating novel and diverse sequences, while maintaining structural consistency.
Recent deep learning models have shown impressive results; however, the limited number of known antibody sequence/structure pairs frequently leads to degraded performance.
In our work we address this challenge by leveraging Model Reprogramming (MR), which repurposes a model pretrained on a source language for tasks in a different language with scarce data.
- Score: 72.13295049594585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Antibodies comprise the most versatile class of binding molecules, with
numerous applications in biomedicine. Computational design of antibodies
involves generating novel and diverse sequences, while maintaining structural
consistency. Unique to antibodies, designing the complementarity-determining
region (CDR), which determines the antigen binding affinity and specificity,
poses its own challenges. Recent deep learning models have shown
impressive results; however, the limited number of known antibody
sequence/structure pairs frequently leads to degraded performance, in
particular a lack of diversity in the generated sequences. In our work we
address this challenge by leveraging Model Reprogramming (MR), which repurposes
a model pretrained on a source language for tasks that are in a different
language and have scarce data - where it may be difficult to train a
high-performing model from scratch or effectively fine-tune an existing
pre-trained model on the specific task. Specifically, we introduce ReprogBert,
in which a pretrained English language model is repurposed for protein sequence
infilling, thus performing cross-language adaptation with limited data. Results
on antibody design benchmarks show that our model, trained on a low-resource
antibody sequence dataset, generates highly diverse CDR sequences, with up to a
more than two-fold increase in diversity over the baselines, without losing
structural integrity and naturalness. The generated sequences also demonstrate
enhanced antigen binding specificity and virus neutralization ability. Code is
available at https://github.com/IBM/ReprogBERT.
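As a concrete illustration of the reprogramming pattern the abstract describes (a frozen pretrained language model plus trainable input and output vocabulary mappings), the following is a minimal sketch assuming a HuggingFace bert-base-uncased backbone. The class name, amino-acid vocabulary size, and softmax mixing scheme are illustrative assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn
from transformers import BertForMaskedLM

AA_VOCAB = 25  # 20 amino acids + special tokens (PAD, MASK, ...); illustrative size


class ReprogrammedMLM(nn.Module):
    """Frozen English BERT with trainable input/output vocabulary mappings."""

    def __init__(self, backbone="bert-base-uncased", aa_vocab=AA_VOCAB):
        super().__init__()
        self.bert = BertForMaskedLM.from_pretrained(backbone)
        for p in self.bert.parameters():  # keep the pretrained LM frozen
            p.requires_grad = False
        en_vocab = self.bert.config.vocab_size
        # Express each amino-acid token as a mixture of English token embeddings.
        self.theta_in = nn.Parameter(0.01 * torch.randn(aa_vocab, en_vocab))
        # Project English-vocabulary logits back to amino-acid logits.
        self.theta_out = nn.Parameter(0.01 * torch.randn(en_vocab, aa_vocab))

    def forward(self, aa_ids, attention_mask=None):
        word_emb = self.bert.get_input_embeddings().weight  # (V_en, d)
        mix = torch.softmax(self.theta_in, dim=-1)          # rows sum to 1
        inputs_embeds = mix[aa_ids] @ word_emb              # (B, L, d)
        out = self.bert(inputs_embeds=inputs_embeds,
                        attention_mask=attention_mask)
        return out.logits @ self.theta_out                  # (B, L, V_aa)
```

Training would then minimize a masked cross-entropy loss on the infilled CDR positions while updating only theta_in and theta_out, so the data-hungry transformer weights stay fixed and only a small number of parameters must be learned from the scarce antibody data.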
Related papers
- S$^2$ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning [8.059724314850799]
Antibodies safeguard our health through their precise and potent binding to specific antigens, demonstrating promising therapeutic efficacy in the treatment of numerous diseases, including COVID-19.
Recent advancements in biomedical language models have shown great potential to interpret complex biological structures and functions.
This paper proposes the Sequence-Structure multi-level pre-trained antibody Language Model (S$^2$ALM), combining holistic sequential and structural information in one unified, generic antibody foundation model.
arXiv Detail & Related papers (2024-11-20T14:24:26Z) - Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction [88.65168366064061]
We introduce Discrete Denoising Posterior Prediction (DDPP), a novel framework that casts the task of steering pre-trained masked discrete diffusion models (MDMs) as a problem of probabilistic inference.
Our framework leads to a family of three novel objectives that are all simulation-free, and thus scalable.
We substantiate our designs via wet-lab validation, where we observe transient expression of reward-optimized protein sequences.
arXiv Detail & Related papers (2024-10-10T17:18:30Z) - Large scale paired antibody language models [40.401345152825314]
We present IgBert and IgT5, the best-performing antibody-specific language models developed to date.
These models are trained on more than two billion sequences from the Observed Antibody Space dataset.
This advancement marks a significant leap forward in leveraging machine learning, large data sets and high-performance computing for enhancing antibody design for therapeutic development.
arXiv Detail & Related papers (2024-03-26T17:21:54Z) - Decoupled Sequence and Structure Generation for Realistic Antibody Design [45.72237864940556]
We propose an antibody sequence-structure decoupling (ASSD) framework, which separates sequence generation and structure prediction.
We also find that the widely used non-autoregressive generators tend to produce sequences with overly repetitive tokens.
Our results demonstrate that ASSD consistently outperforms existing antibody design models.
arXiv Detail & Related papers (2024-02-08T13:02:05Z) - SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking [60.109453252858806]
A maximum-likelihood estimation (MLE) objective does not match a downstream use-case of autoregressively generating high-quality sequences.
We formulate sequence generation as an imitation learning (IL) problem.
This allows us to minimize a variety of divergences between the distribution of sequences generated by an autoregressive model and sequences from a dataset.
Our resulting method, SequenceMatch, can be implemented without adversarial training or architectural changes.
arXiv Detail & Related papers (2023-06-08T17:59:58Z) - Mutual Exclusivity Training and Primitive Augmentation to Induce
Compositionality [84.94877848357896]
Recent datasets expose the lack of systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely-used compositionality datasets.
arXiv Detail & Related papers (2022-11-28T17:36:41Z) - Incorporating Pre-training Paradigm for Antibody Sequence-Structure
Co-design [134.65287929316673]
Deep learning-based computational antibody design has attracted considerable attention, since it automatically mines antibody patterns from data that can complement human expertise.
The computational methods heavily rely on high-quality antibody structure data, which is quite limited.
Fortunately, there exists a large amount of sequence data of antibodies that can help model the CDR and alleviate the reliance on structure data.
arXiv Detail & Related papers (2022-10-26T15:31:36Z) - Diversifying Design of Nucleic Acid Aptamers Using Unsupervised Machine
Learning [54.247560894146105]
Inverse design of short single-stranded RNA and DNA sequences (aptamers) is the task of finding sequences that satisfy a set of desired criteria.
We propose to use an unsupervised machine learning model known as the Potts model to discover new, useful sequences with controllable sequence diversity.
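For context, the Potts model assigns a sequence $s$ the energy $E(s) = -\sum_i h_i(s_i) - \sum_{i<j} J_{ij}(s_i, s_j)$, with $p(s) \propto \exp(-E(s))$; fitting the fields $h$ and couplings $J$ to a sequence family then lets one sample new low-energy, high-probability variants. Below is a minimal sketch of this standard energy function (not necessarily the exact parameterization used in the cited work), with random $h$ and $J$ standing in for parameters that would in practice be fit to an aptamer family:

```python
import numpy as np

def potts_energy(seq, h, J):
    """Potts energy: E(s) = -sum_i h[i, s_i] - sum_{i<j} J[i, j, s_i, s_j].

    seq: integer-encoded sequence of length L
    h:   (L, q) single-site fields
    J:   (L, L, q, q) pairwise couplings (only i < j entries are used)
    """
    L = len(seq)
    e = -sum(h[i, seq[i]] for i in range(L))
    e -= sum(J[i, j, seq[i], seq[j]] for i in range(L) for j in range(i + 1, L))
    return e  # lower energy = more probable under p(s) proportional to exp(-E(s))

# Toy usage with random parameters (purely illustrative):
L, q = 40, 4  # a length-40 nucleotide sequence over a 4-letter alphabet
rng = np.random.default_rng(0)
h = rng.normal(size=(L, q))
J = rng.normal(scale=0.1, size=(L, L, q, q))
seq = rng.integers(0, q, size=L)
print(potts_energy(seq, h, J))
```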
arXiv Detail & Related papers (2022-08-10T13:30:58Z) - Benchmarking deep generative models for diverse antibody sequence design [18.515971640245997]
Deep generative models that learn from sequences alone, or from sequences and structures jointly, have shown impressive performance on protein design.
We consider three recently proposed deep generative frameworks for protein design: (AR) a sequence-based autoregressive generative model, (GVP) a precise structure-based graph neural network, and (Fold2Seq) a model that leverages a fuzzy, scale-free representation of a three-dimensional fold.
We benchmark these models on the task of computational design of antibody sequences, which demands sequences with high diversity for functional implications.
arXiv Detail & Related papers (2021-11-12T16:23:32Z)