AVIDa-hIL6: A Large-Scale VHH Dataset Produced from an Immunized Alpaca
for Predicting Antigen-Antibody Interactions
- URL: http://arxiv.org/abs/2306.03329v2
- Date: Wed, 11 Oct 2023 00:42:26 GMT
- Title: AVIDa-hIL6: A Large-Scale VHH Dataset Produced from an Immunized Alpaca
for Predicting Antigen-Antibody Interactions
- Authors: Hirofumi Tsuruta, Hiroyuki Yamazaki, Ryota Maeda, Ryotaro Tamura,
Jennifer N. Wei, Zelda Mariet, Poomarin Phloyphisut, Hidetoshi Shimokawa,
Joseph R. Ledsam, Lucy Colwell, Akihiro Imura
- Abstract summary: We have developed a large-scale dataset for predicting antigen-antibody interactions in the variable domain of heavy chain of heavy chain antibodies (VHHs).
AVIDa-hIL6 contains 573,891 antigen-VHH pairs with amino acid sequences.
We report experimental benchmark results on AVIDa-hIL6 by using machine learning models.
- Score: 1.1381826108737396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Antibodies have become an important class of therapeutic agents to treat
human diseases. To accelerate therapeutic antibody discovery, computational
methods, especially machine learning, have attracted considerable interest for
predicting specific interactions between antibody candidates and target
antigens such as viruses and bacteria. However, the publicly available datasets
in existing works have notable limitations, such as small sizes and the lack of
non-binding samples and exact amino acid sequences. To overcome these
limitations, we have developed AVIDa-hIL6, a large-scale dataset for predicting
antigen-antibody interactions in the variable domain of heavy chain of heavy
chain antibodies (VHHs), produced from an alpaca immunized with the human
interleukin-6 (IL-6) protein, as antigens. By leveraging the simple structure
of VHHs, which facilitates identification of full-length amino acid sequences
by DNA sequencing technology, AVIDa-hIL6 contains 573,891 antigen-VHH pairs
with amino acid sequences. All the antigen-VHH pairs have reliable labels for
binding or non-binding, as generated by a novel labeling method. Furthermore,
via introduction of artificial mutations, AVIDa-hIL6 contains 30 different
mutants in addition to wild-type IL-6 protein. This characteristic provides
opportunities to develop machine learning models for predicting changes in
antibody binding by antigen mutations. We report experimental benchmark results
on AVIDa-hIL6 by using machine learning models. The results indicate that the
existing models have potential, but further research is needed to generalize
them to predict effective antibodies against unknown mutants. The dataset is
available at https://avida-hil6.cognanous.com.
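The abstract notes that generalizing to unknown mutants remains open. One way to probe this is to hold out whole antigen types when splitting the data. The following is a minimal sketch of such a split, not the authors' benchmark protocol. The record layout (VHH sequence, antigen label, binding label), the function names, and the toy values are all assumptions made for illustration and are not taken from the released dataset files.

```python
from collections import defaultdict

# Hypothetical record layout: (vhh_sequence, antigen_label, binds).
# Field order and the toy values below are illustrative assumptions,
# not the schema of the released dataset.


def split_by_antigen(records, held_out_antigens):
    """Split records so that held-out antigen types never appear in training."""
    held_out = set(held_out_antigens)
    train, test = [], []
    for rec in records:
        (test if rec[1] in held_out else train).append(rec)
    return train, test


def label_counts(records):
    """Count binders and non-binders for each antigen type."""
    counts = defaultdict(lambda: {"binder": 0, "non_binder": 0})
    for _, antigen, binds in records:
        counts[antigen]["binder" if binds else "non_binder"] += 1
    return dict(counts)


# Made-up placeholder sequences and placeholder mutant names.
toy_records = [
    ("QVQLVESGGG", "IL-6_WT", 1),
    ("EVQLVESGGA", "IL-6_WT", 0),
    ("QVQLQESGGG", "mutant_A", 1),
    ("EVQLVQSGGG", "mutant_B", 0),
]

# Hold out one mutant entirely to mimic evaluation on an unseen antigen.
train, test = split_by_antigen(toy_records, held_out_antigens=["mutant_B"])
```

Splitting by antigen type rather than at random keeps every pair involving a held-out mutant out of training, so test performance reflects behavior on an antigen the model has not seen. Checking `label_counts` on each split helps catch held-out sets that contain only one class.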
Related papers
- Llama-Affinity: A Predictive Antibody Antigen Binding Model Integrating Antibody Sequences with Llama3 Backbone Architecture [2.474908349649168]
We present an advanced antibody-antigen binding affinity prediction model (Llamafinity).
The model achieved an accuracy of 0.9640, an F1-score of 0.9643, a precision of 0.9702, a recall of 0.9586, and an AUC-ROC of 0.9936.
This strategy unveiled higher computational efficiency, with a five-fold average cumulative training time of only 0.46 hours.
arXiv Detail & Related papers (2025-05-17T20:10:54Z) - dyAb: Flow Matching for Flexible Antibody Design with AlphaFold-driven Pre-binding Antigen [52.809470467635194]
Development of therapeutic antibodies heavily relies on accurate predictions of how antigens will interact with antibodies.
Existing computational methods in antibody design often overlook crucial conformational changes that antigens undergo during the binding process.
We introduce dyAb, a flexible framework that incorporates AlphaFold2-driven predictions to model pre-binding antigen structures.
arXiv Detail & Related papers (2025-03-01T03:53:18Z) - Leveraging Large Language Models to Predict Antibody Biological Activity Against Influenza A Hemagglutinin [0.15547733154162566]
We develop an AI model for predicting the binding and receptor blocking activity of antibodies against influenza A hemagglutinin (HA) antigens.
Our models achieved an AUROC $\geq$ 0.91 for predicting the activity of existing antibodies against seen HAs and an AUROC of 0.9 for unseen HAs.
arXiv Detail & Related papers (2025-02-02T06:48:45Z) - Relation-Aware Equivariant Graph Networks for Epitope-Unknown Antibody Design and Specificity Optimization [61.06622479173572]
We propose a novel Relation-Aware Design (RAAD) framework, which models antigen-antibody interactions for co-designing sequences and structures of antigen-specific CDRs.
Furthermore, we propose a new evaluation metric to better measure antibody specificity and develop a contrasting specificity-enhancing constraint to optimize the specificity of antibodies.
arXiv Detail & Related papers (2024-12-14T03:00:44Z) - A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models [0.0]
We introduce AVIDa-SARS-CoV-2, a dataset featuring the antigen-variable domain of heavy chain of heavy chain antibody (VHH) interactions.
VHHCorpus-2M, a pre-training dataset for antibody language models, contains over two million VHH sequences.
We report benchmark results for predicting SARS-CoV-2-VHH binding using VHHBERT pre-trained on VHHCorpus-2M and existing general protein and antibody-specific pre-trained language models.
arXiv Detail & Related papers (2024-05-29T04:22:18Z) - Large scale paired antibody language models [40.401345152825314]
We present IgBert and IgT5, the best performing antibody-specific language models developed to date.
These models are trained comprehensively on more than two billion sequences from the Observed Antibody Space dataset.
This advancement marks a significant leap forward in leveraging machine learning, large data sets and high-performance computing for enhancing antibody design for therapeutic development.
arXiv Detail & Related papers (2024-03-26T17:21:54Z) - Sequence-Based Nanobody-Antigen Binding Prediction [1.7284653203366596]
A critical challenge in nanobodies production is the unavailability of nanobodies for a majority of antigens.
This study aims to develop a machine-learning method to predict Nanobody-Antigen binding solely based on the sequence data.
arXiv Detail & Related papers (2023-07-15T02:00:19Z) - xTrimoABFold: De novo Antibody Structure Prediction without MSA [77.47606749555686]
We develop a novel model named xTrimoABFold to predict antibody structure from antibody sequence.
The model was trained end-to-end on the antibody structures in PDB by minimizing the ensemble loss of domain-specific focal loss on CDR and the frame-aligned point loss.
arXiv Detail & Related papers (2022-11-30T09:26:08Z) - Incorporating Pre-training Paradigm for Antibody Sequence-Structure Co-design [134.65287929316673]
Deep learning-based computational antibody design has attracted popular attention since it automatically mines the antibody patterns from data that could be complementary to human experiences.
The computational methods heavily rely on high-quality antibody structure data, which is quite limited.
Fortunately, there exists a large amount of sequence data of antibodies that can help model the CDR and alleviate the reliance on structure data.
arXiv Detail & Related papers (2022-10-26T15:31:36Z) - Reprogramming Pretrained Language Models for Antibody Sequence Infilling [72.13295049594585]
Computational design of antibodies involves generating novel and diverse sequences, while maintaining structural consistency.
Recent deep learning models have shown impressive results; however, the limited number of known antibody sequence/structure pairs frequently leads to degraded performance.
In our work we address this challenge by leveraging Model Reprogramming (MR), which repurposes pretrained models on a source language to adapt to the tasks that are in a different language and have scarce data.
arXiv Detail & Related papers (2022-10-05T20:44:55Z) - AntBO: Towards Real-World Automated Antibody Design with Combinatorial Bayesian Optimisation [53.43922443725598]
We present AntBO: a Combinatorial optimisation algorithm enabling efficient in silico design of the CDRH3 region.
To benchmark AntBO, we use the Absolut! software suite as a black-box oracle because it can score the target specificity and affinity of designed antibodies in silico.
In under 200 protein designs, AntBO can suggest antibody sequences that outperform the best binding sequence drawn from 6.9 million experimentally obtained CDRH3s.
arXiv Detail & Related papers (2022-01-29T12:03:04Z) - Accelerating Antimicrobial Discovery with Controllable Deep Generative Models and Molecular Dynamics [109.70543391923344]
CLaSS (Controlled Latent attribute Space Sampling) is an efficient computational method for attribute-controlled generation of molecules.
We screen the generated molecules for additional key attributes by using deep learning classifiers in conjunction with novel features derived from atomistic simulations.
The proposed approach is demonstrated for designing non-toxic antimicrobial peptides (AMPs) with strong broad-spectrum potency.
arXiv Detail & Related papers (2020-05-22T15:57:58Z) - CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models [74.58583689523999]
We propose an end-to-end framework, named CogMol, for designing new drug-like small molecules targeting novel viral proteins.
CogMol combines adaptive pre-training of a molecular SMILES Variational Autoencoder (VAE) and an efficient multi-attribute controlled sampling scheme.
CogMol handles multi-constraint design of synthesizable, low-toxic, drug-like molecules with high target specificity and selectivity.
arXiv Detail & Related papers (2020-04-02T18:17:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.