Related papers: AVIDa-hIL6: A Large-Scale VHH Dataset Produced from an Immunized Alpaca for Predicting Antigen-Antibody Interactions

URL: http://arxiv.org/abs/2306.03329v2
Date: Wed, 11 Oct 2023 00:42:26 GMT
Title: AVIDa-hIL6: A Large-Scale VHH Dataset Produced from an Immunized Alpaca for Predicting Antigen-Antibody Interactions
Authors: Hirofumi Tsuruta, Hiroyuki Yamazaki, Ryota Maeda, Ryotaro Tamura, Jennifer N. Wei, Zelda Mariet, Poomarin Phloyphisut, Hidetoshi Shimokawa, Joseph R. Ledsam, Lucy Colwell, Akihiro Imura
Abstract summary: We have developed a large-scale dataset for predicting antigen-antibody interactions in the variable domain of heavy chain of heavy chain antibodies (VHHs). AVIDa-hIL6 contains 573,891 antigen-VHH pairs with amino acid sequences. We report experimental benchmark results on AVIDa-hIL6 by using machine learning models.
Score: 1.1381826108737396
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Antibodies have become an important class of therapeutic agents to treat human diseases. To accelerate therapeutic antibody discovery, computational methods, especially machine learning, have attracted considerable interest for predicting specific interactions between antibody candidates and target antigens such as viruses and bacteria. However, the publicly available datasets in existing works have notable limitations, such as small sizes and the lack of non-binding samples and exact amino acid sequences. To overcome these limitations, we have developed AVIDa-hIL6, a large-scale dataset for predicting antigen-antibody interactions in the variable domain of heavy chain of heavy chain antibodies (VHHs), produced from an alpaca immunized with the human interleukin-6 (IL-6) protein, as antigens. By leveraging the simple structure of VHHs, which facilitates identification of full-length amino acid sequences by DNA sequencing technology, AVIDa-hIL6 contains 573,891 antigen-VHH pairs with amino acid sequences. All the antigen-VHH pairs have reliable labels for binding or non-binding, as generated by a novel labeling method. Furthermore, via introduction of artificial mutations, AVIDa-hIL6 contains 30 different mutants in addition to wild-type IL-6 protein. This characteristic provides opportunities to develop machine learning models for predicting changes in antibody binding by antigen mutations. We report experimental benchmark results on AVIDa-hIL6 by using machine learning models. The results indicate that the existing models have potential, but further research is needed to generalize them to predict effective antibodies against unknown mutants. The dataset is available at https://avida-hil6.cognanous.com.

Related papers

Llama-Affinity: A Predictive Antibody Antigen Binding Model Integrating Antibody Sequences with Llama3 Backbone Architecture [2.474908349649168]
We present an advanced antibody-antigen binding affinity prediction model (Llamafinity). The model achieved an accuracy of 0.9640, an F1-score of 0.9643, a precision of 0.9702, a recall of 0.9586, and an AUC-ROC of 0.9936. This strategy unveiled higher computational efficiency, with a five-fold average cumulative training time of only 0.46 hours.
arXiv Detail & Related papers (2025-05-17T20:10:54Z)
dyAb: Flow Matching for Flexible Antibody Design with AlphaFold-driven Pre-binding Antigen [52.809470467635194]
Development of therapeutic antibodies heavily relies on accurate predictions of how antigens will interact with antibodies. Existing computational methods in antibody design often overlook crucial conformational changes that antigens undergo during the binding process. We introduce dyAb, a flexible framework that incorporates AlphaFold2-driven predictions to model pre-binding antigen structures.
arXiv Detail & Related papers (2025-03-01T03:53:18Z)
Leveraging Large Language Models to Predict Antibody Biological Activity Against Influenza A Hemagglutinin [0.15547733154162566]
We develop an AI model for predicting the binding and receptor blocking activity of antibodies against influenza A hemagglutinin (HA) antigens. Our models achieved an AUROC $\geq$ 0.91 for predicting the activity of existing antibodies against seen HAs and an AUROC of 0.9 for unseen HAs.
arXiv Detail & Related papers (2025-02-02T06:48:45Z)
Relation-Aware Equivariant Graph Networks for Epitope-Unknown Antibody Design and Specificity Optimization [61.06622479173572]
We propose a novel Relation-Aware Design (RAAD) framework, which models antigen-antibody interactions for co-designing sequences and structures of antigen-specific CDRs. Furthermore, we propose a new evaluation metric to better measure antibody specificity and develop a contrasting specificity-enhancing constraint to optimize the specificity of antibodies.
arXiv Detail & Related papers (2024-12-14T03:00:44Z)
A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models [0.0]
We introduce AVIDa-SARS-CoV-2, a dataset featuring the antigen-variable domain of heavy chain of heavy chain antibody (VHH) interactions. VHHCorpus-2M, a pre-training dataset for antibody language models, contains over two million VHH sequences. We report benchmark results for predicting SARS-CoV-2-VHH binding using VHHBERT pre-trained on VHHCorpus-2M and existing general protein and antibody-specific pre-trained language models.
arXiv Detail & Related papers (2024-05-29T04:22:18Z)
Large scale paired antibody language models [40.401345152825314]
We present IgBert and IgT5, the best performing antibody-specific language models developed to date. These models are trained comprehensively on the more than two billion sequences in the Observed Antibody Space dataset. This advancement marks a significant leap forward in leveraging machine learning, large data sets and high-performance computing for enhancing antibody design for therapeutic development.
arXiv Detail & Related papers (2024-03-26T17:21:54Z)
Sequence-Based Nanobody-Antigen Binding Prediction [1.7284653203366596]
A critical challenge in nanobody production is the unavailability of nanobodies for a majority of antigens. This study aims to develop a machine-learning method to predict nanobody-antigen binding based solely on sequence data.
arXiv Detail & Related papers (2023-07-15T02:00:19Z)
xTrimoABFold: De novo Antibody Structure Prediction without MSA [77.47606749555686]
We develop a novel model named xTrimoABFold to predict antibody structure from antibody sequence. The model was trained end-to-end on the antibody structures in PDB by minimizing the ensemble loss of domain-specific focal loss on CDR and the frame-aligned point loss.
arXiv Detail & Related papers (2022-11-30T09:26:08Z)
Incorporating Pre-training Paradigm for Antibody Sequence-Structure Co-design [134.65287929316673]
Deep learning-based computational antibody design has attracted popular attention since it automatically mines the antibody patterns from data that could be complementary to human experiences. The computational methods heavily rely on high-quality antibody structure data, which is quite limited. Fortunately, there exists a large amount of sequence data of antibodies that can help model the CDR and alleviate the reliance on structure data.
arXiv Detail & Related papers (2022-10-26T15:31:36Z)
Reprogramming Pretrained Language Models for Antibody Sequence Infilling [72.13295049594585]
Computational design of antibodies involves generating novel and diverse sequences while maintaining structural consistency. Recent deep learning models have shown impressive results; however, the limited number of known antibody sequence/structure pairs frequently leads to degraded performance. In our work we address this challenge by leveraging Model Reprogramming (MR), which repurposes models pretrained on a source language to adapt to tasks that are in a different language and have scarce data.
arXiv Detail & Related papers (2022-10-05T20:44:55Z)
AntBO: Towards Real-World Automated Antibody Design with Combinatorial Bayesian Optimisation [53.43922443725598]
We present AntBO: a Combinatorial optimisation algorithm enabling efficient in silico design of the CDRH3 region. To benchmark AntBO, we use the Absolut! software suite as a black-box oracle because it can score the target specificity and affinity of designed antibodies in silico. In under 200 protein designs, AntBO can suggest antibody sequences that outperform the best binding sequence drawn from 6.9 million experimentally obtained CDRH3s.
arXiv Detail & Related papers (2022-01-29T12:03:04Z)
Accelerating Antimicrobial Discovery with Controllable Deep Generative Models and Molecular Dynamics [109.70543391923344]
CLaSS (Controlled Latent attribute Space Sampling) is an efficient computational method for attribute-controlled generation of molecules. We screen the generated molecules for additional key attributes by using deep learning classifiers in conjunction with novel features derived from atomistic simulations. The proposed approach is demonstrated for designing non-toxic antimicrobial peptides (AMPs) with strong broad-spectrum potency.
arXiv Detail & Related papers (2020-05-22T15:57:58Z)
CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models [74.58583689523999]
We propose an end-to-end framework, named CogMol, for designing new drug-like small molecules targeting novel viral proteins. CogMol combines adaptive pre-training of a molecular SMILES Variational Autoencoder (VAE) and an efficient multi-attribute controlled sampling scheme. CogMol handles multi-constraint design of synthesizable, low-toxic, drug-like molecules with high target specificity and selectivity.
arXiv Detail & Related papers (2020-04-02T18:17:20Z)