Swarms of Large Language Model Agents for Protein Sequence Design with Experimental Validation
- URL: http://arxiv.org/abs/2511.22311v1
- Date: Thu, 27 Nov 2025 10:42:52 GMT
- Title: Swarms of Large Language Model Agents for Protein Sequence Design with Experimental Validation
- Authors: Fiona Y. Wang, Di Sheng Lee, David L. Kaplan, Markus J. Buehler,
- Abstract summary: Large language model (LLM) agents operate in parallel, each assigned to a specific residue position. This position-wise, decentralized coordination enables emergent design of diverse, well-defined sequences. Our method achieves efficient, objective-directed designs within a few GPU-hours and operates entirely without fine-tuning or specialized training.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Designing proteins de novo with tailored structural, physicochemical, and functional properties remains a grand challenge in biotechnology, medicine, and materials science, due to the vastness of sequence space and the complex coupling between sequence, structure, and function. Current state-of-the-art generative methods, such as protein language models (PLMs) and diffusion-based architectures, often require extensive fine-tuning, task-specific data, or model reconfiguration to support objective-directed design, thereby limiting their flexibility and scalability. To overcome these limitations, we present a decentralized, agent-based framework inspired by swarm intelligence for de novo protein design. In this approach, multiple large language model (LLM) agents operate in parallel, each assigned to a specific residue position. These agents iteratively propose context-aware mutations by integrating design objectives, local neighborhood interactions, and memory and feedback from previous iterations. This position-wise, decentralized coordination enables emergent design of diverse, well-defined sequences without reliance on motif scaffolds or multiple sequence alignments, validated with experiments on proteins with alpha helix and coil structures. Through analyses of residue conservation, structure-based metrics, and sequence convergence and embeddings, we demonstrate that the framework exhibits emergent behaviors and effective navigation of the protein fitness landscape. Our method achieves efficient, objective-directed designs within a few GPU-hours and operates entirely without fine-tuning or specialized training, offering a generalizable and adaptable solution for protein design. Beyond proteins, the approach lays the groundwork for collective LLM-driven design across biomolecular systems and other scientific discovery tasks.
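The swarm loop the abstract describes, position-wise agents proposing mutations in parallel, scored against a design objective, with memory and feedback from rejected proposals, can be sketched as a minimal toy simulation. This is a hedged illustration, not the paper's method: the `propose_mutation` stub stands in for an actual LLM agent call, and `helix_propensity` is an invented toy objective, not the paper's structure-based scoring.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def propose_mutation(pos, memory, rng):
    # Hypothetical stand-in for one LLM agent. In the paper, each agent
    # prompts an LLM with the design objective, its local sequence
    # neighborhood, and feedback from earlier iterations; here the agent
    # simply avoids residues already rejected at its position.
    candidates = [a for a in AMINO_ACIDS if a not in memory[pos]]
    return rng.choice(candidates or list(AMINO_ACIDS))

def helix_propensity(seq):
    # Toy objective (assumption, not the paper's metric): fraction of
    # residues that are strong alpha-helix formers.
    helix_formers = set("AELMQK")
    return sum(r in helix_formers for r in seq) / len(seq)

def swarm_design(length=20, iterations=30, seed=0):
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    memory = [set() for _ in range(length)]  # per-position rejected residues
    for _ in range(iterations):
        # Every position-wise agent proposes a mutation "in parallel"
        # (all proposals drawn before any is applied).
        proposals = [propose_mutation(i, memory, rng) for i in range(length)]
        for i, aa in enumerate(proposals):
            trial = seq.copy()
            trial[i] = aa
            if helix_propensity(trial) >= helix_propensity(seq):
                seq[i] = aa          # accept neutral or beneficial mutation
            else:
                memory[i].add(aa)    # feedback for later iterations
    return "".join(seq), helix_propensity(seq)
```

Because rejected residues accumulate in each position's memory, the swarm converges toward the objective without any central controller, which is the decentralized, emergent behavior the abstract emphasizes.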
Related papers
- SaDiT: Efficient Protein Backbone Design via Latent Structural Tokenization and Diffusion Transformers [50.18388227899971]
We present SaDiT, a novel framework that accelerates protein backbone generation by integrating SaProt Tokenization with a Diffusion Transformer (DiT) architecture. Experiments demonstrate that SaDiT outperforms state-of-the-art models, including RFDiffusion and Proteina, in both computational speed and structural viability.
arXiv Detail & Related papers (2026-02-06T13:50:13Z)
- ProteinAE: Protein Diffusion Autoencoders for Structure Encoding [64.77182442408254]
We introduce ProteinAE, a novel and streamlined protein diffusion autoencoder. ProteinAE directly maps protein backbone coordinates from E(3) into a continuous, compact latent space. We demonstrate that ProteinAE achieves state-of-the-art reconstruction quality, outperforming existing autoencoders.
arXiv Detail & Related papers (2025-10-12T14:30:32Z)
- UniGenX: a unified generative foundation model that couples sequence, structure and function to accelerate scientific design across proteins, molecules and materials [62.72989417755985]
We present UniGenX, a unified generative model for function in natural systems. UniGenX represents heterogeneous inputs as a mixed stream of symbolic and numeric tokens. It achieves state-of-the-art or competitive performance for function-aware generation across domains.
arXiv Detail & Related papers (2025-03-09T16:43:07Z)
- A Model-Centric Review of Deep Learning for Protein Design [0.0]
Deep learning has transformed protein design, enabling accurate structure prediction, sequence optimization, and de novo protein generation. Generative models such as ProtGPT2, ProteinMPNN, and RFdiffusion have enabled sequence and backbone design beyond natural evolution-based limitations. More recently, joint sequence-structure co-design models, including ESM3, have integrated both modalities into a unified framework, resulting in improved designability.
arXiv Detail & Related papers (2025-02-26T14:31:21Z)
- OneProt: Towards Multi-Modal Protein Foundation Models [6.9092913261646345]
We introduce OneProt, a multi-modal AI for proteins that integrates structural, sequence, text, and binding site data. Using the ImageBind framework, OneProt aligns the latent spaces of protein modality encoders in a lightweight fine-tuning scheme. This work expands the horizons of multi-modal protein models, paving the way for transformative applications in drug discovery, biocatalytic reaction planning, and protein engineering.
arXiv Detail & Related papers (2024-11-07T16:54:54Z)
- Structure Language Models for Protein Conformation Generation [66.42864253026053]
Traditional physics-based simulation methods often struggle with sampling equilibrium conformations. Deep generative models have shown promise in generating protein conformations as a more efficient alternative. We introduce Structure Language Modeling as a novel framework for efficient protein conformation generation.
arXiv Detail & Related papers (2024-10-24T03:38:51Z)
- ProtAgents: Protein discovery via large language model multi-agent collaborations combining physics and machine learning [0.0]
ProtAgents is a platform for de novo protein design based on Large Language Models (LLMs).
Multiple AI agents with distinct capabilities collaboratively address complex tasks within a dynamic environment.
The flexibility in designing the agents, together with their capacity for autonomous collaboration in the dynamic LLM-based multi-agent environment, unlocks great potential.
arXiv Detail & Related papers (2024-01-27T20:19:49Z)
- Progressive Multi-Modality Learning for Inverse Protein Folding [47.095862120116976]
We propose a novel protein design paradigm called MMDesign, which leverages multi-modality transfer learning.
MMDesign is the first framework to combine a pretrained structural module with a pretrained contextual module, using an autoencoder (AE)-based language model to incorporate prior protein semantic knowledge.
Experimental results demonstrate that MMDesign, despite being trained only on a small dataset, consistently outperforms baselines on various public benchmarks.
arXiv Detail & Related papers (2023-12-11T10:59:23Z)
- A Hierarchical Training Paradigm for Antibody Structure-sequence Co-design [54.30457372514873]
We propose a hierarchical training paradigm (HTP) for antibody sequence-structure co-design.
HTP consists of four levels of training stages, each corresponding to a specific protein modality.
Empirical experiments show that HTP sets a new state-of-the-art in the co-design problem.
arXiv Detail & Related papers (2023-10-30T02:39:15Z)
- Generative Pretrained Autoregressive Transformer Graph Neural Network applied to the Analysis and Discovery of Novel Proteins [0.0]
We report a flexible language-model based deep learning strategy, applied here to solve complex forward and inverse problems in protein modeling.
The model is applied to predict secondary structure content (per-residue level and overall content), protein solubility, and sequencing tasks.
We find that adding additional tasks yields emergent synergies that the model exploits in improving overall performance.
arXiv Detail & Related papers (2023-05-07T12:30:24Z)
- Protein Sequence and Structure Co-Design with Equivariant Translation [19.816174223173494]
Existing approaches generate both protein sequence and structure using either autoregressive models or diffusion models.
We propose a new approach capable of protein sequence and structure co-design, which iteratively translates both protein sequence and structure into the desired state.
Our model consists of a trigonometry-aware encoder that reasons about geometric constraints and interactions from context features.
All protein amino acids are updated in one shot in each translation step, which significantly accelerates the inference process.
arXiv Detail & Related papers (2022-10-17T06:00:12Z)
- Learning Geometrically Disentangled Representations of Protein Folding Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein.
Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules.
Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.