Related papers: Beyond Protein Language Models: An Agentic LLM Framework for Mechanistic Enzyme Design

URL: http://arxiv.org/abs/2511.19423v1
Date: Mon, 24 Nov 2025 18:57:07 GMT
Title: Beyond Protein Language Models: An Agentic LLM Framework for Mechanistic Enzyme Design
Authors: Bruno Jacob, Khushbu Agarwal, Marcel Baer, Peter Rice, Simone Raugei
Abstract summary: Genie-CAT is a tool-augmented large-language-model (LLM) system designed to accelerate scientific hypothesis generation in protein design. The system generates mechanistically interpretable, testable hypotheses linking sequence, structure, and function.
Score: 0.8471442044818615
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present Genie-CAT, a tool-augmented large-language-model (LLM) system designed to accelerate scientific hypothesis generation in protein design. Using metalloproteins (e.g., ferredoxins) as a case study, Genie-CAT integrates four capabilities (literature-grounded reasoning through retrieval-augmented generation (RAG), structural parsing of Protein Data Bank files, electrostatic potential calculations, and machine-learning prediction of redox properties) into a unified agentic workflow. By coupling natural-language reasoning with data-driven and physics-based computation, the system generates mechanistically interpretable, testable hypotheses linking sequence, structure, and function. In proof-of-concept demonstrations, Genie-CAT autonomously identifies residue-level modifications near [Fe-S] clusters that affect redox tuning, reproducing expert-derived hypotheses in a fraction of the time. The framework highlights how AI agents combining language models with domain-specific tools can bridge symbolic reasoning and numerical simulation, transforming LLMs from conversational assistants into partners for computational discovery.
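As an illustration of the "structural parsing of Protein Data Bank files" step mentioned in the abstract, the following is a minimal sketch in Python, not Genie-CAT's actual implementation. It reads ATOM/HETATM records using the fixed-column PDB layout and lists protein residues with any atom within a cutoff distance of an iron atom. The function names `parse_atoms` and `residues_near_iron` and the 5.0 Å default cutoff are assumptions chosen for this example.

```python
import math


def parse_atoms(pdb_text):
    """Parse ATOM/HETATM records using the fixed-column PDB layout."""
    atoms = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        atoms.append({
            "hetero": record == "HETATM",
            "name": line[12:16].strip(),
            "res_name": line[17:20].strip(),
            "chain": line[21:22].strip(),
            "res_seq": int(line[22:26]),
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            # Element symbol occupies columns 77-78; empty if the line is short.
            "element": line[76:78].strip().upper(),
        })
    return atoms


def residues_near_iron(atoms, cutoff=5.0):
    """Return sorted (chain, res_seq, res_name) tuples for protein residues
    that have at least one atom within `cutoff` angstroms of any Fe atom.

    The cutoff is an illustrative assumption, not a value from the paper.
    """
    irons = [a for a in atoms if a["element"] == "FE"]
    near = set()
    for atom in atoms:
        if atom["hetero"]:
            continue  # skip cofactor atoms; only report protein residues
        if any(math.dist(atom["xyz"], fe["xyz"]) <= cutoff for fe in irons):
            near.add((atom["chain"], atom["res_seq"], atom["res_name"]))
    return sorted(near)
```

In a tool-augmented workflow of the kind the abstract describes, a function like this could be one of the tools an LLM agent calls, with the returned residue list feeding the later electrostatics or redox-prediction steps.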

Related papers

BioLM-Score: Language-Prior Conditioned Probabilistic Geometric Potentials for Protein-Ligand Scoring [23.407269396970168]
We present BioLM-Score, a simple yet generalizable protein-ligand scoring model that couples modeling with representation learning. Evaluations on the CASF-2016 benchmark demonstrate significant improvements across docking, scoring, ranking, and screening tasks. In summary, BioLM-Score provides a principled and practical alternative to existing scoring functions, combining efficiency, generalization, and interpretability for structure-based drug discovery.
arXiv Detail & Related papers (2026-02-09T12:31:49Z)
An Agentic Framework for Autonomous Materials Computation [70.24472585135929]
Large Language Models (LLMs) have emerged as powerful tools for accelerating scientific discovery. Recent advances integrate LLMs into agentic frameworks, enabling retrieval, reasoning, and tool use for complex scientific experiments. Here, we present a domain-specialized agent designed for reliable automation of first-principles materials computations.
arXiv Detail & Related papers (2025-12-22T15:03:57Z)
UniGenX: a unified generative foundation model that couples sequence, structure and function to accelerate scientific design across proteins, molecules and materials [62.72989417755985]
We present UniGenX, a unified generative model for function in natural systems. UniGenX represents heterogeneous inputs as a mixed stream of symbolic and numeric tokens. It achieves state-of-the-art or competitive performance for function-aware generation across domains.
arXiv Detail & Related papers (2025-03-09T16:43:07Z)
GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
Prot2Chat: Protein LLM with Early-Fusion of Text, Sequence and Structure [7.9473027178525975]
We modified ProteinMPNN to encode protein sequence and structural information in a unified way. We used a large language model (LLM) to encode questions into vectors and developed a protein-text adapter to compress protein information into virtual tokens.
arXiv Detail & Related papers (2025-02-07T05:23:16Z)
Computational Protein Science in the Era of Large Language Models (LLMs) [54.35488233989787]
Computational protein science is dedicated to revealing knowledge and developing applications within the protein sequence-structure-function paradigm. Recently, protein Language Models (pLMs) have emerged as a milestone in AI due to their unprecedented language processing and generalization capability.
arXiv Detail & Related papers (2025-01-17T16:21:18Z)
A Transformer Based Generative Chemical Language AI Model for Structural Elucidation of Organic Compounds [1.5628118690186594]
We present a proof-of-concept transformer based generative chemical language artificial intelligence (AI) model. Our model employs an encoder-decoder architecture and self-attention mechanisms to directly generate the most probable chemical structures. It performs structural elucidation of molecules with up to 29 atoms in just a few seconds on a modern CPU, achieving a top-15 accuracy of 83%.
arXiv Detail & Related papers (2024-10-13T15:41:20Z)
X-LoRA: Mixture of Low-Rank Adapter Experts, a Flexible Framework for Large Language Models with Applications in Protein Mechanics and Molecular Design [0.0]
We report a mixture of expert strategy to create fine-tuned large language models using a deep layer-wise token-level approach based on low-rank adaptation (LoRA). The design is inspired by the biological principles of universality and diversity, where neural network building blocks are reused in different hierarchical manifestations. We develop a tailored X-LoRA model that offers scientific capabilities including forward/inverse analysis tasks and enhanced reasoning capability, focused on biomaterial analysis, protein mechanics and design.
arXiv Detail & Related papers (2024-02-11T10:23:34Z)
Endowing Protein Language Models with Structural Knowledge [5.587293092389789]
We introduce a novel framework that enhances protein language models by integrating protein structural data. The refined model, termed Protein Structure Transformer (PST), is further pretrained on a small protein structure database. PST consistently outperforms the state-of-the-art foundation model for protein sequences, ESM-2, setting a new benchmark in protein function prediction.
arXiv Detail & Related papers (2024-01-26T12:47:54Z)
Discovering Interpretable Physical Models using Symbolic Regression and Discrete Exterior Calculus [55.2480439325792]
We propose a framework that combines Symbolic Regression (SR) and Discrete Exterior Calculus (DEC) for the automated discovery of physical models. DEC provides building blocks for the discrete analogue of field theories, which are beyond the state-of-the-art applications of SR to physical problems. We prove the effectiveness of our methodology by re-discovering three models of Continuum Physics from synthetic experimental data.
arXiv Detail & Related papers (2023-10-10T13:23:05Z)
A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis [128.0532113800092]
We present a mechanistic interpretation of Transformer-based LMs on arithmetic questions. This provides insights into how information related to arithmetic is processed by LMs.
arXiv Detail & Related papers (2023-05-24T11:43:47Z)
Incorporating network based protein complex discovery into automated model construction [6.587739898387445]
We propose a method for gene expression based analysis of cancer phenotypes network incorporating knowledge through unsupervised construction of computational graphs. The structural construction of the computational graphs is driven by the use of topological clustering algorithms on protein-protein networks.
arXiv Detail & Related papers (2020-09-29T18:46:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.