De novo generation of functional terpene synthases using TpsGPT
- URL: http://arxiv.org/abs/2512.08772v2
- Date: Mon, 15 Dec 2025 05:09:36 GMT
- Title: De novo generation of functional terpene synthases using TpsGPT
- Authors: Hamsini Ramanathan, Roman Bushuiev, Matouš Soldát, Jirí Kohout, Téo Hebra, Joshua David Smith, Josef Sivic, Tomáš Pluskal
- Abstract summary: Terpene synthases (TPS) are a key family of enzymes responsible for generating the diverse terpene scaffolds that underpin many natural products. We introduce TpsGPT, a generative model for scalable TPS protein design built by fine-tuning the protein language model ProtGPT2 on 79k TPS sequences mined from UniProt.
- Score: 14.65546198539368
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Terpene synthases (TPS) are a key family of enzymes responsible for generating the diverse terpene scaffolds that underpin many natural products, including front-line anticancer drugs such as Taxol. However, de novo TPS design through directed evolution is costly and slow. We introduce TpsGPT, a generative model for scalable TPS protein design, built by fine-tuning the protein language model ProtGPT2 on 79k TPS sequences mined from UniProt. TpsGPT generated de novo enzyme candidates in silico and we evaluated them using multiple validation metrics, including EnzymeExplorer classification, ESMFold structural confidence (pLDDT), sequence diversity, CLEAN classification, InterPro domain detection, and Foldseek structure alignment. From an initial pool of 28k generated sequences, we identified seven putative TPS enzymes that satisfied all validation criteria. Experimental validation confirmed TPS enzymatic activity in at least two of these sequences. Our results show that fine-tuning of a protein language model on a carefully curated, enzyme-class-specific dataset, combined with rigorous filtering, can enable the de novo generation of functional, evolutionarily distant enzymes.
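The filtering funnel described in the abstract (a large pool of generated sequences narrowed to those satisfying every validation criterion) could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Candidate` fields, the filter names, and every threshold are hypothetical placeholders, and the scores are assumed to have been computed beforehand by the respective tools (EnzymeExplorer, ESMFold, CLEAN, InterPro, Foldseek).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Precomputed validation scores for one generated sequence (hypothetical schema)."""
    name: str
    tps_score: float       # classifier probability of being a TPS, 0-1
    plddt: float           # mean predicted structural confidence, 0-100
    max_identity: float    # highest identity to any training sequence, 0-1
    clean_is_tps: bool     # EC-number classifier agrees with a TPS function
    has_tps_domain: bool   # a TPS domain was detected
    tm_score: float        # best structural alignment score to a known TPS, 0-1


# Illustrative thresholds only; none of these values come from the paper.
FILTERS = [
    ("tps_classifier", lambda c: c.tps_score >= 0.5),
    ("plddt", lambda c: c.plddt >= 70.0),
    ("diversity", lambda c: c.max_identity <= 0.6),
    ("clean", lambda c: c.clean_is_tps),
    ("interpro", lambda c: c.has_tps_domain),
    ("foldseek", lambda c: c.tm_score >= 0.5),
]


def run_funnel(candidates):
    """Apply each filter in order; return survivors and per-stage survivor counts."""
    survivors = list(candidates)
    counts = []
    for name, keep in FILTERS:
        survivors = [c for c in survivors if keep(c)]
        counts.append((name, len(survivors)))
    return survivors, counts


if __name__ == "__main__":
    pool = [
        Candidate("gen_001", 0.90, 85.0, 0.40, True, True, 0.7),
        Candidate("gen_002", 0.95, 55.0, 0.30, True, True, 0.8),  # low pLDDT
        Candidate("gen_003", 0.80, 90.0, 0.95, True, True, 0.9),  # too close to training data
    ]
    kept, stage_counts = run_funnel(pool)
    for stage, n in stage_counts:
        print(f"{stage}: {n} remaining")
    print("passed all filters:", [c.name for c in kept])
```

Recording the survivor count after each stage makes it easy to see which criterion removes the most candidates, which helps when tuning thresholds for a new enzyme family.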
Related papers
- PepEDiff: Zero-Shot Peptide Binder Design via Protein Embedding Diffusion [3.9876702935151225]
We present PepEDiff, a novel peptide binder generator that designs binding sequences given a target receptor protein sequence and its pocket residues. Our approach departs from existing methods by generating binder sequences directly in a continuous latent space derived from a pretrained protein embedding model. Despite its simplicity, our method outperforms state-of-the-art approaches across benchmark tests and in the TIGIT case study.
arXiv Detail & Related papers (2026-01-19T19:07:32Z)
- Dynamics-inspired Structure Hallucination for Protein-protein Interaction Modeling [60.57197355431804]
Protein-protein interaction (PPI) represents a central challenge within the biology field. Deep learning has shown promise in forecasting the effects of such mutations, but is hindered by two primary constraints. We present a novel framework named Refine-PPI with two key enhancements.
arXiv Detail & Related papers (2026-01-08T19:29:04Z)
- Self Distillation Fine-Tuning of Protein Language Models Improves Versatility in Protein Design [61.2846583160056]
Supervised fine-tuning (SFT) is a standard approach for adapting large language models to specialized domains. This is in part because high-quality annotated data are far more difficult to obtain for proteins than for natural language. We present a simple and general recipe for fast SFT of PLMs, designed to improve the fidelity, reliability, and novelty of generated protein sequences.
arXiv Detail & Related papers (2025-12-10T05:34:47Z)
- A Standardized Benchmark for Multilabel Antimicrobial Peptide Classification [2.327827051373412]
We present ESCAPE, an experimental framework integrating over 80,000 peptides from 27 validated repositories. Our dataset separates antimicrobial peptides from negative sequences and incorporates their functional annotations into a biologically coherent multilabel hierarchy. Building on ESCAPE, we propose a transformer-based model that leverages sequence and structural information to predict multiple functional activities of peptides.
arXiv Detail & Related papers (2025-11-06T21:10:48Z)
- Regulatory DNA sequence Design with Reinforcement Learning [56.20290878358356]
We propose a generative approach that leverages reinforcement learning to fine-tune a pre-trained autoregressive model. We evaluate our method on promoter design tasks in two yeast media conditions and enhancer design tasks for three human cell types.
arXiv Detail & Related papers (2025-03-11T02:33:33Z)
- MeToken: Uniform Micro-environment Token Boosts Post-Translational Modification Prediction [65.33218256339151]
Post-translational modifications (PTMs) profoundly expand the complexity and functionality of the proteome.
Existing computational approaches predominantly focus on protein sequences to predict PTM sites, driven by the recognition of sequence-dependent motifs.
We introduce the MeToken model, which tokenizes the micro-environment of each amino acid, integrating both sequence and structural information into unified discrete tokens.
arXiv Detail & Related papers (2024-11-04T07:14:28Z)
- Peptide-GPT: Generative Design of Peptides using Generative Pre-trained Transformers and Bio-informatic Supervision [7.275932354889042]
We introduce a protein language model tailored to generate protein sequences with distinct properties.
We rank the generated sequences based on their perplexity scores, then we filter out those lying outside the permissible convex hull of proteins.
We achieved an accuracy of 76.26% in hemolytic, 72.46% in non-hemolytic, 78.84% in non-fouling, and 68.06% in solubility protein generation.
arXiv Detail & Related papers (2024-10-25T00:15:39Z)
- PROflow: An iterative refinement model for PROTAC-induced structure prediction [4.113597666007784]
Proteolysis targeting chimeras (PROTACs) are small molecules that trigger the breakdown of traditionally "undruggable" proteins by binding simultaneously to their targets and degradation-associated proteins.
A key challenge in their rational design is understanding their structural basis of activity.
Existing PROTAC docking methods have been forced to simplify the problem into a distance-constrained protein-protein docking task.
We develop a novel pseudo-data generation scheme that requires only binary protein-protein complexes.
This new dataset enables PROflow, an iterative refinement model for PROTAC-induced structure prediction that models the full PROTAC flexibility during constrained
arXiv Detail & Related papers (2024-04-10T05:29:35Z)
- A Hierarchical Training Paradigm for Antibody Structure-sequence Co-design [54.30457372514873]
We propose a hierarchical training paradigm (HTP) for the antibody sequence-structure co-design.
HTP consists of four levels of training stages, each corresponding to a specific protein modality.
Empirical experiments show that HTP sets the new state-of-the-art performance in the co-design problem.
arXiv Detail & Related papers (2023-10-30T02:39:15Z)
- AbDiffuser: Full-Atom Generation of in vitro Functioning Antibodies [44.149969082612486]
AbDiffuser is an equivariant and physics-informed diffusion model for antibody 3D structures and sequences.
Our approach improves protein diffusion by taking advantage of domain knowledge and physics-based constraints.
Numerical experiments showcase the ability of AbDiffuser to generate antibodies that closely track the sequence and structural properties of a reference set.
arXiv Detail & Related papers (2023-07-28T11:57:44Z)
- State-specific protein-ligand complex structure prediction with a multi-scale deep generative model [68.28309982199902]
We present NeuralPLexer, a computational approach that can directly predict protein-ligand complex structures.
Our study suggests that a data-driven approach can capture the structural cooperativity between proteins and small molecules, showing promise in accelerating the design of enzymes, drug molecules, and beyond.
arXiv Detail & Related papers (2022-09-30T01:46:38Z)
- Diversifying Design of Nucleic Acid Aptamers Using Unsupervised Machine Learning [54.247560894146105]
Inverse design of short single-stranded RNA and DNA sequences (aptamers) is the task of finding sequences that satisfy a set of desired criteria.
We propose to use an unsupervised machine learning model known as the Potts model to discover new, useful sequences with controllable sequence diversity.
arXiv Detail & Related papers (2022-08-10T13:30:58Z)
- Machine learning modeling of family wide enzyme-substrate specificity screens [2.276367922551686]
Biocatalysis is a promising approach to synthesize pharmaceuticals, complex natural products, and commodity chemicals at scale.
The adoption of biocatalysis is limited by our ability to select enzymes that will catalyze their natural chemical transformation on non-natural substrates.
arXiv Detail & Related papers (2021-09-08T19:44:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.