Flow Autoencoders are Effective Protein Tokenizers
- URL: http://arxiv.org/abs/2510.00351v1
- Date: Tue, 30 Sep 2025 23:29:39 GMT
- Title: Flow Autoencoders are Effective Protein Tokenizers
- Authors: Rohit Dilip, Evan Zhang, Ayush Varshney, David Van Valen
- Abstract summary: We present Kanzi, a flow-based tokenizer for tokenization and generation of protein structures. Kanzi consists of a diffusion autoencoder trained with a flow matching loss. We find that these changes stabilize the training of parameter-efficient models that outperform existing tokenizers.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Protein structure tokenizers enable the creation of multimodal models of protein structure, sequence, and function. Current approaches to protein structure tokenization rely on bespoke components that are invariant to spatial symmetries, but that are challenging to optimize and scale. We present Kanzi, a flow-based tokenizer for tokenization and generation of protein structures. Kanzi consists of a diffusion autoencoder trained with a flow matching loss. We show that this approach simplifies several aspects of protein structure tokenizers: frame-based representations can be replaced with global coordinates, complex losses are replaced with a single flow matching loss, and SE(3)-invariant attention operations can be replaced with standard attention. We find that these changes stabilize the training of parameter-efficient models that outperform existing tokenizers on reconstruction metrics at a fraction of the model size and training cost. An autoregressive model trained with Kanzi outperforms similar generative models that operate over tokens, although it does not yet match the performance of state-of-the-art continuous diffusion models. Code is available here: https://github.com/rdilip/kanzi/.
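The single flow matching loss the abstract describes has a simple core: sample noise x0, interpolate linearly toward the data x1, and regress the model's predicted velocity onto x1 - x0. A minimal numpy sketch of that objective follows; it illustrates generic conditional flow matching, not Kanzi's actual training code, and `zero_model` and the shapes are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, x1, rng):
    """Conditional flow matching: draw noise x0 and a time t, form the
    interpolant x_t = (1 - t) * x0 + t * x1, and regress the model's
    predicted velocity onto the target velocity x1 - x0."""
    x0 = rng.standard_normal(x1.shape)        # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))    # one time per example
    xt = (1.0 - t) * x0 + t * x1              # linear interpolant
    target = x1 - x0                          # target velocity field
    return np.mean((model(xt, t) - target) ** 2)

# Placeholder "model" predicting zero velocity; with standard-normal data
# and noise, the loss approaches E[(x1 - x0)^2] = 2 per coordinate.
zero_model = lambda xt, t: np.zeros_like(xt)
x1 = rng.standard_normal((4096, 3))           # stand-in for flattened coords
loss = flow_matching_loss(zero_model, x1, rng)
```

Because the target is a simple regression against a linear interpolant, this objective needs none of the auxiliary losses that frame-based tokenizers typically combine.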
Related papers
- SaDiT: Efficient Protein Backbone Design via Latent Structural Tokenization and Diffusion Transformers
We present SaDiT, a novel framework that accelerates protein backbone generation by integrating SaProt Tokenization with a Diffusion Transformer (DiT) architecture. Experiments demonstrate that SaDiT outperforms state-of-the-art models, including RFDiffusion and Proteina, in both computational speed and structural viability.
arXiv Detail & Related papers (2026-02-06T13:50:13Z) - Adaptive Protein Tokenization
Existing protein structure tokenizers create tokens by pooling information from local neighborhoods. We present a method for global tokenization of protein structures in which successive tokens contribute increasing levels of detail to a global representation. We show how adaptive tokens enable inference criteria based on information content, which boosts designability.
arXiv Detail & Related papers (2026-02-06T06:15:14Z) - MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging
This paper introduces a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks. MergeDNA achieves superior performance on three popular DNA benchmarks and several multi-omics tasks with fine-tuning or zero-shot evaluation.
arXiv Detail & Related papers (2025-11-17T19:27:41Z) - ProteinAE: Protein Diffusion Autoencoders for Structure Encoding
We introduce ProteinAE, a novel and streamlined protein diffusion autoencoder. ProteinAE directly maps protein backbone coordinates from E(3) into a continuous, compact latent space. We demonstrate that ProteinAE achieves state-of-the-art reconstruction quality, outperforming existing autoencoders.
arXiv Detail & Related papers (2025-10-12T14:30:32Z) - Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation
We propose TokenBridge, which maintains the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. We introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism. Our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction.
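Dimension-wise quantization of the kind TokenBridge describes, discretizing each feature dimension independently into a small set of bins, can be sketched in a few lines. This is an illustrative toy with an assumed uniform codebook and value range, not the paper's implementation:

```python
import numpy as np

def quantize_dims(x, n_bins=16, lo=-1.0, hi=1.0):
    """Quantize each feature dimension independently into uniform bins.
    Returns integer codes (one per scalar entry) and dequantized values."""
    edges = np.linspace(lo, hi, n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])     # bin midpoints
    codes = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    return codes, centers[codes]

x = np.array([[0.05, -0.90], [0.62, 0.30]])      # toy continuous latents
codes, xq = quantize_dims(x)
# Round-trip error is bounded by half a bin width: (hi - lo) / n_bins / 2.
```

Because each dimension gets its own small categorical code, a downstream autoregressive model can predict tokens with an ordinary softmax rather than a huge joint codebook.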
arXiv Detail & Related papers (2025-03-20T17:59:59Z) - Rao-Blackwell Gradient Estimators for Equivariant Denoising Diffusion
In domains such as molecular and protein generation, physical systems exhibit inherent symmetries that are critical to model. We present a framework that reduces training variance and provides a provably lower-variance gradient estimator. We also present a practical implementation of this estimator incorporating the loss and sampling procedure through a method we call Orbit Diffusion.
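The underlying variance-reduction idea, averaging over a symmetry orbit rather than sampling a single group element, can be illustrated with a toy discrete group. This sketch is a generic Rao-Blackwellization example, not the paper's Orbit Diffusion procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# C4 rotation group acting on 2D points.
group = [np.eye(2), np.array([[0., -1.], [1., 0.]]),
         -np.eye(2), np.array([[0., 1.], [-1., 0.]])]

def loss(x):
    # An anisotropic loss, so the group action actually changes its value.
    return float(x[0] ** 2 + 0.1 * x[1] ** 2)

x = np.array([1.0, 2.0])
# Naive estimator: evaluate at one randomly sampled group element.
one_sample = [loss(group[rng.integers(4)] @ x) for _ in range(5000)]
# Rao-Blackwellized estimator: average over the whole orbit.
# Both estimators have the same mean, but the orbit average is exact here.
orbit_avg = float(np.mean([loss(g @ x) for g in group]))
```

The single-sample estimator fluctuates between the orbit's loss values, while the orbit average removes that source of variance entirely; the paper's contribution is making this averaging practical inside a diffusion training loop.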
arXiv Detail & Related papers (2025-02-14T03:26:57Z) - Learning the Language of Protein Structure
We introduce an approach using a vector-quantized autoencoder that effectively tokenizes protein structures into discrete representations. To demonstrate the efficacy of our learned representations, we show that a simple GPT model trained on our codebooks can generate novel, diverse, and designable protein structures.
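The vector-quantization bottleneck underlying such tokenizers reduces to a nearest-neighbor lookup against a learned codebook. A minimal numpy sketch with a hand-written codebook (illustrative only, not the paper's model):

```python
import numpy as np

def vq_encode(z, codebook):
    """Map each latent vector to the index of its nearest codebook entry
    (squared Euclidean distance), as in a VQ-VAE bottleneck."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])  # toy codebook
z = np.array([[0.9, 1.1], [-0.2, 0.1]])                     # encoder outputs
idx = vq_encode(z, codebook)          # discrete tokens
z_q = codebook[idx]                   # quantized latents fed to the decoder
print(idx)  # [1 0]
```

The resulting integer sequence is exactly what a GPT-style model is trained on: the structure becomes a "language" over codebook indices.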
arXiv Detail & Related papers (2024-05-24T16:03:47Z) - FoldToken: Learning Protein Language via Vector Quantization and Beyond
We introduce FoldTokenizer to represent protein sequence-structure as discrete symbols.
We refer to the learned symbols as FoldToken, and the sequence of FoldTokens serves as a new protein language.
arXiv Detail & Related papers (2024-02-04T12:18:51Z) - Ophiuchus: Scalable Modeling of Protein Structures through Hierarchical Coarse-graining SO(3)-Equivariant Autoencoders
Three-dimensional native states of natural proteins display recurring and hierarchical patterns.
Traditional graph-based modeling of protein structures is often limited to operate within a single fine-grained resolution.
We introduce Ophiuchus, an SO(3)-equivariant coarse-graining model that efficiently operates on all-atom protein structures.
arXiv Detail & Related papers (2023-10-04T01:01:11Z) - Symbolic Regression by Exhaustive Search: Reducing the Search Space Using Syntactical Constraints and Efficient Semantic Structure Deduplication
Symbolic regression is a powerful system identification technique in industrial scenarios where no prior knowledge on model structure is available.
In this chapter we introduce a deterministic symbolic regression algorithm specifically designed to address these issues.
A finite enumeration of all possible models is guaranteed by structural restrictions as well as a caching mechanism for detecting semantically equivalent solutions.
arXiv Detail & Related papers (2021-09-28T17:47:51Z) - Structured Reordering for Modeling Latent Alignments in Sequence Transduction
We present an efficient dynamic programming algorithm performing exact marginal inference of separable permutations.
The resulting seq2seq model exhibits better systematic generalization than standard models on synthetic problems and NLP tasks.
arXiv Detail & Related papers (2021-06-06T21:53:54Z) - Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers
We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR).
Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors.
It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence.
arXiv Detail & Related papers (2020-06-05T17:09:16Z)
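The FAVOR idea can be sketched directly: replace the n x n softmax attention matrix with positive random features phi(x) = exp(w . x - |x|^2 / 2) / sqrt(m), so attention factorizes as phi(Q)(phi(K)^T V) and costs linear time in sequence length. Below is a toy numpy comparison against exact softmax attention, using plain i.i.d. Gaussian features rather than the orthogonal construction, and omitting the usual 1/sqrt(d) scaling:

```python
import numpy as np

rng = np.random.default_rng(0)

def favor_features(x, w):
    """Positive random features whose dot products approximate the
    softmax kernel: E[phi(q) . phi(k)] = exp(q . k)."""
    m = w.shape[0]
    return np.exp(x @ w.T - 0.5 * (x ** 2).sum(-1, keepdims=True)) / np.sqrt(m)

def linear_attention(q, k, v, w):
    """phi(Q) (phi(K)^T V), normalized row-wise: O(n*m*d) instead of O(n^2*d)."""
    qp, kp = favor_features(q, w), favor_features(k, w)
    num = qp @ (kp.T @ v)              # (n, m) @ (m, d) -> (n, d)
    den = qp @ kp.sum(axis=0)          # row normalizers, shape (n,)
    return num / den[:, None]

n, d, m = 8, 4, 4096
q = 0.3 * rng.standard_normal((n, d))
k = 0.3 * rng.standard_normal((n, d))
v = rng.standard_normal((n, d))
w = rng.standard_normal((m, d))        # random feature projections

approx = linear_attention(q, k, v, w)
a = np.exp(q @ k.T)                    # exact (unscaled) softmax attention
exact = (a / a.sum(-1, keepdims=True)) @ v
err = float(np.abs(approx - exact).max())
```

Because the key-value summary phi(K)^T V has shape (m, d) regardless of sequence length, the n x n attention matrix is never materialized, which is what yields the linear scaling and sub-quadratic memory the abstract claims.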