Protein Structure Tokenization via Geometric Byte Pair Encoding
- URL: http://arxiv.org/abs/2511.11758v1
- Date: Thu, 13 Nov 2025 22:53:29 GMT
- Title: Protein Structure Tokenization via Geometric Byte Pair Encoding
- Authors: Michael Sun, Weize Yuan, Gang Liu, Wojciech Matusik, Marinka Zitnik
- Abstract summary: We introduce GeoBPE, a principled protein structure tokenizer (PST). GeoBPE transforms continuous, noisy, multi-scale backbone conformations into discrete ``sentences'' of geometry while enforcing global constraints. It offers compression ($>$10x reduction in bits-per-residue at a similar distortion rate), data efficiency ($>$10x less training data), and generalization.
- Score: 36.39587248348813
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Protein structure is central to biological function, and enabling multimodal protein models requires joint reasoning over sequence, structure, and function. A key barrier is the lack of principled protein structure tokenizers (PSTs): existing approaches fix token size or rely on continuous vector codebooks, limiting interpretability, multi-scale control, and transfer across architectures. We introduce GeoBPE, a geometry-grounded PST that transforms continuous, noisy, multi-scale backbone conformations into discrete ``sentences'' of geometry while enforcing global constraints. Analogous to byte-pair encoding, GeoBPE generates a hierarchical vocabulary of geometric primitives by iteratively (i) clustering Geo-Pair occurrences with k-medoids to yield a resolution-controllable vocabulary; (ii) quantizing each Geo-Pair to its closest medoid prototype; and (iii) reducing drift through differentiable inverse kinematics that optimizes boundary glue angles under an $\mathrm{SE}(3)$ end-frame loss. GeoBPE offers compression ($>$10x reduction in bits-per-residue at similar distortion rate), data efficiency ($>$10x less training data), and generalization (maintains test/train distortion ratio of $1.0-1.1$). It is architecture-agnostic: (a) its hierarchical vocabulary provides a strong inductive bias for coarsening residue-level embeddings from large PLMs into motif- and protein-level representations, consistently outperforming leading PSTs across $12$ tasks and $24$ test splits; (b) paired with a transformer, GeoBPE supports unconditional backbone generation via language modeling; and (c) tokens align with CATH functional families and support expert-interpretable case studies, offering functional meaning absent in prior PSTs. Code is available at https://github.com/shiningsunnyday/PT-BPE/.
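The abstract's merge loop can be sketched in a few lines. The snippet below is a hypothetical illustration, not the authors' code: the toy token sequence, the 2-D descriptors, and the tiny k-medoids routine are all assumptions, and the differentiable inverse-kinematics refinement (step iii) is omitted entirely.

```python
# Hypothetical sketch of one GeoBPE-style merge step (not the authors' code):
# find the most frequent adjacent token pair ("Geo-Pair"), cluster its
# occurrences with k-medoids, and snap each occurrence to a medoid prototype.
import numpy as np
from collections import Counter

def k_medoids(X, k, iters=20, seed=0):
    """Tiny k-medoids under Euclidean distance (illustrative only)."""
    rng = np.random.default_rng(seed)
    medoids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - medoids[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue
            # the medoid is the member minimising total distance to its cluster
            cost = np.linalg.norm(members[:, None] - members[None], axis=-1).sum(1)
            medoids[j] = members[cost.argmin()]
    return medoids, labels

# Toy backbone: each residue is a token id; each adjacent pair carries a
# 2-D geometric descriptor (here just the id pair, standing in for angles).
tokens = [0, 1, 0, 1, 2, 0, 1]
pair = Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]  # the Geo-Pair
occurrences = np.array(
    [[a, b] for a, b in zip(tokens, tokens[1:]) if (a, b) == pair], dtype=float
)
prototypes, labels = k_medoids(occurrences, k=1)  # quantization targets
```

In the real method this loop would iterate, growing a hierarchical vocabulary of geometric primitives, with boundary "glue" angles re-optimized under the $\mathrm{SE}(3)$ end-frame loss after each merge.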
Related papers
- Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles [74.32932832937618]
We introduce RigidSSL (Rigidity-Aware Self-Supervised Learning), a geometric pretraining framework. Phase I (RigidSSL-Perturb) learns geometric priors from 432K structures from the AlphaFold Protein Structure Database with simulated perturbations. Phase II (RigidSSL-MD) refines these representations on 1.3K molecular dynamics trajectories to capture physically realistic transitions.
arXiv Detail & Related papers (2026-03-02T21:32:30Z) - TGSBM: Transformer-Guided Stochastic Block Model for Link Prediction [13.840265247620556]
Link prediction is a cornerstone of the Web ecosystem, powering applications from recommendation and search to knowledge graph completion and collaboration forecasting. Existing approaches face notable limitations: traditional graph neural networks struggle to capture global dependencies, while recent graph transformers achieve strong performance but lack interpretable structure. We propose the Transformer-Guided Stochastic Block Model, a framework that integrates the principled generative structure of Overlapping Block Models with the power of sparse Graph Transformers.
arXiv Detail & Related papers (2026-01-28T14:32:24Z) - Spectral Archaeology: The Causal Topology of Model Evolution [0.0]
Behavioral benchmarks tell us what a model does, but not how. We introduce a training-free mechanistic probe using attention-graph spectra. Across 12 models and 10 languages, these measures yield stable ``fingerprints'' that expose discontinuities missed by standard evaluation.
arXiv Detail & Related papers (2026-01-06T21:26:54Z) - BrepGPT: Autoregressive B-rep Generation with Voronoi Half-Patch [61.20046418942948]
Boundary representation (B-rep) is the de facto standard for CAD model representation in modern industrial design. We present BrepGPT, a single-stage autoregressive framework for B-rep generation.
arXiv Detail & Related papers (2025-11-27T07:16:53Z) - Test time training enhances in-context learning of nonlinear functions [51.56484100374058]
Test-time training (TTT) enhances model performance by explicitly updating designated parameters prior to each prediction. We investigate the combination of TTT with in-context learning (ICL), where the model is given a few examples from the target distribution at inference time.
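The TTT recipe above can be made concrete with a toy model. This is an illustrative assumption on my part, not the paper's setup: a linear model with squared loss stands in for the transformer, and the context set plays the role of the in-context examples.

```python
# Illustrative test-time training (TTT) sketch: before answering a query,
# take a few gradient steps on the in-context examples, then predict.
import numpy as np

def ttt_predict(w0, X_ctx, y_ctx, x_query, lr=0.1, steps=200):
    w = w0.copy()
    for _ in range(steps):
        # gradient of mean squared error on the context set
        grad = X_ctx.T @ (X_ctx @ w - y_ctx) / len(y_ctx)
        w -= lr * grad              # update designated params at test time
    return x_query @ w

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X_ctx = rng.normal(size=(32, 2))    # in-context examples from the target task
y_ctx = X_ctx @ w_true
pred = ttt_predict(np.zeros(2), X_ctx, y_ctx, np.array([1.0, 1.0]))
# pred approaches w_true @ [1, 1] = 1.0 as the number of steps grows
```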
arXiv Detail & Related papers (2025-09-30T03:56:44Z) - ProtInvTree: Deliberate Protein Inverse Folding with Reward-guided Tree Search [77.55575655986252]
ProtInvTree is a reward-guided tree-search framework for protein inverse folding. It reformulates sequence generation as a deliberate, step-wise decision-making process. It supports flexible test-time scaling by expanding the search depth and breadth without retraining.
arXiv Detail & Related papers (2025-06-01T09:34:20Z) - HoLa: B-Rep Generation using a Holistic Latent Representation [51.07878285790399]
We introduce a novel representation for learning and generating Computer-Aided Design (CAD) models in the form of boundary representations (B-Reps). Our representation unifies the continuous geometric properties of B-Rep primitives in different orders. Our method significantly reduces ambiguities, redundancies, and incoherences among the generated B-Rep primitives.
arXiv Detail & Related papers (2025-04-19T10:34:24Z) - Distributed Learning over Arbitrary Topology: Linear Speed-Up with Polynomial Transient Time [3.1789549088190414]
We study a distributed learning problem in which $n$ agents, each with potentially heterogeneous local data, collaboratively minimize the sum of their local cost functions via peer-to-peer communication. We propose a novel method, Stochastic Tree Push-Pull (STPP), which employs two trees extracted from a general communication graph to distribute both model parameters and topological parameters.
arXiv Detail & Related papers (2025-03-20T13:11:44Z) - DTGBrepGen: A Novel B-rep Generative Model through Decoupling Topology and Geometry [3.859930277034918]
Boundary representation (B-rep) of geometric models is a fundamental format in Computer-Aided Design (CAD). We propose DTGBrepGen, a novel topology-geometry decoupled framework for B-rep generation.
arXiv Detail & Related papers (2025-03-17T12:34:14Z) - Understanding Token-level Topological Structures in Transformer-based Time Series Forecasting [52.364260925700485]
Transformer-based methods have achieved state-of-the-art performance in time series forecasting (TSF). It remains unclear whether existing Transformers fully leverage the intrinsic topological structure among tokens throughout intermediate layers. We propose the Topology Enhancement Method (TEM), a novel Transformer-based TSF method that explicitly and adaptively preserves token-level topology.
arXiv Detail & Related papers (2024-04-16T07:21:39Z) - DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome [10.051595222470304]
We argue that the computation and sample inefficiencies introduced by k-mer tokenization are primary obstacles in developing large genome foundational models.
We provide conceptual and empirical insights into genome tokenization, building on which we propose to replace k-mer tokenization with Byte Pair Encoding (BPE).
We introduce DNABERT-2, a refined genome foundation model that adapts an efficient tokenizer and employs multiple strategies to overcome input length constraints.
arXiv Detail & Related papers (2023-06-26T18:43:46Z)
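The k-mer vs. BPE contrast in the DNABERT-2 entry can be shown on a toy DNA string. The code below is a generic illustration of the two tokenization styles and assumes nothing about DNABERT-2's actual tokenizer: overlapping k-mers produce redundant tokens, while BPE-style merges compress frequent substrings.

```python
# Generic contrast between overlapping k-mer tokenization and a BPE-style
# merge pass on a DNA string (illustrative; not DNABERT-2's tokenizer).
from collections import Counter

def kmer_tokenize(seq, k=3):
    # overlapping k-mers: adjacent tokens share k-1 characters (redundant)
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def bpe_tokenize(seq, merges=3):
    tokens = list(seq)
    for _ in range(merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        a, b = pairs.most_common(1)[0][0]   # merge the most frequent pair
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

seq = "ATGATGATGC"
kmers = kmer_tokenize(seq)      # 8 overlapping 3-mers for a 10-base string
bpe_toks = bpe_tokenize(seq)    # fewer, non-overlapping learned tokens
```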
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.