Protein Structure Tokenization via Geometric Byte Pair Encoding
- URL: http://arxiv.org/abs/2511.11758v1
- Date: Thu, 13 Nov 2025 22:53:29 GMT
- Title: Protein Structure Tokenization via Geometric Byte Pair Encoding
- Authors: Michael Sun, Weize Yuan, Gang Liu, Wojciech Matusik, Marinka Zitnik
- Abstract summary: We introduce GeoBPE, a principled protein structure tokenizer (PST). GeoBPE transforms continuous, noisy, multi-scale backbone conformations into discrete ``sentences'' of geometry while enforcing global constraints. It offers compression ($>$10x reduction in bits-per-residue at a similar distortion rate), data efficiency ($>$10x less training data), and generalization.
- Score: 36.39587248348813
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Protein structure is central to biological function, and enabling multimodal protein models requires joint reasoning over sequence, structure, and function. A key barrier is the lack of principled protein structure tokenizers (PSTs): existing approaches fix token size or rely on continuous vector codebooks, limiting interpretability, multi-scale control, and transfer across architectures. We introduce GeoBPE, a geometry-grounded PST that transforms continuous, noisy, multi-scale backbone conformations into discrete ``sentences'' of geometry while enforcing global constraints. Analogous to byte-pair encoding, GeoBPE generates a hierarchical vocabulary of geometric primitives by iteratively (i) clustering Geo-Pair occurrences with k-medoids to yield a resolution-controllable vocabulary; (ii) quantizing each Geo-Pair to its closest medoid prototype; and (iii) reducing drift through differentiable inverse kinematics that optimizes boundary glue angles under an $\mathrm{SE}(3)$ end-frame loss. GeoBPE offers compression ($>$10x reduction in bits-per-residue at similar distortion rate), data efficiency ($>$10x less training data), and generalization (maintains test/train distortion ratio of $1.0-1.1$). It is architecture-agnostic: (a) its hierarchical vocabulary provides a strong inductive bias for coarsening residue-level embeddings from large PLMs into motif- and protein-level representations, consistently outperforming leading PSTs across $12$ tasks and $24$ test splits; (b) paired with a transformer, GeoBPE supports unconditional backbone generation via language modeling; and (c) tokens align with CATH functional families and support expert-interpretable case studies, offering functional meaning absent in prior PSTs. Code is available at https://github.com/shiningsunnyday/PT-BPE/.
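The abstract's merge loop can be sketched in a few lines. The snippet below is a hypothetical illustration, not the authors' code: the toy token sequence, the 2-D descriptors, and the tiny k-medoids routine are all assumptions, and the differentiable inverse-kinematics refinement (step iii) is omitted entirely.

```python
# Hypothetical sketch of one GeoBPE-style merge step (not the authors' code):
# find the most frequent adjacent token pair ("Geo-Pair"), cluster its
# occurrences with k-medoids, and snap each occurrence to a medoid prototype.
import numpy as np
from collections import Counter

def k_medoids(X, k, iters=20, seed=0):
    """Tiny k-medoids under Euclidean distance (illustrative only)."""
    rng = np.random.default_rng(seed)
    medoids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - medoids[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue
            # the medoid is the member minimising total distance to its cluster
            cost = np.linalg.norm(members[:, None] - members[None], axis=-1).sum(1)
            medoids[j] = members[cost.argmin()]
    return medoids, labels

# Toy backbone: each residue is a token id; each adjacent pair carries a
# 2-D geometric descriptor (here just the id pair, standing in for angles).
tokens = [0, 1, 0, 1, 2, 0, 1]
pair = Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]  # the Geo-Pair
occurrences = np.array(
    [[a, b] for a, b in zip(tokens, tokens[1:]) if (a, b) == pair], dtype=float
)
prototypes, labels = k_medoids(occurrences, k=1)  # quantization targets
```

In the real method this loop would iterate, growing a hierarchical vocabulary of geometric primitives, with boundary "glue" angles re-optimized under the $\mathrm{SE}(3)$ end-frame loss after each merge.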
Related papers
- Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles [74.32932832937618]
We introduce RigidSSL (Rigidity-Aware Self-Supervised Learning), a geometric pretraining framework. Phase I (RigidSSL-Perturb) learns geometric priors from 432K structures from the AlphaFold Protein Structure Database with simulated perturbations. Phase II (RigidSSL-MD) refines these representations on 1.3K molecular dynamics trajectories to capture physically realistic transitions.
arXiv Detail & Related papers (2026-03-02T21:32:30Z) - TGSBM: Transformer-Guided Stochastic Block Model for Link Prediction [13.840265247620556]
Link prediction is a cornerstone of the Web ecosystem, powering applications from recommendation and search to knowledge graph completion and collaboration forecasting. Existing approaches face notable limitations: traditional graph neural networks struggle to capture global dependencies, while recent graph transformers achieve strong performance but lack interpretable structure. We propose the Transformer-Guided Stochastic Block Model, a framework that integrates the principled generative structure of Overlapping Block Models with the power of sparse Graph Transformers.
arXiv Detail & Related papers (2026-01-28T14:32:24Z) - Spectral Archaeology: The Causal Topology of Model Evolution [0.0]
Behavioral benchmarks tell us what a model does, but not how. We introduce a training-free mechanistic probe using attention-graph spectra. Across 12 models and 10 languages, these measures yield stable ``fingerprints'' that expose discontinuities missed by standard evaluation.
arXiv Detail & Related papers (2026-01-06T21:26:54Z) - BrepGPT: Autoregressive B-rep Generation with Voronoi Half-Patch [61.20046418942948]
Boundary representation (B-rep) is the de facto standard for CAD model representation in modern industrial design. We present BrepGPT, a single-stage autoregressive framework for B-rep generation.
arXiv Detail & Related papers (2025-11-27T07:16:53Z) - Test time training enhances in-context learning of nonlinear functions [51.56484100374058]
Test-time training (TTT) enhances model performance by explicitly updating designated parameters prior to each prediction. We investigate the combination of TTT with in-context learning (ICL), where the model is given a few examples from the target distribution at inference time.
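The TTT recipe above can be made concrete with a toy model. This is an illustrative assumption on my part, not the paper's setup: a linear model with squared loss stands in for the transformer, and the context set plays the role of the in-context examples.

```python
# Illustrative test-time training (TTT) sketch: before answering a query,
# take a few gradient steps on the in-context examples, then predict.
import numpy as np

def ttt_predict(w0, X_ctx, y_ctx, x_query, lr=0.1, steps=200):
    w = w0.copy()
    for _ in range(steps):
        # gradient of mean squared error on the context set
        grad = X_ctx.T @ (X_ctx @ w - y_ctx) / len(y_ctx)
        w -= lr * grad              # update designated params at test time
    return x_query @ w

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X_ctx = rng.normal(size=(32, 2))    # in-context examples from the target task
y_ctx = X_ctx @ w_true
pred = ttt_predict(np.zeros(2), X_ctx, y_ctx, np.array([1.0, 1.0]))
# pred approaches w_true @ [1, 1] = 1.0 as the number of steps grows
```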
arXiv Detail & Related papers (2025-09-30T03:56:44Z) - ProtInvTree: Deliberate Protein Inverse Folding with Reward-guided Tree Search [77.55575655986252]
ProtInvTree is a reward-guided tree-search framework for protein inverse folding. It reformulates sequence generation as a deliberate, step-wise decision-making process. It supports flexible test-time scaling by expanding the search depth and breadth without retraining.
arXiv Detail & Related papers (2025-06-01T09:34:20Z) - HoLa: B-Rep Generation using a Holistic Latent Representation [51.07878285790399]
We introduce a novel representation for learning and generating Computer-Aided Design (CAD) models in the form of boundary representations (B-Reps). Our representation unifies the continuous geometric properties of B-Rep primitives in different orders. Our method significantly reduces ambiguities, redundancies, and incoherences among the generated B-Rep primitives.
arXiv Detail & Related papers (2025-04-19T10:34:24Z) - Distributed Learning over Arbitrary Topology: Linear Speed-Up with Polynomial Transient Time [3.1789549088190414]
We study a distributed learning problem in which $n$ agents, each with potentially heterogeneous local data, collaboratively minimize the sum of their local cost functions via peer-to-peer communication. We propose a novel method, Stochastic Tree Push-Pull (STPP), which employs two trees extracted from a general communication graph to distribute both model parameters and topological parameters.
arXiv Detail & Related papers (2025-03-20T13:11:44Z) - DTGBrepGen: A Novel B-rep Generative Model through Decoupling Topology and Geometry [3.859930277034918]
Boundary representation (B-rep) of geometric models is a fundamental format in Computer-Aided Design (CAD). We propose DTGBrepGen, a novel topology-geometry decoupled framework for B-rep generation.
arXiv Detail & Related papers (2025-03-17T12:34:14Z) - Understanding Token-level Topological Structures in Transformer-based Time Series Forecasting [52.364260925700485]
Transformer-based methods have achieved state-of-the-art performance in time series forecasting (TSF). It remains unclear whether existing Transformers fully leverage the intrinsic topological structure among tokens throughout intermediate layers. We propose the Topology Enhancement Method (TEM), a novel Transformer-based TSF method that explicitly and adaptively preserves token-level topology.
arXiv Detail & Related papers (2024-04-16T07:21:39Z) - DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome [10.051595222470304]
We argue that the computation and sample inefficiencies introduced by k-mer tokenization are primary obstacles in developing large genome foundational models.
We provide conceptual and empirical insights into genome tokenization, building on which we propose to replace k-mer tokenization with Byte Pair Encoding (BPE).
We introduce DNABERT-2, a refined genome foundation model that adapts an efficient tokenizer and employs multiple strategies to overcome input length constraints.
arXiv Detail & Related papers (2023-06-26T18:43:46Z)
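The k-mer vs. BPE contrast in the DNABERT-2 entry can be shown on a toy DNA string. The code below is a generic illustration of the two tokenization styles and assumes nothing about DNABERT-2's actual tokenizer: overlapping k-mers produce redundant tokens, while BPE-style merges compress frequent substrings.

```python
# Generic contrast between overlapping k-mer tokenization and a BPE-style
# merge pass on a DNA string (illustrative; not DNABERT-2's tokenizer).
from collections import Counter

def kmer_tokenize(seq, k=3):
    # overlapping k-mers: adjacent tokens share k-1 characters (redundant)
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def bpe_tokenize(seq, merges=3):
    tokens = list(seq)
    for _ in range(merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        a, b = pairs.most_common(1)[0][0]   # merge the most frequent pair
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

seq = "ATGATGATGC"
kmers = kmer_tokenize(seq)      # 8 overlapping 3-mers for a 10-base string
bpe_toks = bpe_tokenize(seq)    # fewer, non-overlapping learned tokens
```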
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.