AlphaFold Database Debiasing for Robust Inverse Folding
- URL: http://arxiv.org/abs/2506.08365v1
- Date: Tue, 10 Jun 2025 02:25:31 GMT
- Title: AlphaFold Database Debiasing for Robust Inverse Folding
- Authors: Cheng Tan, Zhenxiao Cao, Zhangyang Gao, Siyuan Li, Yufei Huang, Stan Z. Li
- Abstract summary: We introduce a Debiasing Structure AutoEncoder (DeSAE) that learns to reconstruct native-like conformations from intentionally corrupted backbone geometries. At inference, applying DeSAE to AFDB structures produces debiased structures that significantly improve inverse folding performance.
- Score: 58.792020809180336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The AlphaFold Protein Structure Database (AFDB) offers unparalleled structural coverage at near-experimental accuracy, positioning it as a valuable resource for data-driven protein design. However, its direct use in training deep models that are sensitive to fine-grained atomic geometry, such as inverse folding, exposes a critical limitation. Comparative analysis of structural feature distributions reveals that AFDB structures exhibit distinct statistical regularities, reflecting a systematic geometric bias that deviates from the conformational diversity found in experimentally determined structures from the Protein Data Bank (PDB). While AFDB structures are cleaner and more idealized, PDB structures capture the intrinsic variability and physical realism essential for generalization in downstream tasks. To address this discrepancy, we introduce a Debiasing Structure AutoEncoder (DeSAE) that learns to reconstruct native-like conformations from intentionally corrupted backbone geometries. By training the model to recover plausible structural states, DeSAE implicitly captures a more robust and natural structural manifold. At inference, applying DeSAE to AFDB structures produces debiased structures that significantly improve inverse folding performance across multiple benchmarks. This work highlights the critical impact of subtle systematic biases in predicted structures and presents a principled framework for debiasing, significantly boosting the performance of structure-based learning tasks like inverse folding.
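To make the corrupt-and-reconstruct idea concrete, the minimal sketch below trains a toy structure autoencoder to recover backbone coordinates from noised copies. It assumes a PyTorch setup; the MLP encoder/decoder, Gaussian corruption scheme, and MSE loss are illustrative placeholders rather than the authors' DeSAE implementation.

```python
# Minimal sketch of corrupt-and-reconstruct training in the spirit of DeSAE.
# Architecture, noise model, and loss are illustrative assumptions, not the released code.
import torch
import torch.nn as nn

class StructureAutoEncoder(nn.Module):
    """Toy autoencoder over flattened backbone coordinates (L residues x 3 atoms x 3 coords)."""
    def __init__(self, num_residues: int = 128, hidden: int = 512, latent: int = 128):
        super().__init__()
        dim = num_residues * 3 * 3  # N, CA, C atoms per residue
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        flat = coords.flatten(start_dim=1)
        return self.decoder(self.encoder(flat)).view_as(coords)

def corrupt(coords: torch.Tensor, sigma: float = 0.3) -> torch.Tensor:
    """Perturb backbone geometry with Gaussian noise (a stand-in for the paper's corruption scheme)."""
    return coords + sigma * torch.randn_like(coords)

def train_step(model, optimizer, native_coords):
    """One step: reconstruct native-like coordinates from a corrupted copy."""
    model.train()
    noisy = corrupt(native_coords)
    recon = model(noisy)
    loss = nn.functional.mse_loss(recon, native_coords)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = StructureAutoEncoder()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    batch = torch.randn(8, 128, 3, 3)  # placeholder for batches of experimental (PDB) backbones
    print(train_step(model, opt, batch))
    # At inference, feeding AFDB backbones through the trained model would yield "debiased" coordinates:
    # debiased = model(afdb_coords)
```

In this reading, training only ever sees experimentally determined structures and their corrupted copies, so the model learns a native-like manifold; the debiasing step is simply a forward pass applied to predicted structures.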
Related papers
- DISPROTBENCH: A Disorder-Aware, Task-Rich Benchmark for Evaluating Protein Structure Prediction in Realistic Biological Contexts [76.59606029593085]
DisProtBench is a benchmark for evaluating protein structure prediction models (PSPMs) under structural disorder and complex biological conditions.
DisProtBench spans three key axes: data complexity, task diversity, and interpretability.
Results reveal significant variability in model robustness under disorder, with low-confidence regions linked to functional prediction failures.
arXiv Detail & Related papers (2025-06-18T23:58:22Z) - TopoFR: A Closer Look at Topology Alignment on Face Recognition [42.936929062768826]
We propose TopoFR, a novel FR model that leverages a topological structure alignment strategy called PTSA and a hard sample mining strategy named SDE.
PTSA uses persistent homology to align the topological structures of the input and latent spaces, effectively preserving the structure information and improving the generalization performance of FR model.
Experimental results on popular face benchmarks demonstrate the superiority of our TopoFR over the state-of-the-art methods.
arXiv Detail & Related papers (2024-10-14T14:58:30Z) - Fast and Reliable Probabilistic Reflectometry Inversion with Prior-Amortized Neural Posterior Estimation [73.81105275628751]
Finding all structures compatible with reflectometry data is computationally prohibitive for standard algorithms.
We address the resulting lack of reliability with a probabilistic deep learning method that identifies all realistic structures in seconds.
Our method, Prior-Amortized Neural Posterior Estimation (PANPE), combines simulation-based inference with novel adaptive priors.
arXiv Detail & Related papers (2024-07-26T10:29:16Z) - Protein 3D Graph Structure Learning for Robust Structure-based Protein Property Prediction [43.46012602267272]
Protein structure-based property prediction has emerged as a promising approach for various biological tasks.
Current practices, which simply employ accurately predicted structures during inference, suffer from notable degradation in prediction accuracy.
Our framework is model-agnostic and effective in improving the property prediction of both predicted structures and experimental structures.
arXiv Detail & Related papers (2023-10-14T08:43:42Z) - Geometric Deep Learning for Structure-Based Drug Design: A Survey [83.87489798671155]
Structure-based drug design (SBDD) leverages the three-dimensional geometry of proteins to identify potential drug candidates.
Recent advancements in geometric deep learning, which effectively integrate and process 3D geometric data, have significantly propelled the field forward.
arXiv Detail & Related papers (2023-06-20T14:21:58Z) - dotears: Scalable, consistent DAG estimation using observational and interventional data [1.220743263007369]
Causal gene regulatory networks can be represented by directed acyclic graph (DAG)
We present $\texttt{dotears}$ [doo-tairs], a continuous optimization framework to infer a single causal structure.
We show that $\texttt{dotears}$ is a provably consistent estimator of the true DAG under mild assumptions.
arXiv Detail & Related papers (2023-05-30T17:03:39Z) - SFP: Spurious Feature-targeted Pruning for Out-of-Distribution Generalization [38.37530720506389]
We propose a novel Spurious Feature-targeted model Pruning framework, dubbed SFP, to automatically explore invariant substructures.
SFP can significantly outperform both structure-based and non-structure-based OOD generalization SOTAs, with accuracy improvement up to 4.72% and 23.35%, respectively.
arXiv Detail & Related papers (2023-05-19T11:46:36Z) - StrAE: Autoencoding for Pre-Trained Embeddings using Explicit Structure [5.2869308707704255]
StrAE is a Structured Autoencoder framework that, through strict adherence to explicit structure, enables effective learning of multi-level representations.
We show that our results are directly attributable to the informativeness of the structure provided as input, and that this is not the case for existing tree models.
We then extend StrAE to allow the model to define its own compositions using a simple localised-merge algorithm.
arXiv Detail & Related papers (2023-05-09T16:20:48Z) - RDesign: Hierarchical Data-efficient Representation Learning for Tertiary Structure-based RNA Design [65.41144149958208]
This study aims to systematically construct a data-driven RNA design pipeline.
We crafted a benchmark dataset and designed a comprehensive structural modeling approach to represent the complex RNA tertiary structure.
We incorporated extracted secondary structures with base pairs as prior knowledge to facilitate the RNA design process.
arXiv Detail & Related papers (2023-01-25T17:19:49Z) - Exploring Optimal Substructure for Out-of-distribution Generalization via Feature-targeted Model Pruning [23.938392334438582]
We propose a novel Spurious Feature-targeted model Pruning framework, dubbed SFP, to automatically explore invariant substructures.
SFP can significantly outperform both structure-based and non-structure-based OOD generalization SOTAs, with accuracy improvement up to 4.72% and 23.35%, respectively.
arXiv Detail & Related papers (2022-12-19T13:51:06Z) - Structural Bias for Aspect Sentiment Triplet Extraction [15.273669042985883]
Structural bias has been exploited for aspect sentiment triplet extraction (ASTE) and led to improved performance.
It is recognized that explicitly incorporating structural bias would have a negative impact on efficiency, whereas pretrained language models (PLMs) can already capture implicit structures.
We propose to address the efficiency issues by using an adapter to integrate structural bias in the PLM and using a cheap-to-compute relative position structure.
arXiv Detail & Related papers (2022-09-02T05:02:18Z) - Differentiable and Transportable Structure Learning [73.84540901950616]
We introduce D-Struct, which recovers transportability in the discovered structures through a novel architecture and loss function.
Because D-Struct remains differentiable, our method can be easily adopted in existing differentiable architectures.
arXiv Detail & Related papers (2022-06-13T17:50:53Z) - EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based Models [53.17320541056843]
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network.
Compared with traditional Rosetta-based structure optimization routines, our EBM-Fold approach can efficiently produce high-quality decoys; a sketch of this differentiable refinement idea appears after this list.
arXiv Detail & Related papers (2021-05-11T03:40:29Z)
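As a companion to the EBM-Fold entry above, the sketch below illustrates gradient-based refinement of decoy coordinates under a learned energy, which is the general idea behind fully-differentiable structure optimization. The toy energy network, optimizer, and step count are assumptions for illustration, not the EBM-Fold implementation.

```python
# Illustrative sketch of gradient-based structure refinement under a learned energy.
# The energy model and hyperparameters are hypothetical; a real model would be trained on native structures.
import torch
import torch.nn as nn

class ToyEnergy(nn.Module):
    """Hypothetical learned energy over CA coordinates (lower energy = more native-like)."""
    def __init__(self, num_residues: int = 64, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(num_residues * 3, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        return self.net(coords.flatten(start_dim=1)).squeeze(-1)

def refine(energy: nn.Module, coords: torch.Tensor, steps: int = 100, lr: float = 1e-2) -> torch.Tensor:
    """Descend the learned energy landscape directly in coordinate space."""
    coords = coords.clone().requires_grad_(True)
    opt = torch.optim.Adam([coords], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        energy(coords).sum().backward()
        opt.step()
    return coords.detach()

if __name__ == "__main__":
    energy = ToyEnergy()
    init = torch.randn(4, 64, 3)  # random initial decoys
    refined = refine(energy, init)
    print(refined.shape)
```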