Fugu-MT Paper Translation (Abstract): How Private Are DNA Embeddings? Inverting Foundation Model Representations of Genomic Sequences

Paper summary: How Private Are DNA Embeddings? Inverting Foundation Model Representations of Genomic Sequences

arxiv url: http://arxiv.org/abs/2603.06950v1
Date: Fri, 06 Mar 2026 23:52:26 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:13.487028
Title: How Private Are DNA Embeddings? Inverting Foundation Model Representations of Genomic Sequences
Title (reference translation): How private are DNA embeddings? Inverting foundation model representations of genomic sequences
Authors: Sofiane Ouaari, Jules Kreuer, Nico Pfeifer
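Abstract summary: DNA foundation models have become transformative tools in bioinformatics and healthcare applications. Their embeddings are increasingly being shared through Embeddings-as-a-Service (EaaS) frameworks. This study evaluates the resilience of DNA foundation models to model inversion attacks.
Reference score (independently computed attention metric): 0.45880283710344055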
License: http://creativecommons.org/licenses/by/4.0/
Abstract: DNA foundation models have become transformative tools in bioinformatics and healthcare applications. Trained on vast genomic datasets, these models can be used to generate sequence embeddings, dense vector representations that capture complex genomic information. These embeddings are increasingly being shared via Embeddings-as-a-Service (EaaS) frameworks to facilitate downstream tasks, while supposedly protecting the privacy of the underlying raw sequences. However, as this practice becomes more prevalent, the security of these representations is being called into question. This study evaluates the resilience of DNA foundation models to model inversion attacks, whereby adversaries attempt to reconstruct sensitive training data from model outputs. In our study, the model's output for reconstructing the DNA sequence is a zero-shot embedding, which is then fed to a decoder. We evaluated the privacy of three DNA foundation models: DNABERT-2, Evo 2, and Nucleotide Transformer v2 (NTv2). Our results show that per-token embeddings allow near-perfect sequence reconstruction across all models. For mean-pooled embeddings, reconstruction quality degrades as sequence length increases, though it remains substantially above random baselines. Evo 2 and NTv2 prove to be most vulnerable, especially for shorter sequences with reconstruction similarities > 90%, while DNABERT-2's BPE tokenization provides the greatest resilience. We found that the correlation between embedding similarity and sequence similarity was a key predictor of reconstruction success. Our findings emphasize the urgent need for privacy-aware design in genomic foundation models prior to their widespread deployment in EaaS settings. Training code, model weights and evaluation pipeline are released on: https://github.com/not-a-feature/DNA-Embedding-Inversion.
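Abstract (reference translation): DNA foundation models have become transformative tools in bioinformatics and healthcare applications. Trained on vast genomic datasets, these models can be used to generate sequence embeddings, dense vector representations that capture complex genomic information. These embeddings are increasingly shared via Embeddings-as-a-Service (EaaS) frameworks to facilitate downstream tasks, while supposedly protecting the privacy of the underlying raw sequences. However, as this practice spreads, the security of these representations is being called into question. This study evaluates the resilience of DNA foundation models to model inversion attacks, in which adversaries attempt to reconstruct sensitive training data from model outputs. In this study, the model output used to reconstruct the DNA sequence is a zero-shot embedding, which is then fed to a decoder. We evaluated the privacy of three DNA foundation models: DNABERT-2, Evo 2, and Nucleotide Transformer v2 (NTv2). The results show that per-token embeddings allow near-perfect sequence reconstruction across all models. For mean-pooled embeddings, reconstruction quality degrades as sequence length increases, though it remains substantially above random baselines. Evo 2 and NTv2 prove to be the most vulnerable, especially for shorter sequences, with reconstruction similarities > 90%, while DNABERT-2's BPE tokenization provides the greatest resilience. The correlation between embedding similarity and sequence similarity was found to be a key predictor of reconstruction success. The findings emphasize the urgent need for privacy-aware design in genomic foundation models before they are widely deployed in EaaS settings. Training code, model weights, and the evaluation pipeline are released at https://github.com/not-a-feature/DNA-Embedding-Inversion.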
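
To make the abstract's distinction between per-token and mean-pooled embeddings concrete, the following is a minimal sketch in Python. It is not the authors' code. It assumes a hypothetical toy encoder that maps each nucleotide to a one-hot vector in place of DNABERT-2, Evo 2, or NTv2, and it uses position-wise identity as a stand-in for the paper's reconstruction similarity, whose exact definition is not given in the abstract. All function names are illustrative.

```python
import math
import random

BASES = "ACGT"
# ASSUMPTION: toy stand-in for a foundation model. Each nucleotide becomes
# a one-hot vector; real models produce learned, context-dependent vectors.
ONE_HOT = {b: [1.0 if i == j else 0.0 for j in range(4)]
           for i, b in enumerate(BASES)}


def per_token_embeddings(seq):
    """One vector per nucleotide (the 'per-token' setting)."""
    return [ONE_HOT[b] for b in seq]


def mean_pool(token_vectors):
    """Average all token vectors into a single fixed-size vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(v[d] for v in token_vectors) / n for d in range(dim)]


def invert_per_token(token_vectors):
    """Trivial 'decoder' for the toy encoder: argmax of each token vector."""
    return "".join(BASES[max(range(4), key=v.__getitem__)]
                   for v in token_vectors)


def sequence_identity(a, b):
    """Fraction of positions with the same nucleotide (equal lengths only)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def similarity_correlation(seqs):
    """Correlate pooled-embedding similarity with sequence identity
    over all pairs of sequences."""
    pooled = [mean_pool(per_token_embeddings(s)) for s in seqs]
    emb_sims, seq_sims = [], []
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            emb_sims.append(cosine_similarity(pooled[i], pooled[j]))
            seq_sims.append(sequence_identity(seqs[i], seqs[j]))
    return pearson(emb_sims, seq_sims)


if __name__ == "__main__":
    seq = "ACGTTGCA"
    # Per-token vectors keep position, so the toy decoder recovers the input.
    print(invert_per_token(per_token_embeddings(seq)) == seq)
    # Pooling discards order: a sequence and its reverse pool identically.
    print(mean_pool(per_token_embeddings("ACGT"))
          == mean_pool(per_token_embeddings("TGCA")))
    # Correlation between embedding similarity and sequence similarity
    # on random toy sequences (value depends on the seed).
    rng = random.Random(0)
    seqs = ["".join(rng.choice(BASES) for _ in range(20)) for _ in range(30)]
    print(similarity_correlation(seqs))
```

In this toy setting the pooled vector retains only base composition, which is an extreme case: the abstract reports that real mean-pooled embeddings still support reconstruction well above random baselines, so learned embeddings evidently preserve far more than composition.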

Related papers