Fugu-MT Paper Translation (Summary): Quantifying Memorization and Privacy Risks in Genomic Language Models

Paper summary: Quantifying Memorization and Privacy Risks in Genomic Language Models

arxiv url: http://arxiv.org/abs/2603.08913v1
Date: Mon, 09 Mar 2026 20:30:37 GMT
Status: Translation complete
Last updated in system: 2026-03-11 15:25:23.817159
Title: Quantifying Memorization and Privacy Risks in Genomic Language Models
Authors: Alexander Nemecek, Wenbiao Li, Xiaoqian Jiang, Jaideep Vaidya, Erman Ayday
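Abstract summary: Genomic language models (GLMs) have emerged as powerful tools for learning representations of DNA sequences. GLMs risk memorizing specific sequences from their training data, raising serious concerns around privacy, data leakage, and regulatory compliance. The paper presents a comprehensive, multi-vector privacy evaluation framework designed to quantify memorization risks in GLMs.
Reference score (attention score computed independently by this site): 46.592953963976356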
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Genomic language models (GLMs) have emerged as powerful tools for learning representations of DNA sequences, enabling advances in variant prediction, regulatory element identification, and cross-task transfer learning. However, as these models are increasingly trained or fine-tuned on sensitive genomic cohorts, they risk memorizing specific sequences from their training data, raising serious concerns around privacy, data leakage, and regulatory compliance. Despite growing awareness of memorization risks in general-purpose language models, little systematic evaluation exists for these risks in the genomic domain, where data exhibit unique properties such as a fixed nucleotide alphabet, strong biological structure, and individual identifiability. We present a comprehensive, multi-vector privacy evaluation framework designed to quantify memorization risks in GLMs. Our approach integrates three complementary risk assessment methodologies: perplexity-based detection, canary sequence extraction, and membership inference. These are combined into a unified evaluation pipeline that produces a worst-case memorization risk score. To enable controlled evaluation, we plant canary sequences at varying repetition rates into both synthetic and real genomic datasets, allowing precise quantification of how repetition and training dynamics influence memorization. We evaluate our framework across multiple GLM architectures, examining the relationship between sequence repetition, model capacity, and memorization risk. Our results establish that GLMs exhibit measurable memorization and that the degree of memorization varies across architectures and training regimes. These findings reveal that no single attack vector captures the full scope of memorization risk, underscoring the need for multi-vector privacy auditing as a standard practice for genomic AI systems.
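The abstract describes planting canary sequences at varying repetition rates and combining several attack signals into a worst-case score. The following is only a minimal sketch of that idea, not the authors' pipeline. It assumes a toy smoothed k-mer model in place of a real GLM, covers only two of the three vectors (perplexity and greedy extraction, with membership inference omitted), normalizes the perplexity signal against random reference sequences, and takes the worst case as a simple maximum. All function names, the context length `K`, and the dataset sizes are hypothetical choices made for illustration.

```python
import math
import random
from collections import defaultdict

ALPHABET = "ACGT"
K = 8  # context length of the toy model (an assumption, not from the paper)

def random_dna(rng, length):
    """Draw a uniformly random nucleotide string."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))

def train_kmer_model(sequences, k=K):
    """Count which nucleotide follows each k-length context."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq) - k):
            counts[seq[i:i + k]][seq[i + k]] += 1
    return counts

def perplexity(counts, seq, k=K, alpha=1.0):
    """Perplexity of seq (longer than k) under the add-alpha smoothed model."""
    log_prob, n = 0.0, 0
    for i in range(len(seq) - k):
        ctx = counts.get(seq[i:i + k], {})
        total = sum(ctx.values()) + alpha * len(ALPHABET)
        log_prob += math.log((ctx.get(seq[i + k], 0) + alpha) / total)
        n += 1
    return math.exp(-log_prob / n)

def extraction_rate(counts, canary, k=K):
    """Greedily decode from the canary's first k bases and return the
    fraction of the remaining bases that are reproduced in place."""
    out = canary[:k]
    while len(out) < len(canary):
        ctx = counts.get(out[-k:], {})
        if not ctx:
            break  # unseen context: the toy model cannot continue
        out += max(sorted(ctx), key=ctx.get)  # most frequent next base
    suffix = canary[k:]
    hits = sum(a == b for a, b in zip(out[k:], suffix))
    return hits / len(suffix)

def perplexity_signal(counts, canary, references, k=K):
    """Hypothetical normalization: 0 when the canary looks like a random
    reference sequence, approaching 1 as its perplexity drops toward 0."""
    ref = sum(perplexity(counts, r, k) for r in references) / len(references)
    return min(1.0, max(0.0, 1.0 - perplexity(counts, canary, k) / ref))

def worst_case_risk(signals):
    """Assumed aggregation: the largest per-vector signal."""
    return max(signals.values())

rng = random.Random(0)
background = [random_dna(rng, 200) for _ in range(200)]  # synthetic corpus
references = [random_dna(rng, 60) for _ in range(20)]    # never trained on
canary = random_dna(rng, 60)

for repeats in (0, 1, 10, 50):
    model = train_kmer_model(background + [canary] * repeats)
    signals = {
        "perplexity": perplexity_signal(model, canary, references),
        "extraction": extraction_rate(model, canary),
    }
    print(repeats, signals, worst_case_risk(signals))
```

Taking the maximum means a canary counts as at risk if any one vector exposes it, which fits the abstract's point that no single attack vector captures the full scope of memorization. A real audit would replace the k-mer counts with the GLM's token likelihoods and add a membership inference signal.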
