Fugu-MT Paper Translation (Abstract): PrivCode: When Code Generation Meets Differential Privacy

Paper abstract: PrivCode: When Code Generation Meets Differential Privacy

arxiv url: http://arxiv.org/abs/2512.05459v1
Date: Fri, 05 Dec 2025 06:27:06 GMT
Status: Translation complete
Last updated in system: 2025-12-13 22:40:56.91998
Title: PrivCode: When Code Generation Meets Differential Privacy
Authors: Zheng Liu, Chen Gong, Terry Yue Zhuo, Kecen Li, Weichen Yu, Matt Fredrikson, Tianhao Wang
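Abstract summary: Differentially private code generation provides theoretical guarantees for protecting sensitive code. PrivCode is the first DP synthesizer specifically designed for code datasets. It incorporates a two-stage framework to improve both privacy and utility.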
Reference score (independently computed attention score): 28.319022961888006
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have presented outstanding performance in code generation and completion. However, fine-tuning these models on private datasets can raise privacy and proprietary concerns, such as the leakage of sensitive personal information. Differentially private (DP) code generation provides theoretical guarantees for protecting sensitive code by generating synthetic datasets that preserve statistical properties while reducing privacy leakage concerns. However, DP code generation faces significant challenges due to the strict syntactic dependencies and the privacy-utility trade-off. We propose PrivCode, the first DP synthesizer specifically designed for code datasets. It incorporates a two-stage framework to improve both privacy and utility. In the first stage, termed "privacy-sanitizing", PrivCode generates DP-compliant synthetic code by training models using DP-SGD while introducing syntactic information to preserve code structure. The second stage, termed "utility-boosting", fine-tunes a larger pre-trained LLM on the synthetic privacy-free code to mitigate the utility loss caused by DP, enhancing the utility of the generated code. Extensive experiments on four LLMs show that PrivCode generates higher-utility code across various testing tasks under four benchmarks. The experiments also confirm its ability to protect sensitive data under varying privacy budgets. We provide the replication package at the anonymous link.
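The abstract says the first stage trains models with DP-SGD but gives no implementation details. As a rough illustration of what a single DP-SGD update generally involves (clip each example's gradient, sum, add Gaussian noise, average), here is a minimal pure-Python sketch. It is not PrivCode's code: the function names `clip_to_norm` and `dp_sgd_step`, the parameters `max_norm` and `noise_multiplier`, and the toy gradients are all assumptions made for this example, and no privacy accounting is performed.

```python
import math
import random


def clip_to_norm(grad, max_norm):
    """Scale a gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One generic DP-SGD update (hypothetical helper, not from the paper)."""
    batch_size = len(per_example_grads)
    # Bound each example's influence by clipping its gradient individually.
    clipped = [clip_to_norm(g, max_norm) for g in per_example_grads]
    # Sum the clipped gradients coordinate by coordinate.
    summed = [sum(column) for column in zip(*clipped)]
    # Add Gaussian noise with std noise_multiplier * max_norm to each coordinate.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    # Average over the batch and take a plain gradient-descent step.
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]


# Toy usage: two parameters, two per-example gradients (made-up values).
rng = random.Random(0)
new_params = dp_sgd_step(
    params=[0.0, 0.0],
    per_example_grads=[[3.0, 4.0], [0.3, 0.4]],
    lr=0.1,
    max_norm=1.0,
    noise_multiplier=1.0,
    rng=rng,
)
print(new_params)
```

A larger `noise_multiplier` generally corresponds to a stronger privacy guarantee at the cost of noisier updates, which is one face of the privacy-utility trade-off the abstract mentions. Turning the noise level into a concrete privacy budget requires a separate accountant, which this sketch omits.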
