Fugu-MT Paper Translation (Abstract): LLM-as-a-Discriminator: When Synthetic Tables Still Look Real

Paper abstract: LLM-as-a-Discriminator: When Synthetic Tables Still Look Real

arxiv url: http://arxiv.org/abs/2606.09865v1
Date: Mon, 01 Jun 2026 02:10:02 GMT
Status: Translation complete
Last updated in system: 2026-06-10 15:40:57.970575
Title: LLM-as-a-Discriminator: When Synthetic Tables Still Look Real
Title (reference translation): LLM-as-a-Discriminator: When Synthetic Tables Still Look Real
Authors: Manel Slokom, Malek Slokom, Thierno Kante
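Abstract summary: We ask an LLM to classify each table sample as REAL or SYNTHETIC. We run three synthesis models, CTGAN, TVAE, and Gaussian Copula, on two public datasets, UCI Adult and ACS Census. The results suggest that LLM discrimination is a practical privacy audit signal when model choice, per-provider reporting, and data encoding are handled with care.
Reference score (independently computed attention score): 0.42481744176244507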
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Privacy and data sharing are often in tension. Many organizations use synthetic data to reduce privacy risk and still share useful data. For tabular data, auditing privacy remains hard. In many cases, even humans cannot easily tell if a table is real or synthetic. In this paper, we propose a method based on LLM discrimination. We ask an LLM to classify each table sample as REAL or SYNTHETIC. We test two settings: C1 with table only, and C2 with table plus distributional metadata. We use LLaMA as an open model and Gemini as a reference model. In our experiments, we run three synthesis models, CTGAN, TVAE, and Gaussian Copula, on two public datasets, UCI Adult and ACS Census. We collect 451 valid trials. Our results show clear differences between models. On Adult, LLaMA reaches DRS=0% in reported cells, while Gemini reaches DRS=100% for CTGAN and TVAE. On Census, LLaMA predicts SYNTHETIC for most samples, while Gemini stays high in C1 but drops for CTGAN and TVAE in C2. We also compare with a classifier two-sample test (C2ST) and record linkage as distributional baselines, and with a human pilot of 2 annotators and 240 trials. Our results show that LLM discrimination is a practical privacy audit signal when model choice, per provider reporting, and data encoding are handled with care. For reproducibility, code and experiment scripts are available at https://github.com/SlokomManel/LLM-as-a-Discriminator.
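Abstract (reference translation): Privacy and data sharing are often in tension. Many organizations use synthetic data to reduce privacy risk while still sharing useful data. For tabular data, auditing privacy remains difficult. In many cases, even humans cannot easily tell whether a table is real or synthetic. In this paper, we propose a method based on LLM discrimination. We ask an LLM to classify each table sample as REAL or SYNTHETIC. We test two settings: C1 uses the table only, and C2 uses the table plus distributional metadata. We use LLaMA as an open model and Gemini as a reference model. In the experiments, we run three synthesis models, CTGAN, TVAE, and Gaussian Copula, on two public datasets, UCI Adult and ACS Census. We collect 451 valid trials. The results show clear differences between models. On Adult, LLaMA reaches DRS=0% in the reported cells, while Gemini reaches DRS=100% for CTGAN and TVAE. On Census, LLaMA predicts SYNTHETIC for most samples, while Gemini stays high in C1 but drops for CTGAN and TVAE in C2. We also compare against a classifier two-sample test (C2ST) and record linkage as distributional baselines, and against a human pilot with 2 annotators and 240 trials. The results suggest that LLM discrimination is a practical privacy audit signal when model choice, per-provider reporting, and data encoding are handled with care. For reproducibility, code and experiment scripts are available at https://github.com/SlokomManel/LLM-as-a-Discriminator.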
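
The abstract names a classifier two-sample test (C2ST) as one of its distributional baselines but does not say how it was implemented. As a rough illustration of the general idea only, the sketch below labels real rows 0 and synthetic rows 1, holds out part of the pooled data, and measures how well a 1-nearest-neighbour classifier separates the two. Accuracy near 0.5 suggests the samples are hard to tell apart, while accuracy near 1.0 suggests they are easy to separate. The function name `c2st_accuracy`, the 1-NN classifier, the 30% hold-out, and the toy Gaussian data are all assumptions made for this example and are not taken from the paper or its repository. Rows are assumed to be numerically encoded already.

```python
import random


def c2st_accuracy(real_rows, synth_rows, test_fraction=0.3, seed=0):
    """Hold-out accuracy of a 1-nearest-neighbour classifier that tries to
    separate real rows (label 0) from synthetic rows (label 1).

    Hypothetical helper for illustration; rows are tuples of numbers.
    """
    rng = random.Random(seed)
    labelled = [(row, 0) for row in real_rows] + [(row, 1) for row in synth_rows]
    rng.shuffle(labelled)

    # Split the pooled, shuffled data into a test part and a training part.
    n_test = max(1, int(len(labelled) * test_fraction))
    test, train = labelled[:n_test], labelled[n_test:]

    def sq_dist(a, b):
        # Squared Euclidean distance between two numeric rows.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    correct = 0
    for row, label in test:
        # Predict the label of the closest training row.
        _, nearest_label = min(train, key=lambda item: sq_dist(item[0], row))
        correct += nearest_label == label
    return correct / len(test)


# Toy data (not from the paper): two numeric columns per row.
rng = random.Random(42)
real = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
close_synth = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
far_synth = [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(200)]

acc_close = c2st_accuracy(real, close_synth)  # same distribution as real
acc_far = c2st_accuracy(real, far_synth)      # clearly shifted distribution
print(f"close synthetic: {acc_close:.2f}, far synthetic: {acc_far:.2f}")
```

In practice, categorical columns such as those in UCI Adult would need an encoding step before distances are meaningful, which is one reason the choice of data encoding matters for any discriminator-based audit.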

Related papers