Fugu-MT paper translation (abstract): The Economics of Model Collapse: Equilibrium, Welfare, and Optimal Provenance Subsidies in Synthetic Data Markets

Paper overview: The Economics of Model Collapse: Equilibrium, Welfare, and Optimal Provenance Subsidies in Synthetic Data Markets

arxiv url: http://arxiv.org/abs/2605.20279v1
Date: Tue, 19 May 2026 04:41:39 GMT
Status: translation complete
System update date: 2026-05-21 19:19:56.263734
Title: The Economics of Model Collapse: Equilibrium, Welfare, and Optimal Provenance Subsidies in Synthetic Data Markets
Title (reference translation): The Economics of Model Collapse: Equilibrium, Welfare, and Optimal Provenance Subsidies in Synthetic Data Markets
Authors: Gustav Olaf Yunus Laitinen-Fredriksson Lundström-Imanov
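Abstract summary: Recursive training on synthetic content induces a measurable and often irreversible loss of distributional fidelity. We develop the first unified microeconomic theory of synthetic data markets under model collapse.
Reference score (site-computed attention score): 0.0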
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generative artificial intelligence is rapidly transforming the supply side of training data: an increasing share of new tokens, images, and structured records is produced by previous-generation models rather than by human originators. Recursive training on such synthetic content induces a measurable and often irreversible loss of distributional fidelity, a phenomenon known as model collapse. We develop the first unified microeconomic theory of synthetic data markets under model collapse. We introduce the Synthetic Data Contamination Equilibrium (SDCE), prove existence and generic uniqueness, derive a welfare decomposition W = W_prod + W_cons - L_coll - L_info, establish a Wasserstein-gradient-flow mean-field collapse limit, prove an impossibility of information-constrained implementation, and obtain closed-form expressions for the welfare-maximizing provenance subsidy s* = KL(q||p)/(2 kappa) and the welfare-maximizing watermark strength w* = (1 - psi) KL(q||p)/(2 kappa psi). We prove an information-theoretic Cramer-Rao lower bound on any provenance estimator using only producer-side observations and show that the Provenance-Market Iterative Retraining (PMIR) algorithm attains this bound up to constants while converging to an epsilon-SDCE in O(epsilon^-2 log T) iterations. A reduced-form OLS estimation on a C4-synthetic benchmark over ten retraining generations yields a collapse-rate coefficient b-hat = 0.181 (HAC s.e. 0.024), within one standard error of the structural prediction 0.183. Calibrated experiments raise generation-ten model quality by 23.1 percent over the unregulated benchmark while lowering the 2-Wasserstein drift on a held-out diversity probe from 0.318 to 0.142. Scaling experiments over generations t in {1,...,10} recover a logarithmic-in-t collapse law log Q_t = log Q_0 - 0.183 t rho^2 with R^2 = 0.962.
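Abstract (reference translation): A growing share of new tokens, images, and structured records is produced by previous-generation models rather than by human originators. Recursive training on such synthetic content induces a measurable and often irreversible loss of distributional fidelity, a phenomenon known as model collapse. We develop the first unified microeconomic theory of synthetic data markets under model collapse. We introduce the Synthetic Data Contamination Equilibrium (SDCE), derive the welfare decomposition W = W_prod + W_cons - L_coll - L_info, establish a Wasserstein-gradient-flow mean-field collapse limit, prove an impossibility of information-constrained implementation, and obtain closed-form expressions for the welfare-maximizing provenance subsidy s* = KL(q||p)/(2 kappa) and the welfare-maximizing watermark strength w* = (1 - psi) KL(q||p)/(2 kappa psi). We prove an information-theoretic Cramer-Rao lower bound on any provenance estimator that uses only producer-side observations, and show that the Provenance-Market Iterative Retraining (PMIR) algorithm attains this bound up to constants while converging to an epsilon-SDCE in O(epsilon^-2 log T) iterations. A reduced-form OLS estimation on a C4-synthetic benchmark over ten retraining generations yields a collapse-rate coefficient b-hat = 0.181 (HAC s.e. 0.024), within one standard error of the structural prediction 0.183. In calibrated experiments, generation-ten model quality rises by 23.1 percent over the unregulated benchmark, and the 2-Wasserstein drift on a held-out diversity probe falls from 0.318 to 0.142. Scaling experiments recover a logarithmic collapse law log Q_t = log Q_0 - 0.183 t rho^2 with R^2 = 0.962.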
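
The closed-form expressions quoted in the abstract can be evaluated numerically. The following is a minimal sketch, not the authors' code. The abstract does not define its symbols, so the sketch assumes that q and p are discrete distributions on a shared support, that KL is measured in nats, and that kappa, psi, and rho are positive scalars. The function names and all input values are hypothetical and chosen only for illustration.

```python
import math


def kl_divergence(q, p):
    """KL(q||p) in nats for two discrete distributions on the same support."""
    if len(q) != len(p):
        raise ValueError("q and p must have the same length")
    total = 0.0
    for qi, pi in zip(q, p):
        if qi == 0.0:
            continue  # a term with q_i = 0 contributes nothing
        if pi == 0.0:
            return math.inf  # q puts mass where p has none
        total += qi * math.log(qi / pi)
    return total


def optimal_subsidy(kl, kappa):
    """s* = KL(q||p) / (2 kappa), as quoted in the abstract."""
    return kl / (2.0 * kappa)


def optimal_watermark(kl, kappa, psi):
    """w* = (1 - psi) KL(q||p) / (2 kappa psi), as quoted in the abstract."""
    return (1.0 - psi) * kl / (2.0 * kappa * psi)


def collapse_quality(q0, t, rho, b=0.183):
    """Q_t implied by log Q_t = log Q_0 - b * t * rho^2."""
    return q0 * math.exp(-b * t * rho ** 2)


# Illustrative inputs only; none of these values come from the paper.
q = [0.5, 0.3, 0.2]
p = [0.4, 0.4, 0.2]
kappa, psi, rho = 1.0, 0.5, 1.0

kl = kl_divergence(q, p)
s_star = optimal_subsidy(kl, kappa)
w_star = optimal_watermark(kl, kappa, psi)

print(f"KL(q||p) = {kl:.4f}")
print(f"s* = {s_star:.4f}, w* = {w_star:.4f}")
print(f"Q_10 / Q_0 = {collapse_quality(1.0, 10, rho):.4f}")
```

Dividing the two quoted expressions gives w* = s* (1 - psi)/psi, so the two policy levels coincide when psi = 0.5, as in the illustrative inputs above.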
