Fugu-MT paper translation (abstract): Synthetic Data for any Differentiable Target

Paper overview: Synthetic Data for any Differentiable Target

arxiv url: http://arxiv.org/abs/2604.08423v1
Date: Thu, 09 Apr 2026 16:23:40 GMT
Status: translation complete
Last updated in system: 2026-04-10 18:34:06.021144
Title: Synthetic Data for any Differentiable Target
Title (reference translation): Synthetic Data for Any Differentiable Target
Authors: Tristan Thrush, Sung Min Park, Herman Brunborg, Luke Bailey, Marcel Roed, Neil Band, Christopher Potts, Tatsunori Hashimoto
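Abstract summary: The authors develop a reinforcement learning primitive, the Dataset Policy Gradient (DPG), that can precisely optimize synthetic data generators to produce a dataset of targeted examples. The approach achieves this by taking exact data attribution via higher-order gradients and using those scores as policy gradient rewards. The findings suggest that DPG is a powerful and flexible technique for shaping model properties using only synthetic training examples.
Reference score (attention level computed by this site): 59.540403676302994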
License: http://creativecommons.org/licenses/by/4.0/
Abstract: What are the limits of controlling language models via synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize synthetic data generators to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples cause the target model to do well on a differentiable metric of our choice. Our approach achieves this by taking exact data attribution via higher-order gradients and using those scores as policy gradient rewards. We prove that this procedure closely approximates the true, intractable gradient for the synthetic data generator. To illustrate the potential of DPG, we show that, using only SFT on generated examples, we can cause the target model's LM head weights to (1) embed a QR code, (2) embed the pattern $\texttt{67}$, and (3) have lower $\ell^2$ norm. We additionally show that we can cause the generator to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither of these objectives is conveyed in the generator's input prompts. These findings suggest that DPG is a powerful and flexible technique for shaping model properties using only synthetic training examples.
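Abstract (reference translation): What are the limits of controlling language models through synthetic training data? We develop the Dataset Policy Gradient (DPG), a reinforcement learning (RL) primitive that can precisely optimize a synthetic data generator to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples make the target model perform well on a differentiable metric of our choosing. The method achieves this by computing exact data attribution via higher-order gradients and using those scores as policy gradient rewards. We prove that this procedure closely approximates the true, intractable gradient for the synthetic data generator. To illustrate the potential of DPG, we show that, using only SFT on generated examples, we can cause the target model's LM head weights to (1) embed a QR code, (2) embed the pattern $\texttt{67}$, and (3) have lower $\ell^2$ norm. We further show that we can cause the generator to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither objective is conveyed in the generator's input prompts. These results suggest that DPG is a powerful and flexible technique for shaping model properties using only synthetic training examples.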
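The mechanism described in the abstract (attribution scores obtained by differentiating through training, then used as policy gradient rewards) can be illustrated with a toy sketch. This is not the paper's implementation. Everything here is an assumption made for illustration: the "target model" is a single scalar parameter, the "generator" is a softmax policy over five candidate examples, SFT is one gradient step on a weighted squared loss, and all function names are hypothetical. In this toy the derivative through the SFT step has a closed form; in a real model it would require higher-order automatic differentiation.

```python
import math
import random

def sft_step(theta, examples, weights, lr):
    """One SGD step on the weighted loss sum_i w_i * 0.5 * (theta - x_i)**2."""
    grad = sum(w * (theta - x) for x, w in zip(examples, weights))
    return theta - lr * grad

def metric(theta, target):
    """Toy differentiable metric of the fine-tuned model (higher is better)."""
    return -(theta - target) ** 2

def attribution_scores(theta, examples, lr, target):
    """Exact d metric / d w_i at w_i = 1/n, differentiating through the SFT step."""
    n = len(examples)
    theta_new = sft_step(theta, examples, [1.0 / n] * n, lr)
    dmetric_dtheta = -2.0 * (theta_new - target)
    # d theta_new / d w_i = -lr * (theta - x_i): a derivative of a gradient step.
    return [dmetric_dtheta * (-lr * (theta - x)) for x in examples]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train_generator(candidates, theta0=0.0, target=1.0, sft_lr=0.5,
                    policy_lr=0.5, batch=16, steps=300, seed=0):
    """REINFORCE on a categorical generator, rewarded by attribution scores."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    for _ in range(steps):
        probs = softmax(logits)
        idx = rng.choices(range(len(candidates)), weights=probs, k=batch)
        examples = [candidates[i] for i in idx]
        rewards = attribution_scores(theta0, examples, sft_lr, target)
        baseline = sum(rewards) / batch  # batch-mean baseline to reduce variance
        grad = [0.0] * len(logits)
        for i, r in zip(idx, rewards):
            adv = r - baseline
            for k in range(len(logits)):
                # d log pi(i) / d logit_k = 1[k == i] - probs[k]
                grad[k] += adv * ((1.0 if k == i else 0.0) - probs[k]) / batch
        logits = [l + policy_lr * g for l, g in zip(logits, grad)]
    return softmax(logits)

candidates = [-2.0, -1.0, 0.0, 1.0, 2.0]
probs = train_generator(candidates)
```

With the starting parameter at 0, a target of 1, and an SFT learning rate of 0.5, the fine-tuned parameter equals half the mean of the sampled examples, so the generator is pushed toward the largest candidate. The per-example reward is the sensitivity of the post-training metric to that example's weight, which is the sense in which attribution serves as the reward signal in this sketch.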


Related papers