Fugu-MT Paper Translation (Abstract): Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning

Paper overview: Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning

arxiv url: http://arxiv.org/abs/2605.30039v1
Date: Thu, 28 May 2026 14:57:02 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:56.407936
Title: Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning
Title (reference translation): Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning
Authors: Tong Ye, Hang Yu, Tengfei Ma, Xuhong Zhang, Jianguo Li, Peng Di, Peiyu Liu, Jianwei Yin, Wenhai Wang
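Abstract summary: Large language models can achieve strong performance in specific domains through fine-tuning on domain-specific data. Existing data synthesis approaches rely on explicit domain descriptions expressed in natural language and on careful prompt engineering. This paper proposes DOMINO, a novel framework that learns a minimal sufficient domain representation from reference samples and uses it to guide the generation of domain-aligned synthetic data.
Reference score (independently computed attention metric): 72.60775633696593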
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models have demonstrated remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains through fine-tuning on domain-specific data. However, acquiring high-quality data for target domains remains a significant challenge. Existing data synthesis approaches follow a deductive paradigm, heavily relying on explicit domain descriptions expressed in natural language and careful prompt engineering, limiting their applicability in real-world scenarios where domains are difficult to describe or formally articulate. In this work, we tackle the underexplored problem of domain-specific data synthesis through an inductive paradigm, where the target domain is defined only through a set of reference examples, particularly when domain characteristics are difficult to articulate in natural language. We propose a novel framework, DOMINO, that learns a minimal sufficient domain representation from reference samples and leverages it to guide the generation of domain-aligned synthetic data. DOMINO integrates prompt tuning with a contrastive disentanglement objective to separate domain-level patterns from sample-specific noise, mitigating overfitting while preserving core domain characteristics. Theoretically, we prove that DOMINO expands the support of the synthetic data distribution, ensuring greater diversity. Empirically, on challenging coding benchmarks where domain definitions are implicit, fine-tuning on data synthesized by DOMINO improves Pass@1 accuracy by up to 4.63% over strong, instruction-tuned backbones, demonstrating its effectiveness and robustness. This work establishes a new paradigm for domain-specific data synthesis, enabling practical and scalable domain adaptation without manual prompt design or natural language domain specifications.
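Abstract (reference translation): Large language models have shown remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains through fine-tuning on domain-specific data. However, acquiring high-quality data for a target domain remains a major challenge. Existing data synthesis approaches rely heavily on explicit domain descriptions expressed in natural language and on careful prompt engineering, which limits their applicability in real-world scenarios where a domain is hard to describe or formalize. This work addresses the underexplored problem of domain-specific data synthesis under an inductive paradigm, in which the target domain is defined only by a set of reference examples, particularly for cases where domain characteristics are difficult to articulate in natural language. It proposes DOMINO, a novel framework that learns a minimal sufficient domain representation from reference samples and uses it to guide the generation of domain-aligned synthetic data. DOMINO integrates prompt tuning with a contrastive disentanglement objective to separate domain-level patterns from sample-specific noise, mitigating overfitting while preserving core domain characteristics. Theoretically, the authors prove that DOMINO expands the support of the synthetic data distribution, ensuring greater diversity. Empirically, on challenging coding benchmarks where domain definitions are implicit, fine-tuning on data synthesized by DOMINO improves Pass@1 accuracy by up to 4.63% over strong instruction-tuned backbones, demonstrating its effectiveness and robustness. This work establishes a new paradigm for domain-specific data synthesis, enabling practical and scalable domain adaptation without manual prompt design or natural-language domain specifications.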
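
Note on the metric: the abstract reports gains in Pass@1 but does not say how it is computed. The sketch below shows the unbiased pass@k estimator commonly used for code generation benchmarks, under the assumption that each problem is scored from n sampled solutions of which c pass the tests. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    # 1 minus the probability that a random k-subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int = 1) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n per problem, so with a single sample per problem Pass@1 is simply the fraction of problems solved.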
