Fugu-MT Paper Translation (Abstract): Attributes as Textual Genes: Leveraging LLMs as Genetic Algorithm Simulators for Conditional Synthetic Data Generation

Paper overview: Attributes as Textual Genes: Leveraging LLMs as Genetic Algorithm Simulators for Conditional Synthetic Data Generation

arxiv url: http://arxiv.org/abs/2509.02040v1
Date: Tue, 02 Sep 2025 07:35:20 GMT
Status: Translation complete
System update date: 2025-09-04 15:17:03.944831
Title: Attributes as Textual Genes: Leveraging LLMs as Genetic Algorithm Simulators for Conditional Synthetic Data Generation
Title (reference translation): Attributes as Textual Genes: Using LLMs as Genetic Algorithm Simulators for Conditional Synthetic Data Generation
Authors: Guangzeng Han, Weisi Liu, Xiaolei Huang
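Abstract summary: Genetic Prompt is a framework that combines genetic algorithms with Large Language Models (LLMs) to augment synthetic data generation. The proposed method treats semantic text attributes as gene sequences and uses the LLM to simulate crossover and mutation operations. The results indicate that Genetic Prompt is an effective method for producing high-quality synthetic data for a wide range of NLP applications.
Reference score (attention score computed independently by the site): 4.268367038882249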
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) excel at generating synthetic data, but ensuring its quality and diversity remains challenging. We propose Genetic Prompt, a novel framework that combines genetic algorithms with LLMs to augment synthetic data generation. Our approach treats semantic text attributes as gene sequences and leverages the LLM to simulate crossover and mutation operations. This genetic process enhances data quality and diversity by creating novel attribute combinations, yielding synthetic distributions closer to real-world data. To optimize parent selection, we also integrate an active learning scheme that expands the offspring search space. Our experiments on multiple NLP tasks reveal several key findings: Genetic Prompt not only significantly outperforms state-of-the-art baselines but also shows robust performance across various generator model sizes and scales. Moreover, we demonstrate that fusing our synthetic data with the original training set significantly boosts downstream model performance, particularly for class-imbalanced scenarios. Our findings validate that Genetic Prompt is an effective method for producing high-quality synthetic data for a wide range of NLP applications.
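Abstract (reference translation): Large Language Models (LLMs) excel at generating synthetic data, but ensuring its quality and diversity remains difficult. This paper proposes Genetic Prompt, a new framework for synthetic data generation that combines genetic algorithms with LLMs. The proposed method treats semantic text attributes as gene sequences and uses the LLM to simulate crossover and mutation operations. This genetic process improves data quality and diversity by creating new attribute combinations and producing synthetic distributions closer to real data. To optimize parent selection, it also integrates an active learning scheme that expands the offspring search space. Genetic Prompt not only significantly outperforms state-of-the-art baselines but also performs robustly across generator model sizes and scales. In addition, fusing the synthetic data with the original training set significantly improves downstream model performance, particularly in class-imbalanced scenarios. These results indicate that Genetic Prompt is an effective method for producing high-quality synthetic data for a wide range of NLP applications.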
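The abstract describes the mechanism only at a high level: text attributes act as genes, and the LLM is asked to carry out crossover and mutation. The following is a minimal Python sketch of that idea, not the authors' implementation. The attribute names, the prompt wording, the word-overlap parent selection (a crude stand-in for the active learning scheme) and the `llm` callable are all assumptions introduced for illustration.

```python
import itertools
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical "genes": the abstract does not list the attributes actually used.
ATTRIBUTES = ["length", "writing style", "sentence structure", "topic"]


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a crude stand-in for a similarity measure."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def select_parents(pool: Sequence[str]) -> Tuple[str, str]:
    """Pick the least similar pair (assumed proxy for the active learning step).

    Raises ValueError if the pool has fewer than two samples.
    """
    return min(itertools.combinations(pool, 2), key=lambda p: word_overlap(*p))


def build_genetic_prompt(parent_a: str, parent_b: str, label: str,
                         attributes: Sequence[str], rng: random.Random) -> str:
    """Compose a prompt that asks an LLM to act as crossover and mutation operator."""
    if len(attributes) < 3:
        raise ValueError("need at least three attributes")
    genes = list(attributes)
    rng.shuffle(genes)
    mutated, rest = genes[0], genes[1:]      # one gene mutates
    cut = len(rest) // 2                     # the others are split between parents
    from_a, from_b = rest[:cut], rest[cut:]
    return (
        f"Write one new example for the class '{label}'.\n"
        f"Parent A: {parent_a}\n"
        f"Parent B: {parent_b}\n"
        f"Crossover: inherit {', '.join(from_a)} from Parent A and "
        f"{', '.join(from_b)} from Parent B.\n"
        f"Mutation: choose a {mutated} that differs from both parents.\n"
    )


def evolve(pool: Sequence[str], label: str, llm: Callable[[str], str],
           generations: int, seed: int = 0) -> List[str]:
    """Grow the population by one LLM-generated offspring per generation."""
    rng = random.Random(seed)
    population = list(pool)
    for _ in range(generations):
        a, b = select_parents(population)
        population.append(llm(build_genetic_prompt(a, b, label, ATTRIBUTES, rng)))
    return population
```

Here `llm` is any function that maps a prompt string to generated text, so a real model client or a stub can be passed in. Choosing dissimilar parents is only one way to widen the space of offspring; the abstract does not say how the active learning scheme scores candidates.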

Related papers