Fugu-MT Paper Translation (Abstract): ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

Paper summary: ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

arxiv url: http://arxiv.org/abs/2603.09692v1
Date: Tue, 10 Mar 2026 13:59:50 GMT
Status: Translation complete
Last updated in system: 2026-03-11 15:25:24.357097
Title: ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning
Title (reference translation): ActiveUltraFeedback: Efficient Preference Data Generation Using Active Learning
Authors: Davit Melikidze, Marian Schneider, Jessica Lam, Martin Wertich, Ido Hakimi, Barna Pásztor, Andreas Krause
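Abstract summary: Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs). The paper introduces ACTIVEULTRAFEEDBACK, a modular active learning pipeline that uses uncertainty estimates to dynamically identify the most informative responses for annotation.
Reference score (independently computed attention score): 28.150284620241422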
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bottlenecked by the high cost of acquiring preference data, especially in low-resource and expert domains. To address this, we introduce ACTIVEULTRAFEEDBACK, a modular active learning pipeline that leverages uncertainty estimates to dynamically identify the most informative responses for annotation. Our pipeline facilitates the systematic evaluation of standard response selection methods alongside DOUBLE REVERSE THOMPSON SAMPLING (DRTS) and DELTAUCB, two novel methods prioritizing response pairs with large predicted quality gaps, leveraging recent results showing that such pairs provide good signals for fine-tuning. Our experiments demonstrate that ACTIVEULTRAFEEDBACK yields high-quality datasets that lead to significant improvements in downstream performance, notably achieving comparable or superior results with as little as one-sixth of the annotated data relative to static baselines. Our pipeline is available at https://github.com/lasgroup/ActiveUltraFeedback and our preference datasets at https://huggingface.co/ActiveUltraFeedback.
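Abstract (reference translation): Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), but its effectiveness is bottlenecked by the high cost of acquiring preference data, especially in low-resource and expert domains. To address this, we introduce ACTIVEULTRAFEEDBACK, a modular active learning pipeline that uses uncertainty estimates to dynamically identify the most informative responses for annotation. Our pipeline facilitates the systematic evaluation of standard response selection methods alongside Double Reverse Thompson Sampling (DRTS) and DELTAUCB. Experiments show that ACTIVEULTRAFEEDBACK yields high-quality datasets that significantly improve downstream performance. The pipeline is available at https://github.com/lasgroup/ActiveUltraFeedback and the preference datasets at https://huggingface.co/ActiveUltraFeedback.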
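
The abstract describes selecting response pairs with large predicted quality gaps using uncertainty estimates, but it does not give the selection rules for DRTS or DELTAUCB. The following is only a minimal sketch of that general idea, not the paper's method: it assumes each candidate response has a predicted quality mean and standard deviation, and picks the pair whose optimistic gap (upper bound of one response minus lower bound of the other) is largest. The function name `select_optimistic_gap_pair` and the parameter `beta` are hypothetical and do not come from the paper or its repository.

```python
import numpy as np


def select_optimistic_gap_pair(mean, std, beta=1.0):
    """Pick (likely_better, likely_worse) indices with the largest optimistic gap.

    Hypothetical illustration only: the scoring rule below is an assumption,
    not the DELTAUCB or DRTS rule from the paper.
    """
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    if mean.shape != std.shape or mean.ndim != 1 or mean.size < 2:
        raise ValueError("need two or more responses with matching mean/std")

    upper = mean + beta * std  # optimistic quality estimate per response
    lower = mean - beta * std  # pessimistic quality estimate per response

    # gap[i, j] = optimistic score of response i minus pessimistic score of j
    gap = upper[:, None] - lower[None, :]
    np.fill_diagonal(gap, -np.inf)  # a response cannot be paired with itself

    i, j = np.unravel_index(np.argmax(gap), gap.shape)
    return int(i), int(j)


# Three candidate responses to one prompt; response 2 is highly uncertain.
pair = select_optimistic_gap_pair([0.0, 1.0, 0.75], [0.0, 0.0, 2.0], beta=1.0)
print(pair)
```

With `beta=0` the rule ignores uncertainty and reduces to pairing the highest and lowest predicted means; a larger `beta` pulls uncertain responses into the selected pair, which is the sense in which such a rule would be "active".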
