Fugu-MT Paper Translation (Abstract): Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization

Paper overview: Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization

arxiv url: http://arxiv.org/abs/2508.13993v1
Date: Tue, 19 Aug 2025 16:33:55 GMT
Status: Translation complete
System update date: 2025-08-20 15:36:32.017934
Title: Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization
Title (reference translation): Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization
Authors: Shaohua Duan, Xinze Li, Zhenghao Liu, Xiaoyuan Yi, Yukun Yan, Shuo Wang, Yu Gu, Ge Yu, Maosong Sun
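Abstract summary: LongMab-PO is a novel framework that generates high-quality and diverse responses for long-context modeling tasks. Experimental results show that LongMab-PO significantly improves the diversity and quality of preference data pairs.
Reference score (independently computed attention score): 56.97588709890706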
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Long-context modeling is critical for a wide range of real-world tasks, including long-context question answering, summarization, and complex reasoning tasks. Recent studies have explored fine-tuning Large Language Models (LLMs) with synthetic data to enhance their long-context capabilities. However, the effectiveness of such approaches is often limited by the low diversity and factual inconsistencies in the generated data. To address these challenges, we propose LongMab-PO, a novel framework that leverages a Multi-Armed Bandit (MAB) rollout strategy to identify the most informative chunks from the given long context for sampling high-quality and diverse responses and constructing preference data pairs for Direct Preference Optimization (DPO) training. Specifically, we treat context chunks as arms of MAB, select chunks based on their expected reward scores to input into LLMs to generate responses, and iteratively update these scores based on reward feedback. This exploration and exploitation process enables the model to focus on the most relevant context segments, thereby generating and collecting high-quality and diverse responses. Finally, we collect these generated responses from the rollout process and apply the DPO method to further optimize the LLM. Experimental results show that LongMab-PO significantly improves the diversity and quality of preference data pairs, achieving state-of-the-art performance on long-context reasoning benchmarks. All code and data will be released on https://github.com/NEUIR/LongMab-PO.
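Abstract (reference translation): Long-context modeling is important for a wide range of real-world tasks, including long-context question answering, summarization, and complex reasoning. Recent work has fine-tuned Large Language Models (LLMs) on synthetic data to improve their long-context capabilities. However, the effectiveness of such approaches is often limited by the low diversity and factual inconsistencies of the generated data. To address these challenges, we propose LongMab-PO, a novel framework that leverages a Multi-Armed Bandit (MAB) rollout strategy. Specifically, context chunks are treated as MAB arms; chunks are selected by their expected reward scores and fed to the LLM to generate responses, and the scores are updated iteratively from reward feedback. This exploration-and-exploitation process lets the model focus on the most relevant context segments and thereby generate and collect high-quality, diverse responses. Finally, the responses generated during the rollout are collected and DPO is applied to further optimize the LLM. Experimental results show that LongMab-PO significantly improves the diversity and quality of preference data pairs and achieves state-of-the-art performance on long-context reasoning benchmarks. All code and data will be released at https://github.com/NEUIR/LongMab-PO.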
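The rollout loop described in the abstract (chunks as arms, selection by expected reward, iterative score updates, then preference pairs for DPO) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not state the selection rule, so a UCB1-style score is assumed here, and `ChunkArm`, `rollout`, `build_preference_pair`, the `generate` callable, and the `reward_fn` callable are all hypothetical names. Pairing the highest-reward response with the lowest-reward one is likewise an assumption.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class ChunkArm:
    """One context chunk treated as a bandit arm (hypothetical structure)."""
    text: str
    pulls: int = 0
    mean_reward: float = 0.0

    def update(self, reward: float) -> None:
        # Incremental running mean of the rewards observed for this chunk.
        self.pulls += 1
        self.mean_reward += (reward - self.mean_reward) / self.pulls


def ucb_score(arm: ChunkArm, total_pulls: int, c: float = 1.0) -> float:
    # Assumed UCB1-style score: mean reward plus an exploration bonus.
    # Chunks that were never selected get infinite priority.
    if arm.pulls == 0:
        return math.inf
    return arm.mean_reward + c * math.sqrt(math.log(total_pulls) / arm.pulls)


def rollout(
    chunks: List[str],
    generate: Callable[[List[str]], str],   # stand-in for an LLM call
    reward_fn: Callable[[str], float],      # stand-in for reward feedback
    rounds: int = 8,
    k: int = 2,
    c: float = 1.0,
) -> Tuple[List[ChunkArm], List[Tuple[str, float]]]:
    arms = [ChunkArm(text) for text in chunks]
    responses: List[Tuple[str, float]] = []
    total_pulls = 0
    for _ in range(rounds):
        # Pick the k chunks with the highest current scores.
        ranked = sorted(
            range(len(arms)),
            key=lambda i: ucb_score(arms[i], max(total_pulls, 1), c),
            reverse=True,
        )
        chosen = ranked[:k]
        response = generate([arms[i].text for i in chosen])
        reward = reward_fn(response)
        # Credit the same reward to every chunk used for this response.
        for i in chosen:
            arms[i].update(reward)
            total_pulls += 1
        responses.append((response, reward))
    return arms, responses


def build_preference_pair(
    responses: List[Tuple[str, float]],
) -> Optional[Dict[str, str]]:
    # Assumed pairing rule: best response as "chosen", worst as "rejected".
    ranked = sorted(responses, key=lambda item: item[1])
    if not ranked or ranked[0][1] == ranked[-1][1]:
        return None  # no reward gap, so no usable preference signal
    return {"chosen": ranked[-1][0], "rejected": ranked[0][0]}
```

With a real model, `generate` would prompt the LLM with the selected chunks plus the question, and `reward_fn` would score the response against a reference answer; the resulting `chosen`/`rejected` records are in the shape commonly used for DPO training data.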