Fugu-MT Paper Translation (Abstract): Convex Optimization for Alignment and Preference Learning on a Single GPU

Paper overview: Convex Optimization for Alignment and Preference Learning on a Single GPU

arxiv url: http://arxiv.org/abs/2605.23244v1
Date: Fri, 22 May 2026 05:25:00 GMT
Status: Translation complete
Last updated in system: 2026-05-25 17:29:20.210138
Title: Convex Optimization for Alignment and Preference Learning on a Single GPU
Title (reference translation): Convex Optimization for Alignment and Preference Learning on a Single GPU
Authors: Miria Feng, Mert Pilanci
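Abstract summary: Fine-tuning large language models to align with human preferences has driven the success of systems such as Gemini and ChatGPT. Direct Preference Optimization (DPO) offers a simpler alternative but has limitations such as inconsistent ranking accuracy and high dependence on GPU resources. This paper proposes the Convex Optimization for Alignment and Preference Learning Algorithm (COALA), a novel lightweight strategy with strong theoretical guarantees.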
Reference score (independently computed attention score): 52.997197698288936
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Fine-tuning large language models (LLMs) to align with human preferences has driven the success of systems such as Gemini and ChatGPT. However, approaches like Reinforcement Learning from Human Feedback (RLHF) remain computationally expensive and complex. Direct Preference Optimization (DPO) offers a simpler alternative but has limitations such as inconsistent ranking accuracy, high dependence on GPU resources, and expensive hyperparameter tuning. We propose the Convex Optimization for Alignment and Preference Learning Algorithm (COALA): a novel lightweight strategy with strong theoretical guarantees. By leveraging the convex optimization reformulation of neural networks, COALA eliminates the need for a reference model and obtains significant reduction in both training time and VRAM consumption, thus enabling efficient training on a single GPU. Experiments across four datasets--including a 26621-sample synthetic Educational Feedback dataset--and six models (including Llama-3.1-8B) demonstrate COALA's competitive performance and efficiency while utilizing as little as ~17.6% of DPO's total TFLOPs. COALA exhibits stable, monotonically increasing rewards and reaches peak margins in significantly shorter time in comparison to traditional methods such as DPO and ORPO. To the best of our knowledge, this is the first time convex optimization has been effectively applied to preference fine-tuning of LLMs.
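Abstract (reference translation): Fine-tuning large language models (LLMs) to align with human preferences has led to the success of systems such as Gemini and ChatGPT. However, approaches such as Reinforcement Learning from Human Feedback (RLHF) are computationally expensive and complex. Direct Preference Optimization (DPO) offers a simpler alternative, but it has limitations such as inconsistent ranking accuracy, high dependence on GPU resources, and expensive hyperparameter tuning. This paper proposes the Convex Optimization for Alignment and Preference Learning Algorithm (COALA), a novel lightweight strategy with strong theoretical guarantees. By leveraging the convex optimization reformulation of neural networks, COALA eliminates the need for a reference model and achieves a significant reduction in both training time and VRAM consumption, enabling efficient training on a single GPU. Experiments across four datasets, including a 26621-sample synthetic Educational Feedback dataset, and six models (including Llama-3.1-8B) demonstrate COALA's competitive performance and efficiency while using as little as ~17.6% of DPO's total TFLOPs. COALA exhibits stable, monotonically increasing rewards and reaches peak margins in significantly shorter time than traditional methods such as DPO and ORPO. To the best of the authors' knowledge, this is the first time convex optimization has been effectively applied to preference fine-tuning of LLMs.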
