Fugu-MT Paper Translation (Abstract): Param$Δ$ for Direct Weight Mixing: Post-Train Large Language Model at Zero Cost

Paper abstract: Param$Δ$ for Direct Weight Mixing: Post-Train Large Language Model at Zero Cost

arxiv url: http://arxiv.org/abs/2504.21023v1
Date: Wed, 23 Apr 2025 01:15:08 GMT
Status: Translation complete
Last updated in system: 2025-05-10 02:20:04.153371
Title: Param$Δ$ for Direct Weight Mixing: Post-Train Large Language Model at Zero Cost
Authors: Sheng Cao, Mingrui Wu, Karthik Prasad, Yuandong Tian, Zechun Liu
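Abstract summary: The post-training phase of large language models is essential for enhancing capabilities such as instruction-following, reasoning, and alignment with human preferences. This paper introduces $Param\Delta$, a novel method that streamlines post-training by transferring knowledge from an existing post-trained model to a newly updated base model with zero additional training.
Reference score (independently computed attention score): 40.38798099651626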
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The post-training phase of large language models is essential for enhancing capabilities such as instruction-following, reasoning, and alignment with human preferences. However, it demands extensive high-quality data and poses risks like overfitting, alongside significant computational costs due to repeated post-training and evaluation after each base model update. This paper introduces $Param\Delta$, a novel method that streamlines post-training by transferring knowledge from an existing post-trained model to a newly updated base model with ZERO additional training. By computing the difference between post-trained model weights ($\Theta_\text{post}$) and base model weights ($\Theta_\text{base}$), and adding this to the updated base model ($\Theta'_\text{base}$), we define the $Param\Delta$ Model as: $\Theta_{\text{Param}\Delta} = \Theta_\text{post} - \Theta_\text{base} + \Theta'_\text{base}$. This approach surprisingly equips the new base model with post-trained capabilities, achieving performance comparable to direct post-training. We analyzed Llama3, Llama3.1, Qwen, and DeepSeek-distilled models. Results indicate that the $Param\Delta$ Model effectively replicates traditional post-training. For example, the $Param\Delta$ Model obtained from the 70B Llama3-inst, Llama3-base, and Llama3.1-base models attains approximately 95\% of the Llama3.1-inst model's performance on average. $Param\Delta$ brings a new perspective on how to fully leverage models in the open-weight community, where checkpoints for base and instruct models are readily available and frequently updated, by providing a cost-free framework to accelerate the iterative cycle of model development.
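Abstract (reference translation): The post-training phase of large language models is essential for enhancing capabilities such as instruction-following, reasoning, and alignment with human preferences. However, it requires extensive high-quality data, carries risks such as overfitting, and incurs significant computational cost because post-training and evaluation must be repeated after each base model update. This paper introduces $Param\Delta$, a novel method that streamlines post-training by transferring knowledge from an existing post-trained model to a newly updated base model with zero additional training. By computing the difference between the post-trained model weights ($\Theta_\text{post}$) and the base model weights ($\Theta_\text{base}$), and adding it to the updated base model ($\Theta'_\text{base}$), the $Param\Delta$ Model is defined as: $\Theta_{\text{Param}\Delta} = \Theta_\text{post} - \Theta_\text{base} + \Theta'_\text{base}$. This approach equips the new base model with post-trained capabilities and achieves performance comparable to direct post-training. Analysis was carried out on Llama3, Llama3.1, Qwen, and DeepSeek-distilled models. The results indicate that the $Param\Delta$ Model effectively replicates traditional post-training. For example, the $Param\Delta$ Model obtained from the 70B Llama3-inst, Llama3-base, and Llama3.1-base models reaches approximately 95\% of the Llama3.1-inst model's performance on average. $Param\Delta$ offers a new perspective on how to fully leverage models in the open-weight community, where checkpoints for base and instruct models are readily available and frequently updated, by providing a cost-free framework that accelerates the iterative cycle of model development.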
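The weight arithmetic stated in the abstract, $\Theta_\text{post} - \Theta_\text{base} + \Theta'_\text{base}$, can be illustrated with a minimal sketch. This is not the authors' code: the function name `param_delta`, the `Weights` type, and the toy values are hypothetical. The sketch assumes all three checkpoints share the same parameter names and shapes, and uses plain Python lists where a real checkpoint would hold tensors.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat weights.
Weights = Dict[str, List[float]]


def param_delta(post: Weights, base: Weights, new_base: Weights) -> Weights:
    """Return post - base + new_base, computed parameter by parameter."""
    # Assumption: the three checkpoints have identical parameter names.
    if not (post.keys() == base.keys() == new_base.keys()):
        raise ValueError("checkpoints must share the same parameter names")
    merged: Weights = {}
    for name in post:
        p, b, n = post[name], base[name], new_base[name]
        # Assumption: matching parameters have the same number of elements.
        if not (len(p) == len(b) == len(n)):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [pi - bi + ni for pi, bi, ni in zip(p, b, n)]
    return merged


# Toy values, chosen to be exactly representable in binary floating point.
theta_base = {"w": [1.0, 2.0], "b": [0.5]}        # original base model
theta_post = {"w": [1.5, 1.0], "b": [0.75]}       # post-trained from theta_base
theta_new_base = {"w": [2.0, 4.0], "b": [1.0]}    # updated base model

theta_param_delta = param_delta(theta_post, theta_base, theta_new_base)
print(theta_param_delta)
```

No gradient step appears anywhere in the sketch, which is the sense in which the abstract describes the method as requiring zero additional training; the only cost is one elementwise pass over the weights.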
