Fugu-MT Paper Translation (Abstract): DaMoC: Efficiently Selecting the Optimal Large Language Model for Fine-tuning Domain Tasks Based on Data and Model Compression

Paper overview: DaMoC: Efficiently Selecting the Optimal Large Language Model for Fine-tuning Domain Tasks Based on Data and Model Compression

arxiv url: http://arxiv.org/abs/2509.01221v2
Date: Thu, 04 Sep 2025 09:30:16 GMT
Status: Translation complete
Last updated in system: 2025-09-05 11:58:39.453145
Title: DaMoC: Efficiently Selecting the Optimal Large Language Model for Fine-tuning Domain Tasks Based on Data and Model Compression
Title (reference translation): DaMoC: Selecting the Optimal Large Language Model for Fine-tuning Domain Tasks Based on Data and Model Compression
Authors: Wei Huang, Huang Wei, Yinggui Wang,
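Abstract summary: Large language models (LLMs) excel at general tasks but struggle with domain-specific ones, which require fine-tuning on specific data. We introduce a Data and Model Compression Framework (DaMoC) that addresses this challenge. We show that the optimal LLM can be selected while reducing training time approximately 20-fold.
Reference score (independently computed attention score): 7.1654056866441245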
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) excel in general tasks but struggle with domain-specific ones, requiring fine-tuning with specific data. With many open-source LLMs available, selecting the best model for fine-tuning on downstream tasks is challenging, and the central question is how to quickly identify the optimal LLM. We introduce a Data and Model Compression Framework (DaMoC) that addresses this challenge at two levels. 1) Data Level: We first establish a systematic categorization of data filtering methodologies for LLMs, classifying them into three distinct paradigms: (1) distribution-aware methods, (2) quality-aware methods, and (3) hybrid approaches considering both dimensions. Further, we increase the density of key tokens in the text to achieve token compression. Subsequently, we use an LLM to iteratively rewrite the text to optimize its expression. 2) Model Level: We use layer similarity scores to assess each layer's importance and remove those with lower importance. Then, we introduce a sparse merging paradigm to preserve as much of the original model's capability as possible. Extensive experiments on four datasets (medical Q&A, financial Q&A, general Q&A, and reading comprehension) show that we can select the optimal LLM while reducing training time approximately 20-fold.
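Abstract (reference translation): Large language models (LLMs) perform well on general tasks but struggle on domain-specific tasks and need to be fine-tuned on specific data. Many open-source LLMs are available, and choosing the best model to fine-tune for a downstream task is difficult; the main concern is how to identify the optimal LLM quickly. We present the Data and Model Compression Framework (DaMoC), which addresses this problem. 1) Data level: A systematic classification of data filtering methods for LLMs is established first, dividing them into three paradigms: (1) distribution-aware methods, (2) quality-aware methods, and (3) hybrid approaches that consider both dimensions. In addition, the density of key tokens in the text is increased to achieve token compression. An LLM is then used to iteratively rewrite the text and optimize its expression. 2) Model level: Layer similarity scores are used to assess the importance of each layer, and layers with lower importance are removed. A sparse merging paradigm is then introduced to preserve as much of the original model's capability as possible. Extensive experiments on four datasets (medical Q&A, financial Q&A, general Q&A, and reading comprehension) show that the optimal LLM can be selected while reducing training time approximately 20-fold.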
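
The model-level step in the abstract (score each layer by similarity, then drop the least important layers) can be illustrated with a small sketch. The abstract does not say how the similarity score is computed, so this example assumes cosine similarity between the hidden state entering a layer and the one leaving it, with importance defined as one minus that similarity. The function names and the toy hidden states are hypothetical, and the sparse merging step is not covered. It is an illustration of the general idea, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def layer_importance(hidden_states):
    """Score each layer by how much it changes its input.

    hidden_states[i] is the vector entering layer i and
    hidden_states[i + 1] is the vector leaving it.
    ASSUMPTION: importance = 1 - cosine similarity; the abstract
    does not specify the actual scoring rule.
    """
    return [
        1.0 - cosine_similarity(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]


def layers_to_keep(hidden_states, num_remove):
    """Return the indices of layers kept after removing the
    num_remove layers with the lowest importance scores."""
    scores = layer_importance(hidden_states)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    removed = set(ranked[:num_remove])
    return [i for i in range(len(scores)) if i not in removed]


# Toy hidden states for a hypothetical 3-layer model.
toy_states = [
    [1.0, 0.0],
    [0.0, 1.0],  # layer 0 rotates its input by 90 degrees
    [0.0, 2.0],  # layer 1 only rescales its input
    [1.0, 1.0],  # layer 2 rotates its input by 45 degrees
]
kept = layers_to_keep(toy_states, num_remove=1)
print(kept)
```

In this toy case layer 1 only rescales its input, so its output points in the same direction as its input and it receives the lowest score. In practice the hidden states would be averaged over many tokens from a calibration set rather than taken from a single vector.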

Related papers