Fugu-MT Paper Translation (Abstract): Neuron-Aware Data Selection In Instruction Tuning For Large Language Models

Paper overview: Neuron-Aware Data Selection In Instruction Tuning For Large Language Models

arxiv url: http://arxiv.org/abs/2603.13201v1
Date: Fri, 13 Mar 2026 17:39:03 GMT
Status: Translation complete
Last updated in system: 2026-03-16 17:38:12.227174
Title: Neuron-Aware Data Selection In Instruction Tuning For Large Language Models
Title (reference translation): Neuron-Aware Data Selection in Instruction Tuning for Large Language Models
Authors: Xin Chen, Junchao Wu, Shu Yang, Runzhe Zhan, Zeyu Wu, Min Yang, Shujian Huang, Lidia S. Chao, Derek F. Wong
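Abstract summary: Instruction tuning (IT) has proven to be an effective approach for unlocking the powerful capabilities of large language models (LLMs). Recent studies show that excessive IT data can degrade LLM performance, whereas carefully selecting a small subset of high-quality IT data can significantly improve their capabilities. We propose NAIT, a novel and efficient framework that identifies the most efficient data subset within an IT dataset.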
Reference score (independently computed attention score): 69.08560711834848
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Instruction Tuning (IT) has been proven to be an effective approach to unlock the powerful capabilities of large language models (LLMs). Recent studies indicate that excessive IT data can degrade LLM performance, while carefully selecting a small subset of high-quality IT data can significantly enhance model capabilities. Therefore, identifying the most efficient data subset within an IT dataset to effectively develop either specific or general abilities in LLMs has become a critical challenge. To address this, we propose a novel and efficient framework called NAIT. NAIT evaluates the impact of IT data on LLM performance by analyzing the similarity of neuron activation patterns between the IT dataset and the target domain capability. Specifically, NAIT captures neuron activation patterns from in-domain datasets of target domain capabilities to construct reusable and transferable neuron activation features. It then evaluates and selects optimal samples based on the similarity between candidate samples and the expected activation features of the target capabilities. Experimental results show that training on the 10% subset of Alpaca-GPT4 IT data selected by NAIT consistently outperforms, across various tasks, methods that rely on external advanced models or uncertainty-based features. Our findings also reveal the transferability of neuron activation features across different capabilities of LLMs. In particular, IT data with more logical reasoning and programmatic features possesses strong general transferability, enabling models to develop stronger capabilities across multiple tasks, while a stable core subset of data is sufficient to consistently activate fundamental model capabilities and universally improve performance across diverse tasks.
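Abstract (reference translation): Instruction tuning (IT) has proven to be an effective approach for unlocking the powerful capabilities of large language models (LLMs). Recent studies show that excessive IT data can degrade LLM performance, whereas carefully selecting a small subset of high-quality IT data can significantly improve their capabilities. Identifying the most efficient data subset within an IT dataset, in order to effectively develop specific or general abilities in LLMs, has therefore become an important challenge. In this work, we propose a novel and efficient framework called NAIT. NAIT evaluates the impact of IT data on LLM performance by analyzing the similarity of neuron activation patterns between the IT dataset and the target domain capability. Specifically, NAIT captures neuron activation patterns from in-domain datasets of the target domain capabilities to build reusable and transferable neuron activation features. It then evaluates and selects optimal samples based on the similarity between candidate samples and the expected activation features of the target capabilities. Experiments show that training on the 10% subset of Alpaca-GPT4 IT data selected by NAIT consistently outperforms, across various tasks, methods that rely on external advanced models or uncertainty-based features. The results also reveal that neuron activation features transfer across different LLM capabilities. In particular, IT data with more logical reasoning and programmatic features has strong general transferability, allowing models to develop stronger capabilities across multiple tasks, while a stable core subset of data is sufficient to consistently activate fundamental model capabilities and universally improve performance on diverse tasks.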
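
The selection procedure described in the abstract (build activation features from in-domain data, score candidates by similarity, keep the top 10%) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how activation features are aggregated or which similarity measure is used, so the mean-activation feature, the cosine similarity, and the function names `build_target_feature`, `cosine_scores`, and `select_top_fraction` are all assumptions made for illustration. Random arrays stand in for real neuron activations, which in practice would have to be extracted from the LLM.

```python
import numpy as np


def build_target_feature(in_domain_acts: np.ndarray) -> np.ndarray:
    """Aggregate in-domain activations into one feature vector.

    Assumption: a simple per-neuron mean over samples; the abstract does
    not specify the aggregation.
    in_domain_acts has shape (n_samples, n_neurons).
    """
    return in_domain_acts.mean(axis=0)


def cosine_scores(candidate_acts: np.ndarray, target_feature: np.ndarray,
                  eps: float = 1e-12) -> np.ndarray:
    """Cosine similarity between each candidate row and the target feature.

    Assumption: cosine similarity is used as the similarity measure.
    """
    cand_norm = np.linalg.norm(candidate_acts, axis=1)
    tgt_norm = np.linalg.norm(target_feature)
    return candidate_acts @ target_feature / (cand_norm * tgt_norm + eps)


def select_top_fraction(scores: np.ndarray, fraction: float = 0.10) -> np.ndarray:
    """Return indices of the highest-scoring candidates (top `fraction`)."""
    k = max(1, int(round(len(scores) * fraction)))
    return np.argsort(-scores, kind="stable")[:k]


# Toy data: random stand-ins for neuron activations (hypothetical sizes).
rng = np.random.default_rng(0)
in_domain = rng.normal(size=(32, 64))     # 32 in-domain samples, 64 neurons
candidates = rng.normal(size=(200, 64))   # 200 candidate IT samples

target = build_target_feature(in_domain)
scores = cosine_scores(candidates, target)
selected = select_top_fraction(scores, fraction=0.10)
```

Because the target feature is computed once and only compared against candidates afterwards, the same vector could in principle be reused to score a different candidate pool, which is one reading of the "reusable and transferable" features mentioned in the abstract.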
