Fugu-MT Paper Translation (Abstract): Exploring the Reasoning Depth of Small Language Models in Software Architecture: A Multidimensional Evaluation Framework Towards Software Engineering 2.0

Paper overview: Exploring the Reasoning Depth of Small Language Models in Software Architecture: A Multidimensional Evaluation Framework Towards Software Engineering 2.0

arxiv url: http://arxiv.org/abs/2603.07091v1
Date: Sat, 07 Mar 2026 08:00:48 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:13.805322
Title: Exploring the Reasoning Depth of Small Language Models in Software Architecture: A Multidimensional Evaluation Framework Towards Software Engineering 2.0
Authors: Ha Vo, Nhut Tran, Khang Vo, Phat T. Tran-Truong, Son Ha
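Abstract summary: This study benchmarks 10 state-of-the-art Small Language Models (SLMs) on Architectural Decision Record generation. Models above the 3B-parameter threshold demonstrate robust zero-shot capabilities, while sub-2B models show BERTScore gains from Fine-Tuning. High semantic diversity in off-the-shelf small models often correlates with hallucination rather than productive exploration.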
Reference score (site's own attention metric): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In the era of "Software Engineering 2.0" (SE 2.0), where intelligent agents collaborate with human engineers, Generative AI is advancing beyond code generation into Software Architecture (SA). While Large Language Models (LLMs) demonstrate superior capabilities, computational costs and data privacy concerns drive interest in Small Language Models (SLMs) with fewer than 7 billion parameters. However, the reasoning limits of these resource-constrained models remain unexplored. This study benchmarks 10 state-of-the-art SLMs on Architectural Decision Records generation, introducing a multi-dimensional framework evaluating Technical Compliance and Semantic Diversity. Our empirical results reveal a significant reasoning gap: models above the 3B-parameter threshold demonstrate robust zero-shot capabilities, while sub-2B models show the strongest BERTScore gains from Fine-Tuning, though compliance improvements are not guaranteed. Contrary to assumptions regarding context saturation, Few-Shot prompting serves as a highly effective calibration mechanism for select mid-sized models with short context windows. Furthermore, high semantic diversity in off-the-shelf small models often correlates with hallucination rather than productive exploration. These findings establish a rigorous baseline for deploying sustainable, locally hosted architectural assistants.
