Fugu-MT Paper Translation (Abstract): TREAT: A Code LLMs Trustworthiness / Reliability Evaluation and Testing Framework

Paper overview: TREAT: A Code LLMs Trustworthiness / Reliability Evaluation and Testing Framework

arxiv url: http://arxiv.org/abs/2510.17163v1
Date: Mon, 20 Oct 2025 05:05:00 GMT
Status: Translation complete
Last updated in system: 2025-10-25 00:56:39.311692
Title: TREAT: A Code LLMs Trustworthiness / Reliability Evaluation and Testing Framework
Title (reference translation): TREAT: A Trustworthiness / Reliability Evaluation and Testing Framework for LLMs
Authors: Shuzheng Gao, Eric John Li, Man Ho Lam, Jingyu Xiao, Yuxuan Wan, Chaozheng Wang, Ng Man Tik, Michael R. Lyu
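Abstract summary: This paper proposes an evaluation framework that provides a holistic assessment of model performance on code intelligence tasks. The framework addresses key limitations in existing approaches with four main improvements. Based on this framework, 26 state-of-the-art models are evaluated, revealing their strengths and limitations.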
Reference score (independently computed attention score): 37.14734285161928
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large foundation models are fundamentally transforming the software engineering landscape, demonstrating exceptional capabilities across diverse tasks such as code generation, debugging, and testing. Despite this rapid progress, a significant gap remains in how to comprehensively evaluate these models' trustworthiness in real-world software engineering scenarios. Existing benchmarks suffer from limited task scope and fail to incorporate critical evaluation aspects such as the robustness and reliability of models. To bridge this gap, we present an evaluation framework called TREAT (Code LLMs Trustworthiness / Reliability Evaluation And Testing) that provides a holistic assessment of model performance in code intelligence tasks. Our evaluation framework addresses key limitations in existing approaches with four main improvements: (1) Multi-Task Holistic Evaluation that spans diverse software engineering activities rather than limited coding tasks; (2) Multi-Language and Multi-Modality Assessment that extends beyond traditional single-language, text-only benchmarks to include multi-modality coding tasks; (3) Robustness Assessment that evaluates model reliability under semantics-preserving code transformations; and (4) Rigorous Evaluation Methodology that enhances the trustworthiness of evaluation results through diverse evaluation prompts and adaptive solution extraction. Based on this evaluation framework, we assess 26 state-of-the-art models and uncover both their strengths and limitations, yielding several key insights: (1) Current models show substantial performance variation across programming tasks; (2) Multi-modal language models demonstrate specific performance limitations in UI code generation and editing;
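Abstract (reference translation): Large foundation models are fundamentally changing the software engineering landscape and demonstrate exceptional capabilities across diverse tasks such as code generation, debugging, and testing. Despite this rapid progress, a significant gap remains in how to comprehensively evaluate the trustworthiness of these models in real-world software engineering scenarios. Existing benchmarks have limited task scope and fail to incorporate critical evaluation aspects such as model robustness and reliability. To bridge this gap, we propose an evaluation framework called TREAT (Code LLMs Trustworthiness / Reliability Evaluation And Testing) that provides a holistic assessment of model performance on code intelligence tasks. The framework addresses key limitations of existing approaches in four respects: (1) multi-task holistic evaluation that spans diverse software engineering activities rather than limited coding tasks; (2) multi-language and multi-modality assessment that extends beyond traditional single-language, text-only benchmarks to include multi-modal coding tasks; (3) robustness assessment that evaluates model reliability under semantics-preserving code transformations; and (4) a rigorous evaluation methodology that increases the trustworthiness of evaluation results through diverse evaluation prompts and adaptive solution extraction. Based on this framework, we evaluate 26 state-of-the-art models, uncover both their strengths and weaknesses, and obtain several key insights: (1) current models show substantial performance variation across programming tasks; and (2) multi-modal language models show specific performance limitations in UI code generation and editing.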

Related papers