Fugu-MT Paper Translation (Abstract): LLM-as-a-Judge for Reference-less Automatic Code Validation and Refinement for Natural Language to Bash in IT Automation

Paper summary: LLM-as-a-Judge for Reference-less Automatic Code Validation and Refinement for Natural Language to Bash in IT Automation

arxiv url: http://arxiv.org/abs/2506.11237v1
Date: Thu, 12 Jun 2025 19:15:05 GMT
Status: Translation complete
Last updated in system: 2025-06-16 17:50:49.555166
Title: LLM-as-a-Judge for Reference-less Automatic Code Validation and Refinement for Natural Language to Bash in IT Automation
Authors: Ngoc Phuoc An Vo, Brent Paulovicks, Vadim Sheinin
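Abstract summary: It is important to verify whether the generated code for a remediation action is syntactically and semantically correct. This work focuses on enhancing LLM-as-a-Judge using bidirectional functionality matching and logic representation. The results show high accuracy and agreement with execution-based evaluation.
Reference score (independently computed attention score): 0.9176056742068815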
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In an effort to automatically evaluate and select the best model and improve code quality for automatic incident remediation in IT Automation, it is crucial to verify if the generated code for remediation action is syntactically and semantically correct and whether it can be executed correctly as intended. There are three approaches: 1) conventional methods use surface form similarity metrics (token match, exact match, etc.) which have numerous limitations, 2) execution-based evaluation focuses more on code functionality based on pass/fail judgments for given test-cases, and 3) LLM-as-a-Judge employs LLMs for automated evaluation to judge if it is a correct answer for a given problem based on pre-defined metrics. In this work, we focused on enhancing LLM-as-a-Judge using bidirectional functionality matching and logic representation for reference-less automatic validation and refinement for Bash code generation to select the best model for automatic incident remediation in IT Automation. We used execution-based evaluation as ground-truth to evaluate our LLM-as-a-Judge metrics. Results show high accuracy and agreement with execution-based evaluation (and up to 8% over baseline). Finally, we built Reflection code agents to utilize judgments and feedback from our evaluation metrics which achieved significant improvement (up to 24% increase in accuracy) for automatic code refinement.
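Abstract (reference translation): In IT Automation, in order to automatically evaluate and select the best model and to improve code quality for automatic incident remediation, it is essential to verify whether the generated code for a remediation action is syntactically and semantically correct and whether it can be executed correctly as intended. There are three approaches: 1) conventional methods use surface form similarity metrics (token match, exact match, etc.), which have numerous limitations; 2) execution-based evaluation focuses on code functionality based on pass/fail judgments for given test cases; 3) LLM-as-a-Judge uses LLMs for automated evaluation to judge whether an answer is correct for a given problem based on pre-defined metrics. This work focuses on enhancing LLM-as-a-Judge with bidirectional functionality matching and logic representation, performing reference-less automatic validation and refinement for Bash code generation in order to select the best model for automatic incident remediation in IT Automation. Execution-based evaluation was used as ground truth to evaluate the LLM-as-a-Judge metrics. The results show high accuracy and agreement with execution-based evaluation (up to 8% over the baseline). Finally, Reflection code agents were built to utilize the judgments and feedback from the evaluation metrics.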
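The abstract's first approach, surface form similarity, can be illustrated with a minimal sketch. This is not the paper's implementation: the helper names `exact_match` and `token_f1`, the whitespace/shell tokenization via `shlex`, and the three sample commands are all assumptions chosen for illustration. The sketch shows one kind of limitation such metrics can have: a functionally equivalent Bash command can score lower than a command that does the opposite.

```python
import shlex
from collections import Counter


def exact_match(candidate: str, reference: str) -> bool:
    """True only if the two commands are identical after trimming whitespace."""
    return candidate.strip() == reference.strip()


def token_f1(candidate: str, reference: str) -> float:
    """Token-level F1 over shell-style tokens, using multiset overlap."""
    cand_tokens = shlex.split(candidate)
    ref_tokens = shlex.split(reference)
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sample commands (not taken from the paper).
reference = "grep error /var/log/syslog"
equivalent = "cat /var/log/syslog | grep error"   # same matching lines
inverted = "grep -v error /var/log/syslog"        # prints the NON-matching lines

for name, cmd in [("equivalent", equivalent), ("inverted", inverted)]:
    print(name, exact_match(cmd, reference), round(token_f1(cmd, reference), 3))
```

Under these assumptions, both candidates fail exact match, and the inverted command receives a higher token F1 than the equivalent pipeline, because it shares more tokens with the reference. Execution-based evaluation and LLM-as-a-Judge, as described in the abstract, aim to judge functionality rather than token overlap.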

Related papers