Fugu-MT Paper Translation (Abstract): Beyond Classification: Evaluating LLMs for Fine-Grained Automatic Malware Behavior Auditing

Paper overview: Beyond Classification: Evaluating LLMs for Fine-Grained Automatic Malware Behavior Auditing

arxiv url: http://arxiv.org/abs/2509.14335v1
Date: Wed, 17 Sep 2025 18:05:21 GMT
Status: Translation complete
Last updated in system: 2025-09-19 17:26:52.936098
Title: Beyond Classification: Evaluating LLMs for Fine-Grained Automatic Malware Behavior Auditing
Title (reference translation): Beyond Classification: Evaluating LLMs for Fine-Grained Automatic Malware Behavior Auditing
Authors: Xinran Zheng, Xingzhi Qian, Yiling He, Shuo Yang, Lorenzo Cavallaro
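Abstract summary: MalEval is a comprehensive framework for fine-grained Android malware auditing. Seven widely used LLMs are evaluated on a curated dataset of recent malware and misclassified benign apps. MalEval reveals both promising potential and critical limitations across audit stages.
Reference score (independently computed attention score): 14.680014912507774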
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Automated malware classification has achieved strong detection performance. Yet, malware behavior auditing seeks causal and verifiable explanations of malicious activities -- essential not only to reveal what malware does but also to substantiate such claims with evidence. This task is challenging, as adversarial intent is often hidden within complex, framework-heavy applications, making manual auditing slow and costly. Large Language Models (LLMs) could help address this gap, but their auditing potential remains largely unexplored due to three limitations: (1) scarce fine-grained annotations for fair assessment; (2) abundant benign code obscuring malicious signals; and (3) unverifiable, hallucination-prone outputs undermining attribution credibility. To close this gap, we introduce MalEval, a comprehensive framework for fine-grained Android malware auditing, designed to evaluate how effectively LLMs support auditing under real-world constraints. MalEval provides expert-verified reports and an updated sensitive API list to mitigate ground truth scarcity and reduce noise via static reachability analysis. Function-level structural representations serve as intermediate attribution units for verifiable evaluation. Building on this, we define four analyst-aligned tasks -- function prioritization, evidence attribution, behavior synthesis, and sample discrimination -- together with domain-specific metrics and a unified workload-oriented score. We evaluate seven widely used LLMs on a curated dataset of recent malware and misclassified benign apps, offering the first systematic assessment of their auditing capabilities. MalEval reveals both promising potential and critical limitations across audit stages, providing a reproducible benchmark and foundation for future research on LLM-enhanced malware behavior auditing. MalEval is publicly available at https://github.com/ZhengXR930/MalEval.git
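Abstract (reference translation): Automated malware classification has achieved strong detection performance. Malware behavior auditing, however, seeks causal and verifiable explanations of malicious activities, which is essential not only to reveal what malware does but also to substantiate such claims with evidence. The task is difficult because adversarial intent is often hidden within complex, framework-heavy applications, making manual auditing slow and costly. Large Language Models (LLMs) could help address this gap, but their auditing potential remains largely unexplored because of three limitations: (1) scarce fine-grained annotations for fair assessment; (2) abundant benign code that obscures malicious signals; and (3) unverifiable, hallucination-prone outputs that undermine the credibility of attribution. To close this gap, we introduce MalEval, a comprehensive framework for fine-grained Android malware auditing, designed to evaluate how effectively LLMs support auditing under real-world constraints. MalEval provides expert-verified reports and an updated sensitive API list to mitigate ground-truth scarcity, and it reduces noise through static reachability analysis. Function-level structural representations serve as intermediate attribution units for verifiable evaluation. Building on this, we define four analyst-aligned tasks (function prioritization, evidence attribution, behavior synthesis, and sample discrimination) together with domain-specific metrics and a unified workload-oriented score. We evaluate seven widely used LLMs on a curated dataset of recent malware and misclassified benign apps, offering the first systematic assessment of their auditing capabilities. MalEval reveals both promising potential and critical limitations across audit stages, providing a reproducible benchmark and a foundation for future research on LLM-enhanced malware behavior auditing. MalEval is publicly available at https://github.com/ZhengXR930/MalEval.git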

Related papers