Fugu-MT Paper Translation (Summary): Zipf-Gramming: Scaling Byte N-Grams Up to Production Sized Malware Corpora

Paper summary: Zipf-Gramming: Scaling Byte N-Grams Up to Production Sized Malware Corpora

arxiv url: http://arxiv.org/abs/2511.13808v1
Date: Mon, 17 Nov 2025 17:46:23 GMT
Status: translation complete
Last updated in system: 2025-11-19 16:23:52.741426
Title: Zipf-Gramming: Scaling Byte N-Grams Up to Production Sized Malware Corpora
Title (reference translation): Zipf-Gramming: Scaling Byte N-Grams Up to Large-Scale Malware Corpora
Authors: Edward Raff, Ryan R. Curtin, Derek Everett, Robert J. Joyce, James Holt,
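Abstract summary: A new top-k n-gram extractor is up to 35 times faster than the previous best alternative. Using the new Zipf-Gramming algorithm, the authors scale up their production training set and improve AUC at detecting new malware by up to 30%.
Reference score (independently computed attention score): 38.289536077642914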
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A classifier using byte n-grams as features is the only approach we have found fast enough to meet requirements in size (sub 2 MB), speed (multiple GB/s), and latency (sub 10 ms) for deployment in numerous malware detection scenarios. However, we've consistently found that 6-8 grams achieve the best accuracy on our production deployments but have been unable to deploy regularly updated models due to the high cost of finding the top-k most frequent n-grams over terabytes of executable programs. Because the Zipfian distribution well models the distribution of n-grams, we exploit its properties to develop a new top-k n-gram extractor that is up to $35\times$ faster than the previous best alternative. Using our new Zipf-Gramming algorithm, we are able to scale up our production training set and obtain up to 30% improvement in AUC at detecting new malware. We show theoretically and empirically that our approach will select the top-k items with little error and the interplay between theory and engineering required to achieve these results.
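Abstract (reference translation): A classifier that uses byte n-grams as features is the only approach we have found that is fast enough to meet the size (under 2 MB), speed (multiple GB/s), and latency (under 10 ms) requirements for deployment in numerous malware detection scenarios. However, although we have consistently found that 6-8 grams achieve the best accuracy in our production deployments, we have been unable to deploy regularly updated models because of the high cost of finding the top-k most frequent n-grams over terabytes of executable programs. Because the Zipfian distribution models the distribution of n-grams well, we exploit its properties to develop a new top-k n-gram extractor that is up to 35 times faster than the previous best alternative. Using the new Zipf-Gramming algorithm, we can scale up our production training set and improve AUC at detecting new malware by up to 30%. We show, theoretically and empirically, that our approach selects the top-k items with little error, and we describe the interplay between theory and engineering required to achieve these results.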
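
To make the "top-k most frequent n-grams" task in the abstract concrete, here is a minimal Python sketch of the naive exact approach: slide an n-byte window over each file, count every window, and keep the k most common. This is not the Zipf-Gramming algorithm, whose details the abstract does not give; it is only an illustrative baseline, and the function names `byte_ngrams` and `top_k_ngrams` are hypothetical. Holding every distinct 6-8 gram in memory, as this sketch does, is the kind of cost that becomes prohibitive over terabytes of executables.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def byte_ngrams(data: bytes, n: int) -> Iterator[bytes]:
    """Yield every contiguous n-byte window of `data` (hypothetical helper)."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def top_k_ngrams(files: Iterable[bytes], n: int, k: int) -> List[Tuple[bytes, int]]:
    """Exact top-k by counting every n-gram in memory.

    Illustrative baseline only, NOT the paper's method: memory grows with
    the number of distinct n-grams seen across the whole corpus.
    """
    counts: Counter = Counter()
    for data in files:
        counts.update(byte_ngrams(data, n))
    return counts.most_common(k)


if __name__ == "__main__":
    # Two tiny in-memory "files" stand in for executables.
    corpus = [b"abababab", b"abcabc"]
    print(top_k_ngrams(corpus, n=2, k=2))  # [(b'ab', 6), (b'ba', 3)]
```

In a classifier like the one the abstract describes, the selected n-grams would then serve as the feature vocabulary; how the paper avoids the full count is left to the paper itself.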

Related papers