Fugu-MT Paper Translation (Abstract): How Different Tokenization Algorithms Impact LLMs and Transformer Models for Binary Code Analysis

Paper overview: How Different Tokenization Algorithms Impact LLMs and Transformer Models for Binary Code Analysis

arxiv url: http://arxiv.org/abs/2511.03825v1
Date: Wed, 05 Nov 2025 19:45:26 GMT
Status: Translation complete
System update date: 2025-11-07 20:17:53.200391
Title: How Different Tokenization Algorithms Impact LLMs and Transformer Models for Binary Code Analysis
Authors: Ahmed Mostafa, Raisul Arefin Nahid, Samuel Mulder
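Abstract summary: Despite its significance, tokenization in the context of assembly code remains an underexplored area. We explore preprocessing customization options and pre-tokenization rules tailored to the unique characteristics of assembly code. We compare tokenizers based on tokenization efficiency, vocabulary compression, and representational fidelity for assembly code.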
Reference score (independently calculated attention score): 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Tokenization is fundamental in assembly code analysis, impacting intrinsic characteristics like vocabulary size, semantic coverage, and extrinsic performance in downstream tasks. Despite its significance, tokenization in the context of assembly code remains an underexplored area. This study aims to address this gap by evaluating the intrinsic properties of Natural Language Processing (NLP) tokenization models and parameter choices, such as vocabulary size. We explore preprocessing customization options and pre-tokenization rules tailored to the unique characteristics of assembly code. Additionally, we assess their impact on downstream tasks like function signature prediction -- a critical problem in binary code analysis. To this end, we conduct a thorough study on various tokenization models, systematically analyzing their efficiency in encoding assembly instructions and capturing semantic nuances. Through intrinsic evaluations, we compare tokenizers based on tokenization efficiency, vocabulary compression, and representational fidelity for assembly code. Using state-of-the-art pre-trained models such as the decoder-only Large Language Model (LLM) Llama 3.2, the encoder-only transformer BERT, and the encoder-decoder model BART, we evaluate the effectiveness of these tokenizers across multiple performance metrics. Preliminary findings indicate that tokenizer choice significantly influences downstream performance, with intrinsic metrics providing partial but incomplete predictability of extrinsic evaluation outcomes. These results reveal complex trade-offs between intrinsic tokenizer properties and their utility in practical assembly code tasks. Ultimately, this study provides valuable insights into optimizing tokenization models for low-level code analysis, contributing to the robustness and scalability of Natural Language Model (NLM)-based binary analysis workflows.
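
As a rough illustration of the kind of intrinsic measurement the abstract mentions (vocabulary size and tokens per instruction under a pre-tokenization rule), here is a minimal Python sketch. It is not the authors' method: the regular expression, the `<IMM>` placeholder, and the function names `pre_tokenize` and `intrinsic_stats` are all hypothetical choices made for this example, and the sample instructions are invented x86-style lines.

```python
import re
from collections import Counter

# Hypothetical pre-tokenization rule (an assumption, not taken from the paper):
# split an instruction into hex/decimal immediates, identifiers such as
# mnemonics and registers, and single punctuation characters.
TOKEN_RE = re.compile(r"0x[0-9a-fA-F]+|\d+|[A-Za-z_.][A-Za-z0-9_.]*|[\[\],+\-*:]")
IMMEDIATE_RE = re.compile(r"0x[0-9a-fA-F]+|\d+")


def pre_tokenize(instruction, normalize_immediates=True):
    """Split one assembly instruction into coarse tokens."""
    tokens = TOKEN_RE.findall(instruction)
    if normalize_immediates:
        # Replace every numeric literal with one placeholder so that
        # distinct constants do not inflate the vocabulary.
        tokens = ["<IMM>" if IMMEDIATE_RE.fullmatch(t) else t for t in tokens]
    return tokens


def intrinsic_stats(instructions, normalize_immediates=True):
    """Return vocabulary size and mean tokens per instruction."""
    counts = Counter()
    total_tokens = 0
    for instruction in instructions:
        tokens = pre_tokenize(instruction, normalize_immediates)
        counts.update(tokens)
        total_tokens += len(tokens)
    return {
        "vocab_size": len(counts),
        "tokens_per_instruction": total_tokens / len(instructions),
    }


if __name__ == "__main__":
    sample = ["mov eax, 0x10", "add eax, 0x20", "mov ebx, [eax+4]"]
    print(intrinsic_stats(sample, normalize_immediates=True))
    print(intrinsic_stats(sample, normalize_immediates=False))
```

Comparing the two calls shows how a single preprocessing choice (normalizing immediates) changes the vocabulary size while leaving the token count per instruction unchanged. Whether such an intrinsic difference translates into better downstream results is exactly the kind of question the abstract says is only partially predictable.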


Related papers