Fugu-MT Paper Translation (Abstract): Black-box Context-free Grammar Inference for Readable & Natural Grammars

Paper overview: Black-box Context-free Grammar Inference for Readable & Natural Grammars

arxiv url: http://arxiv.org/abs/2509.26616v1
Date: Tue, 30 Sep 2025 17:54:25 GMT
Status: Translation complete
Last updated in system: 2025-10-01 17:09:04.652167
Title: Black-box Context-free Grammar Inference for Readable & Natural Grammars
Title (reference translation): Black-box context-free grammar inference for readable and natural grammars
Authors: Mohammad Rifat Arefin, Shanto Rahman, Christoph Csallner
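Abstract summary: Existing tools such as Arvada, TreeVada, and Kedavra struggle with scalability, readability, and accuracy on large, complex languages. This paper introduces NatGI, a novel LLM-guided grammar inference framework. The authors show that NatGI consistently outperforms strong baselines in terms of F1 score.
Reference score (independently computed attention metric): 4.995853115126354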
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Black-box context-free grammar inference is crucial for program analysis, reverse engineering, and security, yet existing tools such as Arvada, TreeVada, and Kedavra struggle with scalability, readability, and accuracy on large, complex languages. We present NatGI, a novel LLM-guided grammar inference framework that extends TreeVada's parse tree recovery with three key innovations: bracket-guided bubble exploration, LLM-driven bubble generation and non-terminal labeling, and hierarchical delta debugging (HDD) for systematic tree simplification. Bracket-guided exploration leverages syntactic cues such as parentheses to propose well-structured grammar fragments, while LLM guidance produces meaningful non-terminal names and selects more promising merges. Finally, HDD incrementally reduces unnecessary rules, which makes the grammars both compact and interpretable. In our experiments, we evaluate NatGI on a comprehensive benchmark suite ranging from small languages to larger ones such as lua, c, and mysql. Our results show that NatGI consistently outperforms strong baselines in terms of F1 score. On average, NatGI achieves an F1 score of 0.57, which is 25pp (percentage points) higher than the best-performing baseline, TreeVada. In the case of interpretability, our generated grammars perform significantly better than those produced by existing approaches. Leveraging LLM-based node renaming and bubble exploration, NatGI produces rules with meaningful non-terminal names and compact structures that align more closely with human intuition. As a result, developers and researchers can achieve higher accuracy while still being able to easily inspect, verify, and reason about the structure and semantics of the induced grammars.
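Abstract (reference translation): Black-box context-free grammar inference is essential for program analysis, reverse engineering, and security, but existing tools such as Arvada, TreeVada, and Kedavra struggle with scalability, readability, and accuracy on large, complex languages. This paper introduces NatGI, a novel LLM-guided grammar inference framework built on three key innovations: bracket-guided bubble exploration, LLM-driven bubble generation and non-terminal labeling, and hierarchical delta debugging (HDD). Bracket-guided exploration uses syntactic cues such as parentheses to propose well-structured grammar fragments, while LLM guidance produces meaningful non-terminal names and selects more promising merges. Finally, HDD incrementally removes unnecessary rules, making the grammars both compact and interpretable. In the experiments, NatGI is evaluated on a comprehensive benchmark suite ranging from small languages to larger ones such as lua, c, and mysql. The results show that NatGI consistently outperforms strong baselines in terms of F1 score. On average, NatGI achieves an F1 score of 0.57, which is 25pp (percentage points) higher than the best-performing baseline, TreeVada. In terms of interpretability, the generated grammars are better than those produced by existing approaches. Leveraging LLM-based node renaming and bubble exploration, NatGI produces rules with meaningful non-terminal names and compact structures that align more closely with human intuition. As a result, developers and researchers can achieve higher accuracy while still being able to easily inspect, verify, and reason about the structure and semantics of the inferred grammars.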
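
The abstract says that bracket-guided exploration uses cues such as parentheses to propose well-structured grammar fragments, but it does not spell out the procedure. The following is only a minimal sketch of the general idea, assuming that a candidate "bubble" is a token span enclosed by a matching bracket pair. The function name `bracketed_spans` and the `BRACKET_PAIRS` table are hypothetical and are not taken from NatGI.

```python
from typing import List, Tuple

# Assumed bracket pairs for illustration; the set NatGI actually uses
# is not stated in the abstract.
BRACKET_PAIRS = {"(": ")", "[": "]", "{": "}"}


def bracketed_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs (end exclusive) for every token
    span that begins with an opening bracket and ends with its matching
    closing bracket. Inner spans are reported before outer ones."""
    closers = set(BRACKET_PAIRS.values())
    stack: List[Tuple[str, int]] = []   # (opening bracket, its index)
    spans: List[Tuple[int, int]] = []
    for i, tok in enumerate(tokens):
        if tok in BRACKET_PAIRS:
            stack.append((tok, i))
        elif tok in closers:
            # Pair a closer only with the matching opener on top of the
            # stack; unmatched or mismatched closers are skipped.
            if stack and BRACKET_PAIRS[stack[-1][0]] == tok:
                _, start = stack.pop()
                spans.append((start, i + 1))
    return spans


if __name__ == "__main__":
    tokens = list("f(a[0])")
    for start, end in bracketed_spans(tokens):
        print("".join(tokens[start:end]))
```

In a black-box setting, each such span would still have to be validated against the oracle (the parser under inference) before being grouped under a new non-terminal; that step is omitted here.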

Related papers