Inferring Input Grammars from Code with Symbolic Parsing
- URL: http://arxiv.org/abs/2503.08486v1
- Date: Tue, 11 Mar 2025 14:40:56 GMT
- Title: Inferring Input Grammars from Code with Symbolic Parsing
- Authors: Leon Bettscheider, Andreas Zeller
- Abstract summary: Common test generation techniques rely on sample inputs, which are abstracted into matching grammars and/or evolved guided by test coverage. In this work, we present the first technique for symbolically and automatically mining input grammars from the code of recursive descent parsers. The resulting grammars cover the entire input space, allowing for comprehensive and effective test generation, reverse engineering, and documentation.
- Score: 12.567395326774754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating effective test inputs for a software system requires that these inputs be valid, as they will otherwise be rejected without reaching actual functionality. In the absence of a specification for the input language, common test generation techniques rely on sample inputs, which are abstracted into matching grammars and/or evolved guided by test coverage. However, if sample inputs miss features of the input language, the chances of generating these features randomly are slim. In this work, we present the first technique for symbolically and automatically mining input grammars from the code of recursive descent parsers. So far, the complexity of parsers has made such a symbolic analysis challenging to impossible. Our realization of the symbolic parsing technique overcomes these challenges by (1) associating each parser function parse_ELEM() with a nonterminal <ELEM>; (2) limiting recursive calls and loop iterations, such that a symbolic analysis of parse_ELEM() needs to consider only a finite number of paths; and (3) for each path, create an expansion alternative for <ELEM>. Being purely static, symbolic parsing does not require seed inputs; as it mitigates path explosion, it scales to complex parsers. Our evaluation promises symbolic parsing to be highly accurate. Applied on parsers for complex languages such as TINY-C or JSON, our STALAGMITE implementation extracts grammars with an accuracy of 99--100%, widely improving over the state of the art despite requiring only the program code and no input samples. The resulting grammars cover the entire input space, allowing for comprehensive and effective test generation, reverse engineering, and documentation.
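To make the three steps of the abstract concrete, here is a minimal, hypothetical recursive descent parser and the grammar a bounded symbolic analysis could derive from it. The function names, grammar, and language are illustrative assumptions, not taken from the STALAGMITE implementation.

```python
# Illustrative sketch only (hypothetical mini-parser, not STALAGMITE itself).
# Each parse_ELEM() function corresponds to a nonterminal <ELEM>; each
# feasible path through a function becomes one expansion alternative.

def parse_digit(s: str, i: int) -> int:
    """<digit> ::= "0" | "1" -- one path per character alternative."""
    if i < len(s) and s[i] in "01":
        return i + 1
    raise ValueError("expected digit")

def parse_number(s: str, i: int) -> int:
    """<number> ::= <digit> <number> | <digit>.

    A symbolic analysis would bound this recursion (e.g. unroll it once),
    so only finitely many paths need to be explored.
    """
    i = parse_digit(s, i)
    if i < len(s) and s[i] in "01":
        return parse_number(s, i)
    return i

# The grammar such a bounded path enumeration could derive:
GRAMMAR = {
    "<number>": ["<digit> <number>", "<digit>"],
    "<digit>": ['"0"', '"1"'],
}
```

Note how no seed input is needed to write down GRAMMAR: the alternatives fall out of the branches of the parser code itself.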
Related papers
- Inferring Attributed Grammars from Parser Implementations [1.0217990949413291]
We introduce a novel approach for inferring attributed grammars from implementations of input grammars. By observing runtime executions and mapping the program's behavior to the grammar, we systematically extract and embed semantic actions into the grammar rules. We demonstrate the feasibility of our approach using an initial set of programs, showing that it can accurately reproduce program behavior through the generated attributed grammars.
arXiv Detail & Related papers (2025-07-17T13:32:59Z)
- Statistical Analysis of Sentence Structures through ASCII, Lexical Alignment and PCA [0.0]
It proposes a novel statistical method that uses American Standard Code for Information Interchange (ASCII) codes to represent the text of 11 corpora.
It analyzes the results through histograms and normality tests such as Shapiro-Wilk and Anderson-Darling Tests.
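The ASCII representation described in this summary can be sketched in a few lines; the helper names and statistics below are illustrative assumptions, not the paper's implementation (which additionally runs Shapiro-Wilk and Anderson-Darling tests).

```python
# Sketch (not the paper's code): represent text as ASCII/Unicode code points
# and compute simple distribution statistics as a stand-in for the
# histogram-based analysis described above.

def ascii_codes(text: str) -> list[int]:
    """Map each character to its code point."""
    return [ord(c) for c in text]

def summary_stats(codes: list[int]) -> tuple[float, float]:
    """Mean and population standard deviation of the code distribution."""
    n = len(codes)
    mean = sum(codes) / n
    var = sum((c - mean) ** 2 for c in codes) / n
    return mean, var ** 0.5

codes = ascii_codes("The quick brown fox")
mean, std = summary_stats(codes)
```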
arXiv Detail & Related papers (2025-03-13T15:42:44Z)
- Automatic Prediction of the Performance of Every Parser [0.0]
We present a new parser performance prediction (PPP) model using the machine translation performance prediction system (MTPPS).
This new system, MTPPS-PPP, can predict the performance of any language and can be useful for estimating the grammatical difficulty when understanding a text.
arXiv Detail & Related papers (2024-07-06T15:49:24Z)
- MRL Parsing Without Tears: The Case of Hebrew [14.104766026682384]
In morphologically rich languages (MRLs), where parsers need to identify multiple lexical units in each token, existing systems suffer in latency and setup complexity.
We present a new "flipped pipeline": decisions are made directly on the whole-token units by expert classifiers, each one dedicated to one specific task.
This blazingly fast approach sets a new SOTA in Hebrew POS tagging and dependency parsing, while also reaching near-SOTA performance on other Hebrew tasks.
arXiv Detail & Related papers (2024-03-11T17:54:33Z)
- SeqXGPT: Sentence-Level AI-Generated Text Detection [62.3792779440284]
We introduce a sentence-level detection challenge by synthesizing documents polished with large language models (LLMs).
We then propose SeqXGPT (Sequence X (Check) GPT), a novel method that utilizes log probability lists from white-box LLMs as features for sentence-level AIGT detection.
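One plausible reading of "log probability lists as features" is to condense a sentence's per-token log-probabilities into a fixed-size vector for a downstream classifier. The sketch below is a hypothetical illustration of that idea, not SeqXGPT's actual feature extractor.

```python
# Hypothetical sketch: turn a sentence's per-token log-probabilities
# (from a white-box LM) into a fixed feature vector for a detector.

def logprob_features(logprobs: list[float]) -> list[float]:
    """Summarize a variable-length log-prob list as [mean, var, min, max]."""
    n = len(logprobs)
    mean = sum(logprobs) / n
    var = sum((x - mean) ** 2 for x in logprobs) / n
    return [mean, var, min(logprobs), max(logprobs)]

feats = logprob_features([-1.2, -0.3, -4.1, -0.8])
```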
arXiv Detail & Related papers (2023-10-13T07:18:53Z)
- Improved Tree Search for Automatic Program Synthesis [91.3755431537592]
A key element is being able to perform an efficient search in the space of valid programs.
Here, we suggest a variant of MCTS that leads to state-of-the-art results on two vastly different DSLs.
arXiv Detail & Related papers (2023-03-13T15:09:52Z)
- SPINDLE: Spinning Raw Text into Lambda Terms with Graph Attention [0.8379286663107844]
The module transforms raw text input to programs for meaning composition, expressed as lambda terms.
Its output consists of hi-res derivations of a multimodal type-logical grammar.
arXiv Detail & Related papers (2023-02-23T14:22:45Z)
- Compositional Task-Oriented Parsing as Abstractive Question Answering [25.682923914685063]
Task-oriented parsing (TOP) aims to convert natural language into machine-readable representations of specific tasks, such as setting an alarm.
A popular approach to TOP is to apply seq2seq models to generate linearized parse trees.
A more recent line of work argues that pretrained seq2seq models are better at generating outputs that are themselves natural language, so they replace linearized parse trees with canonical natural-language paraphrases.
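The linearized parse trees mentioned above can be illustrated with a small helper; the tree shape and intent/slot labels below are assumed for illustration, not taken from this paper.

```python
# Illustrative sketch: linearize a (hypothetical) task-oriented parse tree
# into the bracketed string form that seq2seq TOP models typically generate.

def linearize(tree) -> str:
    """Render a (label, children) tree; string children are leaf tokens."""
    label, children = tree
    parts = [linearize(c) if isinstance(c, tuple) else c for c in children]
    return f"[{label} " + " ".join(parts) + " ]"

tree = ("IN:CREATE_ALARM", [("SL:DATE_TIME", ["at", "7", "am"])])
# linearize(tree) yields a single bracketed string a seq2seq model can emit.
```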
arXiv Detail & Related papers (2022-05-04T14:01:08Z)
- On The Ingredients of an Effective Zero-shot Semantic Parser [95.01623036661468]
We analyze zero-shot learning by paraphrasing training examples of canonical utterances and programs from a grammar.
We propose bridging these gaps using improved grammars, stronger paraphrasers, and efficient learning methods.
Our model achieves strong performance on two semantic parsing benchmarks (Scholar, Geo) with zero labeled data.
arXiv Detail & Related papers (2021-10-15T21:41:16Z)
- Zero-Shot Cross-lingual Semantic Parsing [56.95036511882921]
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z)
- Pareto Probing: Trading Off Accuracy for Complexity [87.09294772742737]
We argue for a probe metric that reflects the fundamental trade-off between probe complexity and performance.
Our experiments with dependency parsing reveal a wide gap in syntactic knowledge between contextual and non-contextual representations.
arXiv Detail & Related papers (2020-10-05T17:27:31Z)
- A Simple Global Neural Discourse Parser [61.728994693410954]
We propose a simple chart-based neural discourse parser that does not require any manually-crafted features and is based on learned span representations only.
We empirically demonstrate that our model achieves the best performance among global parsers, and comparable performance to state-of-the-art greedy parsers.
arXiv Detail & Related papers (2020-09-02T19:28:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.