Is Reuse All You Need? A Systematic Comparison of Regular Expression Composition Strategies
- URL: http://arxiv.org/abs/2503.20579v1
- Date: Wed, 26 Mar 2025 14:25:27 GMT
- Title: Is Reuse All You Need? A Systematic Comparison of Regular Expression Composition Strategies
- Authors: Berk Çakar, Charles M. Sale, Sophie Chen, Ethan H. Burmane, Dongyoon Lee, James C. Davis
- Abstract summary: Are composition tasks unique enough to merit dedicated machinery, or is reuse all we need? We collect a novel dataset of composition tasks mined from GitHub and RegExLib. Our evaluation uses multiple dimensions, including a novel metric, to compare reuse-by-example against two synthesis approaches.
- Score: 5.503553586086489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Composing regular expressions (regexes) is a common but challenging engineering activity. Software engineers struggle with regex complexity, leading to defects, performance issues, and security vulnerabilities. Researchers have proposed tools to synthesize regexes automatically, and recent generative AI techniques are also promising. Meanwhile, developers commonly reuse existing regexes from Internet sources and codebases. In this study, we ask a simple question: are regex composition tasks unique enough to merit dedicated machinery, or is reuse all we need? We answer this question through a systematic evaluation of state-of-the-art regex reuse and synthesis strategies. We begin by collecting a novel dataset of regex composition tasks mined from GitHub and RegExLib (55,137 unique tasks with solution regexes). To address the absence of an automated regex reuse formulation, we introduce reuse-by-example, a Programming by Example (PbE) approach that leverages a curated database of production-ready regexes. Our evaluation then uses multiple dimensions, including a novel metric, to compare reuse-by-example against two synthesis approaches: formal regex synthesizers and generative AI (LLMs). Although all approaches can solve these composition tasks accurately, reuse and LLMs both do far better over the range of metrics we applied. Ceteris paribus, prefer the cheaper solution -- for regex composition, perhaps reuse is all you need. Our findings provide actionable insights for developers selecting regex composition strategies and inform the design of future tools to improve regex reliability in software systems.
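The reuse-by-example idea described in the abstract can be sketched in a few lines: given positive and negative example strings, filter a database of candidate regexes down to those consistent with every example. The candidate patterns and function name below are illustrative assumptions, not the paper's actual artifact or database.

```python
import re

def reuse_by_example(candidates, positives, negatives):
    """Return the candidate regexes consistent with all examples.

    candidates: iterable of regex pattern strings (e.g. mined from
    RegExLib or GitHub -- the patterns used here are toy examples).
    """
    consistent = []
    for pattern in candidates:
        try:
            compiled = re.compile(pattern)
        except re.error:
            continue  # skip malformed database entries
        if all(compiled.fullmatch(p) for p in positives) and \
           not any(compiled.fullmatch(n) for n in negatives):
            consistent.append(pattern)
    return consistent

# Illustrative candidate database and examples
db = [r"\d+", r"[a-z]+", r"\d{4}-\d{2}-\d{2}"]
hits = reuse_by_example(db, positives=["2024-01-31"], negatives=["hello", "123"])
print(hits)  # ['\\d{4}-\\d{2}-\\d{2}']
```

A real reuse engine would also rank the surviving candidates (the paper compares approaches along several quality dimensions), but exact filtering against examples is the core of the PbE formulation.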
Related papers
- In-Context Learning for Extreme Multi-Label Classification [29.627891261947536]
Multi-label classification problems with thousands of classes are hard to solve with in-context learning alone.
We propose a general program that defines multi-step interactions between LMs and retrievers to efficiently tackle such problems.
Our solution requires no finetuning, is easily applicable to new tasks, alleviates prompt engineering, and requires only tens of labeled examples.
arXiv Detail & Related papers (2024-01-22T18:09:52Z) - Compositional Program Generation for Few-Shot Systematic Generalization [59.57656559816271]
This study presents a neuro-symbolic architecture called the Compositional Program Generator (CPG).
CPG has three key features: modularity, composition, and abstraction, in the form of grammar rules.
It achieves perfect generalization on both the SCAN and COGS benchmarks using just 14 examples for SCAN and 22 examples for COGS.
arXiv Detail & Related papers (2023-09-28T14:33:20Z) - Toward Unified Controllable Text Generation via Regular Expression Instruction [56.68753672187368]
Our paper introduces Regular Expression Instruction (REI), which utilizes an instruction-based mechanism to fully exploit regular expressions' advantages to uniformly model diverse constraints.
Our method only requires fine-tuning on medium-scale language models or few-shot, in-context learning on large language models, and requires no further adjustment when applied to various constraint combinations.
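The core mechanism REI describes, expressing diverse generation constraints as regular expressions, can be illustrated with a simple post-hoc check. This is a hedged sketch only: REI itself conditions generation on the instruction rather than filtering outputs afterwards, and the constraint below is a toy example.

```python
import re

def satisfies_constraint(text, constraint):
    """Check whether generated text satisfies a regex-expressed
    constraint (post-hoc filtering, a simplification of REI's
    instruction-based conditioning)."""
    return re.fullmatch(constraint, text) is not None

# A toy lexical constraint: output must contain the word "regex"
constraint = r".*\bregex\b.*"
print(satisfies_constraint("I love regex tools", constraint))  # True
print(satisfies_constraint("I love patterns", constraint))     # False
```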
arXiv Detail & Related papers (2023-09-19T09:05:14Z) - Correct and Optimal: the Regular Expression Inference Challenge [10.899596368151892]
We propose regular expression inference (REI) as a challenge for code/language modelling.
We generate and publish the first large-scale datasets for REI.
arXiv Detail & Related papers (2023-08-15T17:40:10Z) - ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis [54.18659323181771]
We characterize several different forms of compositional generalization that are desirable in program synthesis.
We propose ExeDec, a novel decomposition-based strategy that predicts execution subgoals to solve problems step-by-step informed by program execution at each step.
arXiv Detail & Related papers (2023-07-26T01:07:52Z) - Enriching Relation Extraction with OpenIE [70.52564277675056]
Relation extraction (RE) is a sub-discipline of information extraction (IE)
In this work, we explore how recent approaches for open information extraction (OpenIE) may help to improve the task of RE.
Our experiments over two annotated corpora, KnowledgeNet and FewRel, demonstrate the improved accuracy of our enriched models.
arXiv Detail & Related papers (2022-12-19T11:26:23Z) - Neuro-Symbolic Regex Synthesis Framework via Neural Example Splitting [8.076841611508488]
We tackle the problem of learning regexes faster from positive and negative strings by relying on a novel approach called 'neural example splitting'.
Our approach essentially splits up each example string into multiple parts using a neural network trained to group similar substrings from positive strings.
We propose an effective synthesis framework called 'SplitRegex' that synthesizes subregexes from the split positive strings and produces the final regex by concatenating the synthesized subregexes.
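The divide-and-concatenate idea behind SplitRegex can be sketched as follows. Both the splitting heuristic (a fixed delimiter) and the per-part synthesizer (a character-class guesser) are toy stand-ins for the paper's neural splitter and real subregex synthesis.

```python
import re

def synthesize_subregex(parts):
    """Toy per-part synthesizer: generalize a column of substrings
    to a character class (stand-in for a real regex synthesizer)."""
    if all(p.isdigit() for p in parts):
        return r"\d+"
    if all(p.isalpha() for p in parts):
        return r"[A-Za-z]+"
    return "|".join(re.escape(p) for p in parts)

def split_regex(positives, sep):
    """Split each positive string on a delimiter, synthesize a
    subregex per column, and concatenate the results."""
    columns = zip(*(p.split(sep) for p in positives))
    subregexes = [synthesize_subregex(col) for col in columns]
    return re.escape(sep).join(subregexes)

pattern = split_regex(["abc-123", "x-9"], sep="-")
print(pattern)  # [A-Za-z]+\-\d+
```

Splitting shrinks each subproblem's search space, which is why synthesizing subregexes and concatenating them can be faster than synthesizing one monolithic regex.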
arXiv Detail & Related papers (2022-05-20T05:55:24Z) - Three Sentences Are All You Need: Local Path Enhanced Document Relation Extraction [54.95848026576076]
We present an embarrassingly simple but effective method to select evidence sentences for document-level RE.
We have released our code at https://github.com/AndrewZhe/Three-Sentences-Are-All-You-Need.
arXiv Detail & Related papers (2021-06-03T12:29:40Z) - FOREST: An Interactive Multi-tree Synthesizer for Regular Expressions [5.21480688623047]
We present FOREST, a regular expression synthesizer for digital form validations.
FOREST produces a regular expression that matches the desired pattern for the input values.
We also present a new SMT encoding to synthesize capture conditions for a given regular expression.
arXiv Detail & Related papers (2020-12-28T14:06:01Z) - Benchmarking Multimodal Regex Synthesis with Complex Structures [45.35689345004124]
Existing datasets for regular expression (regex) generation from natural language are limited in complexity.
We introduce StructuredRegex, a new synthesis dataset differing from prior ones in three aspects.
arXiv Detail & Related papers (2020-05-02T00:16:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.