Related papers: Towards the Systematic Testing of Regular Expression Engines

URL: http://arxiv.org/abs/2603.00311v1
Date: Fri, 27 Feb 2026 21:00:31 GMT
Title: Towards the Systematic Testing of Regular Expression Engines
Authors: Berk Çakar, Dongyoon Lee, James C. Davis
Abstract summary: ReTest is a framework that systematically tests regular expression engines. It combines grammar-aware fuzzing for high code coverage with metamorphic testing to generate dialect-independent test oracles. Our preliminary evaluation on PCRE shows that ReTest achieves 3x higher edge coverage than existing fuzzing approaches.
Score: 8.561133495117675
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Software engineers use regular expressions (regexes) across a wide range of domains and tasks. To support regexes, software projects must integrate a regex engine, whether provided natively by the language runtime (e.g., Python's re) or included as an external dependency (e.g., PCRE). However, these engines may contain bugs and introduce vulnerabilities. A common strategy for testing regex engines involves differential testing -- comparing outputs across different implementations. However, this approach is concerning because regex syntax and semantics vary significantly between dialects (e.g., POSIX vs. PCRE). Fuzzing is also utilized to ease testing of feature-rich regex implementations to expose defects, but naive byte-level mutations generate syntactically invalid inputs that exercise only parsing logic, not matching internals. In this work, we describe our progress towards ReTest, a framework that systematically tests regular expression engines by combining grammar-aware fuzzing for high code coverage with metamorphic testing to generate dialect-independent test oracles. So far, we have surveyed testing practices across 22 regex engines, analyzed 1,007 regex engine bugs and 156 CVEs to characterize failure modes, and curated 16 metamorphic relations for regexes derived from Kleene algebra. Our preliminary evaluation on PCRE shows that ReTest achieves 3x higher edge coverage than existing fuzzing approaches and has identified three new memory safety defects. We conclude by describing our next steps toward our ultimate goal: helping regex engine developers identify bugs without depending on a consistent cross-implementation standard.
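The combination described in the abstract, grammar-aware generation plus metamorphic oracles, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the tiny grammar, the helper names (`gen_regex`, `find_violations`, `RELATIONS`), and the three relations, which are textbook Kleene-algebra identities and not necessarily among the 16 relations the authors curated. The sketch also exercises Python's `re` rather than PCRE, and it is not ReTest's implementation.

```python
import itertools
import random
import re

ALPHABET = "ab"

def gen_regex(rng, depth=3):
    """Grammar-aware generation: every output is a syntactically valid regex."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(ALPHABET)
    kind = rng.choice(["concat", "alt", "star"])
    left = gen_regex(rng, depth - 1)
    if kind == "star":
        return f"(?:{left})*"
    right = gen_regex(rng, depth - 1)
    if kind == "concat":
        return f"(?:{left})(?:{right})"
    return f"(?:{left}|{right})"

# Each relation builds two regexes from r that should accept the same language.
# No second engine is needed: the engine under test is compared against itself.
RELATIONS = {
    "r|r == r": (lambda r: f"(?:{r}|{r})", lambda r: f"(?:{r})"),
    "(r*)* == r*": (lambda r: f"(?:(?:{r})*)*", lambda r: f"(?:{r})*"),
    "r* == eps|rr*": (lambda r: f"(?:{r})*", lambda r: f"(?:|(?:{r})(?:{r})*)"),
}

def all_inputs(max_len=4):
    """All strings over ALPHABET up to max_len characters."""
    return ["".join(p) for n in range(max_len + 1)
            for p in itertools.product(ALPHABET, repeat=n)]

def accepts(pattern, text):
    """Whole-string acceptance, so results reflect language membership."""
    return re.fullmatch(pattern, text) is not None

def find_violations(regexes, inputs):
    """Return (relation, regex, input) triples where the two sides disagree."""
    violations = []
    for r in regexes:
        for name, (lhs, rhs) in RELATIONS.items():
            for text in inputs:
                if accepts(lhs(r), text) != accepts(rhs(r), text):
                    violations.append((name, r, text))
    return violations

if __name__ == "__main__":
    rng = random.Random(0)
    candidates = [gen_regex(rng) for _ in range(50)]
    for violation in find_violations(candidates, all_inputs()):
        print("possible engine bug:", violation)
```

Because both sides of each relation run on the same engine, the oracle does not depend on how another dialect interprets the pattern, which is the property the abstract contrasts with differential testing. A reported violation is only a candidate defect and would still need manual triage.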

Related papers

Protocol Testing with I/O Grammars [45.68497486907946]
We propose a novel approach to protocol testing that combines input generation and output checking in a single framework. We demonstrate that I/O grammars can specify advanced protocol features correctly and completely, while also enabling output validation of the programs under test.
arXiv Detail & Related papers (2025-09-24T16:41:04Z)
Is Reuse All You Need? A Systematic Comparison of Regular Expression Composition Strategies [7.304676960008862]
Composing regexes is a common but challenging engineering activity. Developers commonly reuse existing regexes from sources. No work to date has compared these various composition strategies.
arXiv Detail & Related papers (2025-03-26T14:25:27Z)
LlamaRestTest: Effective REST API Testing with Small Language Models [50.058600784556816]
We present LlamaRestTest, a novel approach that employs two custom Large Language Models (LLMs) to generate realistic test inputs. We evaluate it against several state-of-the-art REST API testing tools, including RESTGPT, a GPT-powered specification-enhancement tool. Our study shows that small language models can perform as well as, or better than, large language models in REST API testing.
arXiv Detail & Related papers (2025-01-15T05:51:20Z)
REST: Retrieval-Based Speculative Decoding [69.06115086237207]
We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to speed up language model generation. Unlike previous methods that rely on a draft language model for speculative decoding, REST harnesses the power of retrieval to generate draft tokens. When benchmarked on 7B and 13B language models in a single-batch setting, REST achieves a significant speedup of 1.62X to 2.36X on code or text generation.
arXiv Detail & Related papers (2023-11-14T15:43:47Z)
RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair [75.40584530380589]
We propose a novel Retrieval-Augmented Patch Generation framework (RAP-Gen) that explicitly leverages relevant fix patterns retrieved from a list of previous bug-fix pairs. We evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript, and Code Refinement and Defects4J benchmarks in Java.
arXiv Detail & Related papers (2023-09-12T08:52:56Z)
InfeRE: Step-by-Step Regex Generation via Chain of Inference [15.276963928784047]
In this paper, we propose a new paradigm called InfeRE, which decomposes the generation of expressions into chains of step-by-step inference. We evaluate InfeRE on two publicly available datasets, NL-RX-Turk and KB13, and compare the results with state-of-the-art approaches and the popular tree-based generation approach TRANX.
arXiv Detail & Related papers (2023-08-08T04:37:41Z)
Improving Structured Text Recognition with Regular Expression Biasing [13.801707647700727]
We study the problem of recognizing structured text that follows certain formats. We propose to improve the recognition accuracy of structured text by specifying regular expressions (regexes) for biasing.
arXiv Detail & Related papers (2021-11-10T23:12:05Z)
Wasserstein Distance Regularized Sequence Representation for Text Matching in Asymmetrical Domains [51.91456788949489]
We propose a novel matching method tailored for text matching in asymmetrical domains, called WD-Match. In WD-Match, a Wasserstein distance-based regularizer is defined to regularize the feature vectors projected from different domains. The training process of WD-Match amounts to a game that minimizes the matching loss regularized by the Wasserstein distance.
arXiv Detail & Related papers (2020-10-15T12:52:09Z)
Benchmarking Multimodal Regex Synthesis with Complex Structures [45.35689345004124]
Existing datasets for regular expression (regex) generation from natural language are limited in complexity. We introduce StructuredRegex, a new synthesis dataset differing from prior ones in three aspects.
arXiv Detail & Related papers (2020-05-02T00:16:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
