Neuro-Symbolic Regex Synthesis Framework via Neural Example Splitting
- URL: http://arxiv.org/abs/2205.11258v1
- Date: Fri, 20 May 2022 05:55:24 GMT
- Title: Neuro-Symbolic Regex Synthesis Framework via Neural Example Splitting
- Authors: Su-Hyeon Kim, Hyunjoon Cheon, Yo-Sub Han, Sang-Ki Ko
- Abstract summary: We tackle the problem of learning regexes faster from positive and negative strings by relying on a novel approach called `neural example splitting'.
Our approach essentially splits up each example string into multiple parts using a neural network trained to group similar substrings from positive strings.
We propose an effective synthesis framework called `SplitRegex' that synthesizes subregexes from `split' positive substrings and produces the final regex by concatenating the synthesized subregexes.
- Score: 8.076841611508488
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to the practical importance of regular expressions (regexes, for short),
there has been a lot of research to automatically generate regexes from
positive and negative string examples. We tackle the problem of learning
regexes faster from positive and negative strings by relying on a novel
approach called `neural example splitting'. Our approach essentially splits up
each example string into multiple parts using a neural network trained to group
similar substrings from positive strings. This helps to learn a regex faster
and, thus, more accurately since we now learn from several short-length
strings. We propose an effective regex synthesis framework called `SplitRegex'
that synthesizes subregexes from `split' positive substrings and produces the
final regex by concatenating the synthesized subregexes. For negative
samples, we exploit pre-generated subregexes during the subregex synthesis
process and perform matching against the negative strings. Then the final regex
becomes consistent with all negative strings. SplitRegex is a
divide-and-conquer framework for learning target regexes: split (=divide)
positive strings and infer partial regexes for the multiple parts, which is much
more accurate than inferring from the whole string, and concatenate (=conquer)
inferred regexes while satisfying negative strings. We empirically demonstrate
that the proposed SplitRegex framework substantially improves the previous
regex synthesis approaches over four benchmark datasets.
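To make the divide-and-conquer pipeline described in the abstract concrete, below is a minimal Python sketch, not the authors' implementation: `neural_split` and `synthesize_subregex` are hypothetical placeholders standing in for the paper's neural example-splitting model and for any positive-example subregex synthesizer, and the negative-string constraint is simplified to a single final consistency check rather than being interleaved with subregex synthesis as SplitRegex does.

```python
import re
from typing import Callable, List


def split_regex_sketch(
    positives: List[str],
    negatives: List[str],
    neural_split: Callable[[List[str]], List[List[str]]],
    synthesize_subregex: Callable[[List[str]], str],
) -> str:
    """Divide-and-conquer regex synthesis in the spirit of SplitRegex.

    `neural_split` and `synthesize_subregex` are hypothetical placeholders,
    not components from the paper.
    """
    # Divide: split every positive string into aligned parts, e.g.
    # ["ab12", "cd34"] -> [["ab", "12"], ["cd", "34"]].
    split_positives = neural_split(positives)

    # Conquer: synthesize one subregex per column of aligned substrings
    # (assumes every positive was split into the same number of parts),
    # then concatenate the subregexes left to right.
    num_parts = len(split_positives[0])
    subregexes = []
    for i in range(num_parts):
        column = [parts[i] for parts in split_positives]
        subregexes.append(synthesize_subregex(column))
    candidate = "".join(subregexes)

    # Consistency check: the final regex must reject every negative string.
    for neg in negatives:
        if re.fullmatch(candidate, neg):
            raise ValueError(f"candidate {candidate!r} wrongly accepts {neg!r}")
    return candidate


if __name__ == "__main__":
    # Toy stand-ins for illustration only: a fixed two-character split and a
    # synthesizer that simply unions the observed substrings.
    toy_split = lambda ps: [[s[:2], s[2:]] for s in ps]
    toy_synth = lambda col: "(" + "|".join(map(re.escape, col)) + ")"
    print(split_regex_sketch(["ab12", "cd34"], ["zz99"], toy_split, toy_synth))
    # -> (ab|cd)(12|34)
```

The toy split and synthesizer at the bottom only illustrate the data flow; real splitting and subregex-synthesis components would generalize beyond the observed substrings rather than enumerate them.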
Related papers
- Is Reuse All You Need? A Systematic Comparison of Regular Expression Composition Strategies [5.503553586086489]
Are composition tasks unique enough to merit dedicated machinery, or is reuse all we need?
We collect a novel dataset of composition tasks mined from GitHub and RegExLib.
Our evaluation uses multiple dimensions, including a novel metric, to compare reuse-by-example against two synthesis approaches.
arXiv Detail & Related papers (2025-03-26T14:25:27Z) - WikiSplit++: Easy Data Refinement for Split and Rephrase [19.12982606032723]
Split and Rephrase splits a complex sentence into multiple simple sentences with the same meaning.
We create WikiSplit++ by removing instances in WikiSplit where complex sentences do not entail at least one of the simpler sentences.
Our approach yields significant gains in the number of splits and the entailment ratio, a proxy for measuring hallucinations.
arXiv Detail & Related papers (2024-04-13T13:07:32Z) - Correct and Optimal: the Regular Expression Inference Challenge [10.899596368151892]
We propose regular expression inference (REI) as a challenge for code/language modelling.
We generate and publish the first large-scale datasets for REI.
arXiv Detail & Related papers (2023-08-15T17:40:10Z) - Copy Is All You Need [66.00852205068327]
We formulate text generation as progressively copying text segments from an existing text collection.
Our approach achieves better generation quality according to both automatic and human evaluations.
Our approach attains additional performance gains by simply scaling up to larger text collections.
arXiv Detail & Related papers (2023-07-13T05:03:26Z) - Linear-Time Modeling of Linguistic Structure: An Order-Theoretic
Perspective [97.57162770792182]
Tasks that model the relation between pairs of tokens in a string are a vital part of understanding natural language.
We show that these exhaustive comparisons can be avoided, and, moreover, the complexity can be reduced to linear by casting the relation between tokens as a partial order over the string.
Our method predicts real numbers for each token in a string in parallel and sorts the tokens accordingly, resulting in total orders of the tokens in the string.
arXiv Detail & Related papers (2023-05-24T11:47:35Z) - Cascading and Direct Approaches to Unsupervised Constituency Parsing on
Spoken Sentences [67.37544997614646]
We present the first study on unsupervised spoken constituency parsing.
The goal is to determine the spoken sentences' hierarchical syntactic structure in the form of constituency parse trees.
We show that accurate segmentation alone may be sufficient to parse spoken sentences accurately.
arXiv Detail & Related papers (2023-03-15T17:57:22Z) - Improving Structured Text Recognition with Regular Expression Biasing [13.801707647700727]
We study the problem of recognizing structured text that follows certain formats.
We propose to improve the recognition accuracy of structured text by specifying regular expressions (regexes) for biasing.
arXiv Detail & Related papers (2021-11-10T23:12:05Z) - FOREST: An Interactive Multi-tree Synthesizer for Regular Expressions [5.21480688623047]
We present FOREST, a regular expression synthesizer for digital form validations.
FOREST produces a regular expression that matches the desired pattern for the input values.
We also present a new SMT encoding to synthesize capture conditions for a given regular expression.
arXiv Detail & Related papers (2020-12-28T14:06:01Z) - Benchmarking Multimodal Regex Synthesis with Complex Structures [45.35689345004124]
Existing datasets for regular expression (regex) generation from natural language are limited in complexity.
We introduce StructuredRegex, a new synthesis dataset differing from prior ones in three aspects.
arXiv Detail & Related papers (2020-05-02T00:16:09Z) - Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z) - Multi-level Head-wise Match and Aggregation in Transformer for Textual
Sequence Matching [87.97265483696613]
We propose a new approach to sequence pair matching with Transformer, by learning head-wise matching representations on multiple levels.
Experiments show that our proposed approach can achieve new state-of-the-art performance on multiple tasks.
arXiv Detail & Related papers (2020-01-20T20:02:02Z)