Benchmarking Multimodal Regex Synthesis with Complex Structures
- URL: http://arxiv.org/abs/2005.00663v1
- Date: Sat, 2 May 2020 00:16:09 GMT
- Title: Benchmarking Multimodal Regex Synthesis with Complex Structures
- Authors: Xi Ye, Qiaochu Chen, Isil Dillig and Greg Durrett
- Abstract summary: Existing datasets for regular expression (regex) generation from natural language are limited in complexity.
We introduce StructuredRegex, a new synthesis dataset differing from prior ones in three aspects.
- Score: 45.35689345004124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing datasets for regular expression (regex) generation from natural
language are limited in complexity; compared to regex tasks that users post on
StackOverflow, the regexes in these datasets are simple, and the language used
to describe them is not diverse. We introduce StructuredRegex, a new regex
synthesis dataset differing from prior ones in three aspects. First, to obtain
structurally complex and realistic regexes, we generate the regexes using a
probabilistic grammar with pre-defined macros observed from real-world
StackOverflow posts. Second, to obtain linguistically diverse natural language
descriptions, we show crowdworkers abstract depictions of the underlying regex
and ask them to describe the pattern they see, rather than having them
paraphrase synthetic language. Third, we augment each regex example with a
collection of strings that are and are not matched by the ground truth regex,
similar to how real users give examples. Our quantitative and qualitative
analysis demonstrates the advantages of StructuredRegex over prior datasets.
Further experimental results using various multimodal synthesis techniques
highlight the challenge presented by our dataset, including non-local
constraints and multi-modal inputs.
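The multimodal input format described in the abstract, a regex paired with strings that are and are not matched by it, can be illustrated with a toy example (the regex and the strings below are invented for illustration, not taken from the dataset):

```python
import re

# Hypothetical regex of the structured kind the dataset targets:
# 1-3 digits, optionally followed by a dash and two letters.
pattern = re.compile(r"\d{1,3}(-[A-Za-z]{2})?")

# As in StructuredRegex, each regex is paired with strings that
# are and are not matched by the ground-truth regex.
positive = ["7", "123", "42-ab"]
negative = ["1234", "42-abc", "-ab"]

assert all(pattern.fullmatch(s) for s in positive)
assert not any(pattern.fullmatch(s) for s in negative)
print("all example strings consistent with the regex")
```

A synthesizer consuming this specification must produce a regex consistent with both the natural-language description and every labeled string, which is what makes the inputs multi-modal.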
Related papers
- Is Reuse All You Need? A Systematic Comparison of Regular Expression Composition Strategies [5.503553586086489]
Are composition tasks unique enough to merit dedicated machinery, or is reuse all we need?
We collect a novel dataset of composition tasks mined from GitHub and RegExLib.
Our evaluation uses multiple dimensions, including a novel metric, to compare reuse-by-example against two synthesis approaches.
arXiv Detail & Related papers (2025-03-26T14:25:27Z) - Compositional Program Generation for Few-Shot Systematic Generalization [59.57656559816271]
This study presents a neuro-symbolic architecture called the Compositional Program Generator (CPG).
CPG has three key features: modularity, composition, and abstraction, in the form of grammar rules.
It achieves perfect generalization on both the SCAN and COGS benchmarks using just 14 examples for SCAN and 22 examples for COGS.
arXiv Detail & Related papers (2023-09-28T14:33:20Z) - Correct and Optimal: the Regular Expression Inference Challenge [10.899596368151892]
We propose regular expression inference (REI) as a challenge for code/language modelling.
We generate and publish the first large-scale datasets for REI.
arXiv Detail & Related papers (2023-08-15T17:40:10Z) - Linear-Time Modeling of Linguistic Structure: An Order-Theoretic Perspective [97.57162770792182]
Tasks that model the relation between pairs of tokens in a string are a vital part of understanding natural language.
We show that these exhaustive comparisons can be avoided, and, moreover, the complexity can be reduced to linear by casting the relation between tokens as a partial order over the string.
Our method predicts real numbers for each token in a string in parallel and sorts the tokens accordingly, resulting in total orders of the tokens in the string.
arXiv Detail & Related papers (2023-05-24T11:47:35Z) - Structured information extraction from complex scientific text with fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction.
The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts.
This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
arXiv Detail & Related papers (2022-12-10T07:51:52Z) - Explaining Patterns in Data with Language Models via Interpretable Autoprompting [143.4162028260874]
We introduce interpretable autoprompting (iPrompt), an algorithm that generates a natural-language string explaining the data.
iPrompt can yield meaningful insights by accurately finding ground-truth dataset descriptions.
Experiments with an fMRI dataset show the potential for iPrompt to aid in scientific discovery.
arXiv Detail & Related papers (2022-10-04T18:32:14Z) - Neuro-Symbolic Regex Synthesis Framework via Neural Example Splitting [8.076841611508488]
We tackle the problem of learning regexes faster from positive and negative strings by relying on a novel approach called "neural example splitting".
Our approach splits up each positive example string into multiple parts using a neural network trained to group similar substrings of positive strings.
We propose an effective synthesis framework called SplitRegex that synthesizes subregexes from the split positive strings and produces the final regex by concatenating the synthesized subregexes.
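The split-then-concatenate idea can be sketched in miniature. SplitRegex uses a trained neural network to split positive strings; in this toy stand-in, a hand-coded splitter (digits vs. letters) and a trivial character-class generalizer play those roles, purely for illustration:

```python
import re

# Hand-coded stand-in for the neural splitter: break each positive
# string into a digit part and a letter part.
def split_example(s):
    m = re.match(r"(\d+)([a-z]+)$", s)
    return m.groups() if m else None

# Trivial "subregex synthesizer": generalize a column of aligned parts
# to a character class with the observed length bounds.
def synth_subregex(parts):
    lo, hi = min(map(len, parts)), max(map(len, parts))
    cls = r"\d" if parts[0].isdigit() else "[a-z]"
    return f"{cls}{{{lo},{hi}}}"

positives = ["12ab", "345x", "6qqq"]
columns = list(zip(*map(split_example, positives)))

# Final regex is the concatenation of the per-part subregexes.
final = "".join(synth_subregex(col) for col in columns)
print(final)
assert all(re.fullmatch(final, s) for s in positives)
```

Each subregex is synthesized against a much shorter column of substrings than the full examples, which is the source of the speedup the paper reports.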
arXiv Detail & Related papers (2022-05-20T05:55:24Z) - Improving Structured Text Recognition with Regular Expression Biasing [13.801707647700727]
We study the problem of recognizing structured text that follows certain formats.
We propose to improve the recognition accuracy of structured text by specifying regular expressions (regexes) for biasing.
arXiv Detail & Related papers (2021-11-10T23:12:05Z) - FOREST: An Interactive Multi-tree Synthesizer for Regular Expressions [5.21480688623047]
We present FOREST, a regular expression synthesizer for digital form validations.
FOREST produces a regular expression that matches the desired pattern of the input values.
We also present a new SMT encoding to synthesize capture conditions for a given regular expression.
arXiv Detail & Related papers (2020-12-28T14:06:01Z) - Extractive Summarization as Text Matching [123.09816729675838]
This paper proposes a paradigm shift in how we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.