Benchmarking Multimodal Regex Synthesis with Complex Structures
- URL: http://arxiv.org/abs/2005.00663v1
- Date: Sat, 2 May 2020 00:16:09 GMT
- Title: Benchmarking Multimodal Regex Synthesis with Complex Structures
- Authors: Xi Ye, Qiaochu Chen, Isil Dillig and Greg Durrett
- Abstract summary: Existing datasets for regular expression (regex) generation from natural language are limited in complexity.
We introduce StructuredRegex, a new synthesis dataset differing from prior ones in three aspects.
- Score: 45.35689345004124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing datasets for regular expression (regex) generation from natural
language are limited in complexity; compared to regex tasks that users post on
StackOverflow, the regexes in these datasets are simple, and the language used
to describe them is not diverse. We introduce StructuredRegex, a new regex
synthesis dataset differing from prior ones in three aspects. First, to obtain
structurally complex and realistic regexes, we generate the regexes using a
probabilistic grammar with pre-defined macros observed from real-world
StackOverflow posts. Second, to obtain linguistically diverse natural language
descriptions, we show crowdworkers abstract depictions of the underlying regex
and ask them to describe the pattern they see, rather than having them
paraphrase synthetic language. Third, we augment each regex example with a
collection of strings that are and are not matched by the ground truth regex,
similar to how real users give examples. Our quantitative and qualitative
analysis demonstrates the advantages of StructuredRegex over prior datasets.
Further experimental results using various multimodal synthesis techniques
highlight the challenge presented by our dataset, including non-local
constraints and multi-modal inputs.
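The multimodal input format described in the abstract, a regex paired with strings that are and are not matched by it, can be illustrated with a toy example (the regex and the strings below are invented for illustration, not taken from the dataset):

```python
import re

# Hypothetical regex of the structured kind the dataset targets:
# 1-3 digits, optionally followed by a dash and two letters.
pattern = re.compile(r"\d{1,3}(-[A-Za-z]{2})?")

# As in StructuredRegex, each regex is paired with strings that
# are and are not matched by the ground-truth regex.
positive = ["7", "123", "42-ab"]
negative = ["1234", "42-abc", "-ab"]

assert all(pattern.fullmatch(s) for s in positive)
assert not any(pattern.fullmatch(s) for s in negative)
print("all example strings consistent with the regex")
```

A synthesizer consuming this specification must produce a regex consistent with both the natural-language description and every labeled string, which is what makes the inputs multi-modal.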
Related papers
- Is Reuse All You Need? A Systematic Comparison of Regular Expression Composition Strategies [5.503553586086489]
Are composition tasks unique enough to merit dedicated machinery, or is reuse all we need?
We collect a novel dataset of composition tasks mined from GitHub and RegExLib.
Our evaluation uses multiple dimensions, including a novel metric, to compare reuse-by-example against two synthesis approaches.
arXiv Detail & Related papers (2025-03-26T14:25:27Z) - Compositional Program Generation for Few-Shot Systematic Generalization [59.57656559816271]
This study presents a neuro-symbolic architecture called the Compositional Program Generator (CPG).
CPG has three key features: modularity, composition, and abstraction, in the form of grammar rules.
It achieves perfect generalization on both the SCAN and COGS benchmarks using just 14 examples for SCAN and 22 examples for COGS.
arXiv Detail & Related papers (2023-09-28T14:33:20Z) - Correct and Optimal: the Regular Expression Inference Challenge [10.899596368151892]
We propose regular expression inference (REI) as a challenge for code/language modelling.
We generate and publish the first large-scale datasets for REI.
arXiv Detail & Related papers (2023-08-15T17:40:10Z) - Linear-Time Modeling of Linguistic Structure: An Order-Theoretic Perspective [97.57162770792182]
Tasks that model the relation between pairs of tokens in a string are a vital part of understanding natural language.
We show that these exhaustive comparisons can be avoided, and, moreover, the complexity can be reduced to linear by casting the relation between tokens as a partial order over the string.
Our method predicts real numbers for each token in a string in parallel and sorts the tokens accordingly, resulting in total orders of the tokens in the string.
arXiv Detail & Related papers (2023-05-24T11:47:35Z) - Structured information extraction from complex scientific text with fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction.
The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts.
This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
arXiv Detail & Related papers (2022-12-10T07:51:52Z) - Explaining Patterns in Data with Language Models via Interpretable Autoprompting [143.4162028260874]
We introduce interpretable autoprompting (iPrompt), an algorithm that generates a natural-language string explaining the data.
iPrompt can yield meaningful insights by accurately finding ground-truth dataset descriptions.
Experiments with an fMRI dataset show the potential for iPrompt to aid in scientific discovery.
arXiv Detail & Related papers (2022-10-04T18:32:14Z) - Neuro-Symbolic Regex Synthesis Framework via Neural Example Splitting [8.076841611508488]
We tackle the problem of learning regexes faster from positive and negative strings by relying on a novel approach called "neural example splitting".
Our approach splits up each positive example string into multiple parts using a neural network trained to group similar substrings of positive strings.
We propose an effective synthesis framework called SplitRegex that synthesizes subregexes from the split positive strings and produces the final regex by concatenating the synthesized subregexes.
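The split-then-concatenate idea can be sketched in miniature. SplitRegex uses a trained neural network to split positive strings; in this toy stand-in, a hand-coded splitter (digits vs. letters) and a trivial character-class generalizer play those roles, purely for illustration:

```python
import re

# Hand-coded stand-in for the neural splitter: break each positive
# string into a digit part and a letter part.
def split_example(s):
    m = re.match(r"(\d+)([a-z]+)$", s)
    return m.groups() if m else None

# Trivial "subregex synthesizer": generalize a column of aligned parts
# to a character class with the observed length bounds.
def synth_subregex(parts):
    lo, hi = min(map(len, parts)), max(map(len, parts))
    cls = r"\d" if parts[0].isdigit() else "[a-z]"
    return f"{cls}{{{lo},{hi}}}"

positives = ["12ab", "345x", "6qqq"]
columns = list(zip(*map(split_example, positives)))

# Final regex is the concatenation of the per-part subregexes.
final = "".join(synth_subregex(col) for col in columns)
print(final)
assert all(re.fullmatch(final, s) for s in positives)
```

Each subregex is synthesized against a much shorter column of substrings than the full examples, which is the source of the speedup the paper reports.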
arXiv Detail & Related papers (2022-05-20T05:55:24Z) - Improving Structured Text Recognition with Regular Expression Biasing [13.801707647700727]
We study the problem of recognizing structured text that follows certain formats.
We propose to improve the recognition accuracy of structured text by specifying regular expressions (regexes) for biasing.
arXiv Detail & Related papers (2021-11-10T23:12:05Z) - FOREST: An Interactive Multi-tree Synthesizer for Regular Expressions [5.21480688623047]
We present FOREST, a regular expression synthesizer for digital form validations.
FOREST produces a regular expression that matches the desired pattern of the input values.
We also present a new SMT encoding to synthesize capture conditions for a given regular expression.
arXiv Detail & Related papers (2020-12-28T14:06:01Z) - Extractive Summarization as Text Matching [123.09816729675838]
This paper proposes a paradigm shift in how we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.