Revisiting Regex Generation for Modeling Industrial Applications by
Incorporating Byte Pair Encoder
- URL: http://arxiv.org/abs/2005.02558v2
- Date: Wed, 24 Jun 2020 07:52:25 GMT
- Title: Revisiting Regex Generation for Modeling Industrial Applications by
Incorporating Byte Pair Encoder
- Authors: Desheng Wang, Jiawei Liu, Xiang Qi, Baolin Sun, Peng Zhang
- Abstract summary: This work focuses on automatically generating regular expressions and proposes a novel genetic algorithm to deal with this problem.
We first utilize byte pair encoder (BPE) to extract some frequent items, which are then used to construct regular expressions.
With exponential decay, training is approximately 100 times faster than without it.
- Score: 14.42244606935982
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Regular expressions are important for many natural language processing tasks, especially when dealing with unstructured and semi-structured data. This work focuses on automatically generating regular expressions and proposes a novel genetic algorithm for this problem. Unlike methods that generate regular expressions at the character level, we first utilize a byte pair encoder (BPE) to extract frequent items, which are then used to construct regular expressions. The fitness function of our genetic algorithm contains multiple objectives and is optimized through an evolutionary procedure including crossover and mutation operations. The fitness function takes into account the length of the generated regular expression, the matched characters and samples on positive training samples (to be maximized), and the matched characters and samples on negative training samples (to be minimized). In addition, to accelerate training, we apply exponential decay to the population size of the genetic algorithm. Our method, together with a strong baseline, is tested on 13 challenging datasets. The results demonstrate the effectiveness of our method, which outperforms the baseline on 10 of the 13 datasets and achieves nearly 50 percent improvement on average. With exponential decay, training is approximately 100 times faster than without it. In summary, our method is both effective and efficient, and can be deployed in industrial applications.
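The abstract describes three ingredients: BPE-style extraction of frequent items from the training strings, a multi-objective fitness that rewards matching on positive samples and penalizes matching on negative samples (with a length penalty), and an exponentially decaying population size. The following Python code is a minimal sketch of that pipeline under stated assumptions: the function names, objective weights, decay constant, and mutation operator are illustrative guesses, since the abstract does not give the authors' exact formulation.

```python
import random
import re
from collections import Counter


def bpe_frequent_items(samples, num_merges=50):
    """Greedily merge the most frequent adjacent symbol pairs (BPE-style)
    and return the merged strings as candidate regex building blocks."""
    sequences = [list(s) for s in samples]
    items = set()
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged = a + b
        items.add(merged)
        new_sequences = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_sequences.append(out)
        sequences = new_sequences
    return items


def fitness(regex, positives, negatives):
    """Multi-objective score: reward matched characters/samples on positives,
    penalize them on negatives, and penalize long expressions.
    The weights (10 and 1) are placeholders, not the paper's values."""
    try:
        pattern = re.compile(regex)
    except re.error:
        return float("-inf")  # syntactically invalid individuals die out

    def matched_chars(s):
        m = pattern.search(s)
        return len(m.group(0)) if m else 0

    pos_chars = sum(matched_chars(s) for s in positives)
    pos_hits = sum(1 for s in positives if pattern.search(s))
    neg_chars = sum(matched_chars(s) for s in negatives)
    neg_hits = sum(1 for s in negatives if pattern.search(s))
    return (pos_chars + 10 * pos_hits) - (neg_chars + 10 * neg_hits) - len(regex)


def evolve(initial_population, positives, negatives,
           generations=100, decay=0.97, min_size=20):
    """GA loop whose population size decays exponentially across generations."""
    population = list(initial_population)
    for gen in range(generations):
        size = max(min_size, int(len(initial_population) * decay ** gen))
        scored = sorted(population,
                        key=lambda r: fitness(r, positives, negatives),
                        reverse=True)
        survivors = scored[:size]
        children = []
        while len(survivors) >= 2 and len(survivors) + len(children) < 2 * size:
            # Crossover: splice two parent regexes at random cut points.
            p1, p2 = random.sample(survivors, 2)
            child = (p1[:random.randrange(len(p1) + 1)]
                     + p2[random.randrange(len(p2) + 1):])
            # Mutation: occasionally drop one character (placeholder operator).
            if child and random.random() < 0.1:
                i = random.randrange(len(child))
                child = child[:i] + child[i + 1:]
            children.append(child)
        population = survivors + children
    return max(population, key=lambda r: fitness(r, positives, negatives))
```

In a toy run the initial population might be seeded directly from the frequent items, e.g. evolve(list(bpe_frequent_items(positives)), positives, negatives); in the paper the individuals would instead be regular expressions assembled from those items.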
Related papers
- Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum [30.46329559544246]
We introduce dataset decomposition, a novel variable sequence length training technique.
We train an 8k context-length 1B model at the same cost as a 2k context-length model trained with the baseline approach.
Experiments on a web-scale corpus demonstrate that our approach significantly enhances performance on standard language evaluations and long-context benchmarks.
arXiv Detail & Related papers (2024-05-21T22:26:01Z)
- Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond [66.07002187192448]
We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task.
We introduce a strategy for building a specialized vocabulary and introduce a vocabulary merging protocol.
We find that our task-adaptive tokenization approach brings a significant improvement in generation performance while using up to 60% fewer tokens.
arXiv Detail & Related papers (2023-10-09T00:20:59Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z) - Improving Pre-trained Language Model Fine-tuning with Noise Stability
Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR).
Specifically, we propose to inject the standard Gaussian noise and regularize hidden representations of the fine-tuned model.
We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
arXiv Detail & Related papers (2022-06-12T04:42:49Z) - Curriculum-Based Self-Training Makes Better Few-Shot Learners for
Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation.
Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
arXiv Detail & Related papers (2022-06-06T16:11:58Z) - Convolutional Sparse Coding Fast Approximation with Application to
Seismic Reflectivity Estimation [9.005280130480308]
We propose a sped-up version of the classic iterative thresholding algorithm that produces a good approximation of the convolutional sparse code within 2-5 iterations.
The performance of the proposed solution is demonstrated via the seismic inversion problem in both synthetic and real data scenarios.
arXiv Detail & Related papers (2021-06-29T12:19:07Z) - SparseGAN: Sparse Generative Adversarial Network for Text Generation [8.634962333084724]
We propose SparseGAN, which generates semantically interpretable but sparse sentence representations as inputs to the discriminator.
With such semantic-rich representations, we not only reduce unnecessary noise for efficient adversarial training, but also make the entire training process fully differentiable.
arXiv Detail & Related papers (2021-03-22T04:44:43Z) - SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z) - Data-Driven Regular Expressions Evolution for Medical Text
Classification Using Genetic Programming [0.0]
This study proposes a novel regular expression-based text classification method making use of genetic programming (GP) approaches to evolve regular expressions.
Our method is evaluated with real-life medical text inquiries from an online healthcare provider and shows promising performance.
arXiv Detail & Related papers (2020-12-04T03:44:46Z) - POINTER: Constrained Progressive Text Generation via Insertion-based
Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner.
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
arXiv Detail & Related papers (2020-05-01T18:11:54Z)