Revisiting Regex Generation for Modeling Industrial Applications by
Incorporating Byte Pair Encoder
- URL: http://arxiv.org/abs/2005.02558v2
- Date: Wed, 24 Jun 2020 07:52:25 GMT
- Title: Revisiting Regex Generation for Modeling Industrial Applications by
Incorporating Byte Pair Encoder
- Authors: Desheng Wang, Jiawei Liu, Xiang Qi, Baolin Sun, Peng Zhang
- Abstract summary: This work focuses on automatically generating regular expressions and proposes a novel genetic algorithm to deal with this problem.
We first utilize byte pair encoder (BPE) to extract some frequent items, which are then used to construct regular expressions.
With exponential decay, training is approximately 100 times faster than without it.
- Score: 14.42244606935982
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Regular expressions are important for many natural language processing tasks, especially when dealing with unstructured and semi-structured data. This work focuses on automatically generating regular expressions and proposes a novel genetic algorithm for this problem. Unlike methods that generate regular expressions at the character level, we first utilize a byte pair encoder (BPE) to extract frequent items, which are then used to construct regular expressions. The fitness function of our genetic algorithm contains multiple objectives and is optimized through an evolutionary procedure including crossover and mutation operations. The fitness function takes into account the length of the generated regular expression, the matched characters and samples on positive training samples (to be maximized), and the matched characters and samples on negative training samples (to be minimized). In addition, to accelerate training, we apply exponential decay to the population size of the genetic algorithm. Our method, together with a strong baseline, is tested on 13 challenging datasets. The results demonstrate the effectiveness of our method, which outperforms the baseline on 10 of the 13 datasets and achieves nearly 50 percent improvement on average. With exponential decay, training is approximately 100 times faster than without it. In summary, our method is both effective and efficient, and can be deployed in industrial applications.
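The abstract describes three ingredients: BPE-style extraction of frequent items from the training strings, a multi-objective fitness that rewards matching on positive samples and penalizes matching on negative samples (with a length penalty), and an exponentially decaying population size. The following Python code is a minimal sketch of that pipeline under stated assumptions: the function names, objective weights, decay constant, and mutation operator are illustrative guesses, since the abstract does not give the authors' exact formulation.

```python
import random
import re
from collections import Counter


def bpe_frequent_items(samples, num_merges=50):
    """Greedily merge the most frequent adjacent symbol pairs (BPE-style)
    and return the merged strings as candidate regex building blocks."""
    sequences = [list(s) for s in samples]
    items = set()
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged = a + b
        items.add(merged)
        new_sequences = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_sequences.append(out)
        sequences = new_sequences
    return items


def fitness(regex, positives, negatives):
    """Multi-objective score: reward matched characters/samples on positives,
    penalize them on negatives, and penalize long expressions.
    The weights (10 and 1) are placeholders, not the paper's values."""
    try:
        pattern = re.compile(regex)
    except re.error:
        return float("-inf")  # syntactically invalid individuals die out

    def matched_chars(s):
        m = pattern.search(s)
        return len(m.group(0)) if m else 0

    pos_chars = sum(matched_chars(s) for s in positives)
    pos_hits = sum(1 for s in positives if pattern.search(s))
    neg_chars = sum(matched_chars(s) for s in negatives)
    neg_hits = sum(1 for s in negatives if pattern.search(s))
    return (pos_chars + 10 * pos_hits) - (neg_chars + 10 * neg_hits) - len(regex)


def evolve(initial_population, positives, negatives,
           generations=100, decay=0.97, min_size=20):
    """GA loop whose population size decays exponentially across generations."""
    population = list(initial_population)
    for gen in range(generations):
        size = max(min_size, int(len(initial_population) * decay ** gen))
        scored = sorted(population,
                        key=lambda r: fitness(r, positives, negatives),
                        reverse=True)
        survivors = scored[:size]
        children = []
        while len(survivors) >= 2 and len(survivors) + len(children) < 2 * size:
            # Crossover: splice two parent regexes at random cut points.
            p1, p2 = random.sample(survivors, 2)
            child = (p1[:random.randrange(len(p1) + 1)]
                     + p2[random.randrange(len(p2) + 1):])
            # Mutation: occasionally drop one character (placeholder operator).
            if child and random.random() < 0.1:
                i = random.randrange(len(child))
                child = child[:i] + child[i + 1:]
            children.append(child)
        population = survivors + children
    return max(population, key=lambda r: fitness(r, positives, negatives))
```

In a toy run the initial population might be seeded directly from the frequent items, e.g. evolve(list(bpe_frequent_items(positives)), positives, negatives); in the paper the individuals would instead be regular expressions assembled from those items.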
Related papers
- Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum [30.46329559544246]
We introduce dataset decomposition, a novel variable sequence length training technique.
We train an 8k context-length 1B model at the same cost as a 2k context-length model trained with the baseline approach.
Experiments on a web-scale corpus demonstrate that our approach significantly enhances performance on standard language evaluations and long-context benchmarks.
arXiv Detail & Related papers (2024-05-21T22:26:01Z)
- Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond [66.07002187192448]
We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task.
We introduce a strategy for building a specialized vocabulary and introduce a vocabulary merging protocol.
We find that our task-adaptive tokenization approach brings a significant improvement in generation performance while using up to 60% fewer tokens.
arXiv Detail & Related papers (2023-10-09T00:20:59Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z) - Improving Pre-trained Language Model Fine-tuning with Noise Stability
Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR).
Specifically, we propose to inject the standard Gaussian noise and regularize hidden representations of the fine-tuned model.
We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
arXiv Detail & Related papers (2022-06-12T04:42:49Z) - Curriculum-Based Self-Training Makes Better Few-Shot Learners for
Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation.
Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
arXiv Detail & Related papers (2022-06-06T16:11:58Z) - Convolutional Sparse Coding Fast Approximation with Application to
Seismic Reflectivity Estimation [9.005280130480308]
We propose a sped-up version of the classic iterative thresholding algorithm that produces a good approximation of the convolutional sparse code within 2-5 iterations.
The performance of the proposed solution is demonstrated via the seismic inversion problem in both synthetic and real data scenarios.
arXiv Detail & Related papers (2021-06-29T12:19:07Z) - SparseGAN: Sparse Generative Adversarial Network for Text Generation [8.634962333084724]
We propose SparseGAN, which generates semantically interpretable but sparse sentence representations as inputs to the discriminator.
With such semantic-rich representations, we not only reduce unnecessary noise for efficient adversarial training, but also make the entire training process fully differentiable.
arXiv Detail & Related papers (2021-03-22T04:44:43Z) - SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z) - Data-Driven Regular Expressions Evolution for Medical Text
Classification Using Genetic Programming [0.0]
This study proposes a novel regular expression-based text classification method making use of genetic programming (GP) approaches to evolve regular expressions.
Our method is evaluated with real-life medical text inquiries from an online healthcare provider and shows promising performance.
arXiv Detail & Related papers (2020-12-04T03:44:46Z) - POINTER: Constrained Progressive Text Generation via Insertion-based
Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner.
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
arXiv Detail & Related papers (2020-05-01T18:11:54Z)