Downstream Task-Oriented Neural Tokenizer Optimization with Vocabulary
Restriction as Post Processing
- URL: http://arxiv.org/abs/2304.10808v1
- Date: Fri, 21 Apr 2023 08:29:14 GMT
- Title: Downstream Task-Oriented Neural Tokenizer Optimization with Vocabulary
Restriction as Post Processing
- Authors: Tatsuya Hiraoka, Tomoya Iwakura
- Abstract summary: This paper proposes a method to optimize tokenization for the performance improvement of already trained downstream models.
Our method generates tokenization results that attain lower loss values of a given downstream model on the training data for restricting vocabularies, and trains a tokenizer that reproduces these tokenization results.
- Score: 4.781986758380065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a method to optimize tokenization for the
performance improvement of already trained downstream models. Our method
generates tokenization results that attain lower loss values of a given
downstream model on the training data for restricting vocabularies, and trains
a tokenizer that reproduces these tokenization results. Therefore, our method
can be applied to a variety of tokenization methods, whereas existing work
cannot because it learns the tokenizer and the downstream model simultaneously.
As an example, this paper proposes a BiLSTM-based tokenizer with vocabulary
restriction, which can capture wider contextual information for the
tokenization process than the non-neural tokenization methods used in existing
work. Experimental results on text classification tasks in Japanese, Chinese,
and English show that the proposed method improves performance compared to
existing methods for tokenization optimization.
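To make the selection step concrete, here is a minimal, self-contained sketch in Python. The candidate generator, the restricted vocabulary, and the toy loss below are illustrative stand-ins, not the authors' implementation; in the paper, a BiLSTM-based tokenizer is then trained to reproduce the selected tokenizations so that unseen text can be segmented consistently with the restricted vocabulary.

from itertools import combinations

def candidate_tokenizations(sentence, max_cuts=3):
    """Enumerate segmentations of `sentence` with at most `max_cuts` split points."""
    n = len(sentence)
    for k in range(max_cuts + 1):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [sentence[i:j] for i, j in zip(bounds, bounds[1:])]

def downstream_loss(tokens):
    """Toy stand-in for the loss of the frozen, already-trained downstream model."""
    return sum(1.0 / len(t) for t in tokens)

def best_tokenization(sentence, restricted_vocab):
    """Lowest-loss candidate whose tokens all belong to the restricted vocabulary."""
    valid = [toks for toks in candidate_tokenizations(sentence)
             if all(t in restricted_vocab for t in toks)]
    return min(valid, key=downstream_loss) if valid else list(sentence)

# The selected tokenizations become supervision for training a tokenizer.
restricted_vocab = {"un", "token", "iz", "able"}
print(best_tokenization("untokenizable", restricted_vocab))
# -> ['un', 'token', 'iz', 'able']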
Related papers
- An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords (sketched below).
We show that vocabulary trimming fails to improve performance and is even prone to incurring heavy degradation.
arXiv Detail & Related papers (2024-03-30T15:29:49Z) - Language Rectified Flow: Advancing Diffusion Language Generation with Probabilistic Flows [53.31856123113228]
- Language Rectified Flow: Advancing Diffusion Language Generation with Probabilistic Flows [53.31856123113228]
This paper proposes Language Rectified Flow.
Our method is based on a reformulation of standard probabilistic flow models.
Experiments and ablation studies demonstrate that our method can be general, effective, and beneficial for many NLP tasks.
arXiv Detail & Related papers (2024-03-25T17:58:22Z) - Improving Korean NLP Tasks with Linguistically Informed Subword
Tokenization and Sub-character Decomposition [6.767341847275751]
We introduce a morpheme-aware subword tokenization method that utilizes sub-character decomposition to address the challenges of applying Byte Pair.
Our approach balances linguistic accuracy with computational efficiency in Pre-trained Language Models (PLMs)
Our evaluations show that this technique achieves good performances overall, notably improving results in the syntactic task of NIKL-CoLA.
arXiv Detail & Related papers (2023-11-07T12:08:21Z) - Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy
in Mental Health and Beyond [66.07002187192448]
We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task.
We introduce a strategy for building a specialized vocabulary and a vocabulary merging protocol.
We find that our task-adaptive tokenization approach brings a significant improvement in generation performance while using up to 60% fewer tokens.
arXiv Detail & Related papers (2023-10-09T00:20:59Z) - Tokenization with Factorized Subword Encoding [2.538209532048867]
We propose a novel tokenization method that factorizes subwords onto discrete triplets using a VQ-VAE model.
Results indicate that this method is more appropriate and robust for morphological tasks than the commonly used byte-pair encoding (BPE) tokenization algorithm.
arXiv Detail & Related papers (2023-06-13T13:27:34Z) - Scalable Learning of Latent Language Structure With Logical Offline
Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z) - Improving Pre-trained Language Model Fine-tuning with Noise Stability
Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR).
Specifically, we propose to inject standard Gaussian noise and regularize hidden representations of the fine-tuned model (sketched below).
We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
arXiv Detail & Related papers (2022-06-12T04:42:49Z) - Obtaining Better Static Word Embeddings Using Contextual Embedding
- Obtaining Better Static Word Embeddings Using Contextual Embedding Models [53.86080627007695]
Our proposed distillation method is a simple extension of CBOW-based training.
As a side-effect, our approach also allows a fair comparison of both contextual and static embeddings.
arXiv Detail & Related papers (2021-06-08T12:59:32Z) - Joint Optimization of Tokenization and Downstream Model [22.336172850954938]
We propose a novel method to find an appropriate tokenization for a given downstream model by jointly optimizing a tokenizer and the model (sketched below).
The proposed method has no restriction except for using loss values computed by the downstream model to train the tokenizer.
We evaluate whether our method contributes to improving performance on text classification in three languages and machine translation in eight language pairs.
arXiv Detail & Related papers (2021-05-26T09:05:10Z) - Lexically Constrained Neural Machine Translation with Levenshtein
- Lexically Constrained Neural Machine Translation with Levenshtein Transformer [8.831954614241234]
This paper proposes a simple and effective algorithm for incorporating lexical constraints in neural machine translation.
Our method injects terminology constraints at inference time without any impact on decoding speed.
arXiv Detail & Related papers (2020-04-27T09:59:27Z)