Related papers: Vaporetto: Efficient Japanese Tokenization Based on Improved Pointwise Linear Classification

URL: http://arxiv.org/abs/2406.17185v1
Date: Mon, 24 Jun 2024 23:47:20 GMT
Title: Vaporetto: Efficient Japanese Tokenization Based on Improved Pointwise Linear Classification
Authors: Koichi Akabe, Shunsuke Kanda, Yusuke Oda, Shinsuke Mori
Abstract summary: This paper proposes an approach to improve the runtime efficiency of Japanese tokenization based on the pointwise linear classification (PLC) framework. Our approach optimizes tokenization by leveraging the characteristics of the PLC framework and the task definition.
Score: 2.2125465557153756
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper proposes an approach to improve the runtime efficiency of Japanese tokenization based on the pointwise linear classification (PLC) framework, which formulates the whole tokenization process as a sequence of linear classification problems. Our approach optimizes tokenization by leveraging the characteristics of the PLC framework and the task definition. Our approach involves (1) composing multiple classifications into array-based operations, (2) efficient feature lookup with memory-optimized automata, and (3) three orthogonal pre-processing methods for reducing actual score calculation. Thus, our approach makes the tokenization speed 5.7 times faster than the current approach based on the same model without decreasing tokenization accuracy. Our implementation is available at https://github.com/daac-tools/vaporetto under the MIT or Apache-2.0 license.
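To make the PLC formulation in the abstract concrete, here is a minimal Python sketch of the baseline idea only: each gap between adjacent characters is scored independently by a linear model over nearby character n-grams, and a positive score marks a token boundary. It is not Vaporetto's implementation and includes none of the paper's optimizations (array-based composition, automaton lookup, pre-processing). The function names, the window size, and the toy weights are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical feature weights: (character n-gram, offset of its first
# character relative to the boundary) -> contribution to the boundary score.
Weights = Dict[Tuple[str, int], float]


def boundary_scores(text: str, weights: Weights, window: int = 2,
                    max_n: int = 2, bias: float = 0.0) -> List[float]:
    """Score each gap between adjacent characters independently."""
    scores = []
    for i in range(1, len(text)):  # gap between text[i-1] and text[i]
        score = bias
        lo = max(0, i - window)
        hi = min(len(text), i + window)
        for start in range(lo, hi):
            for n in range(1, max_n + 1):
                end = start + n
                if end > hi:  # the n-gram must fit inside the window
                    break
                score += weights.get((text[start:end], start - i), 0.0)
        scores.append(score)
    return scores


def tokenize(text: str, weights: Weights, **kwargs) -> List[str]:
    """Split the text at every gap whose linear score is positive."""
    if not text:
        return []
    tokens, begin = [], 0
    for i, score in enumerate(boundary_scores(text, weights, **kwargs), start=1):
        if score > 0:
            tokens.append(text[begin:i])
            begin = i
    tokens.append(text[begin:])
    return tokens


# Toy weights (made up, not trained): favor a boundary before and after "が".
TOY_WEIGHTS: Weights = {("が", 0): 2.0, ("が", -1): 2.0}
print(tokenize("猫が鳴く", TOY_WEIGHTS, bias=-1.0))
```

Because each boundary decision depends only on a fixed window of characters, the per-gap scores can in principle be computed in bulk, which is the property the paper's array-based optimization exploits.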

Related papers

Enhancing Item Tokenization for Generative Recommendation through Self-Improvement [67.94240423434944]
Generative recommendation systems are driven by large language models (LLMs). Current item tokenization methods include using text descriptions, numerical strings, or sequences of discrete tokens. We propose a self-improving item tokenization method that allows the LLM to refine its own item tokenizations during the training process.
arXiv Detail & Related papers (2024-12-22T21:56:15Z)
Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment [81.84950252537618]
This paper reveals a unified game-theoretic connection between iterative BOND and self-play alignment. We establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win rate dominance optimization.
arXiv Detail & Related papers (2024-10-28T04:47:39Z)
COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks. We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges. Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
Learning Optimal Signal Temporal Logic Decision Trees for Classification: A Max-Flow MILP Formulation [5.924780594614676]
This paper presents a novel framework for inferring timed temporal logic properties from data. We formulate the inference process as a mixed integer linear programming optimization problem. Applying a max-flow algorithm on the resultant tree transforms the problem into a global optimization challenge. We conduct three case studies involving two-class, multi-class, and complex formula classification scenarios.
arXiv Detail & Related papers (2024-07-30T16:56:21Z)
Non-uniformity is All You Need: Efficient and Timely Encrypted Traffic Classification With ECHO [3.9154800026646566]
This paper introduces ECHO -- a novel optimization process for ML/DL-based encrypted traffic classification. ECHO targets both classification time and memory utilization and incorporates two innovative techniques.
arXiv Detail & Related papers (2024-06-03T23:54:48Z)
Dynamic Perceiver for Efficient Visual Recognition [87.08210214417309]
We propose Dynamic Perceiver (Dyn-Perceiver) to decouple the feature extraction procedure and the early classification task. A feature branch serves to extract image features, while a classification branch processes a latent code assigned for classification tasks. Early exits are placed exclusively within the classification branch, thus eliminating the need for linear separability in low-level features.
arXiv Detail & Related papers (2023-06-20T03:00:22Z)
BO-ICP: Initialization of Iterative Closest Point Based on Bayesian Optimization [3.248584983235657]
We present a new method based on Bayesian optimization for finding the critical initial ICP transform. We show that our approach outperforms state-of-the-art methods when given similar computation time. It is compatible with other improvements to ICP, as it focuses solely on the selection of an initial transform.
arXiv Detail & Related papers (2023-04-25T19:38:53Z)
Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples. We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment. We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
Multiple Classifiers Based Maximum Classifier Discrepancy for Unsupervised Domain Adaptation [25.114533037440896]
We propose to extend the structure of two classifiers to multiple classifiers to further boost its performance. We demonstrate that, on average, adopting the structure of three classifiers normally yields the best performance as a trade-off between the accuracy and efficiency.
arXiv Detail & Related papers (2021-08-02T03:00:13Z)
Self Normalizing Flows [65.73510214694987]
We propose a flexible framework for training normalizing flows by replacing expensive terms in the gradient by learned approximate inverses at each layer. This reduces the computational complexity of each layer's exact update from $\mathcal{O}(D^3)$ to $\mathcal{O}(D^2)$. We show experimentally that such models are remarkably stable and optimize to similar data likelihood values as their exact gradient counterparts.
arXiv Detail & Related papers (2020-11-14T09:51:51Z)
Fast Few-Shot Classification by Few-Iteration Meta-Learning [173.32497326674775]
We introduce a fast optimization-based meta-learning method for few-shot classification. Our strategy enables important aspects of the base learner objective to be learned during meta-training. We perform a comprehensive experimental analysis, demonstrating the speed and effectiveness of our approach.
arXiv Detail & Related papers (2020-10-01T15:59:31Z)
An Extensive Experimental Evaluation of Automated Machine Learning Methods for Recommending Classification Algorithms (Extended Version) [4.400989370979334]
Three of these methods are based on Evolutionary Algorithms (EAs), and the other is Auto-WEKA, a well-known AutoML method. We performed controlled experiments where these four AutoML methods were given the same runtime limit for different values of this limit. In general, the difference in predictive accuracy of the three best AutoML methods was not statistically significant.
arXiv Detail & Related papers (2020-09-16T02:36:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.