Related papers: A Formal Perspective on Byte-Pair Encoding

A Formal Perspective on Byte-Pair Encoding

URL: http://arxiv.org/abs/2306.16837v3
Date: Mon, 2 Sep 2024 08:33:21 GMT
Title: A Formal Perspective on Byte-Pair Encoding
Authors: Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Tim Vieira, Mrinmaya Sachan, Ryan Cotterell,
Abstract summary: Byte-Pair imation (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. We provide a faster implementation of BPE which improves the runtime complexity from $mathcalOleft(N Mright)$ to $mathcalOleft(N log Mright)$, where $N$ is the sequence length and $M$ is the merge count.
Score: 100.75374173565548
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{{\sigma(\boldsymbol{\mu}^\star)}}(1-e^{-{\sigma(\boldsymbol{\mu}^\star)}})$-approximation of an optimal merge sequence, where ${\sigma(\boldsymbol{\mu}^\star)}$ is the total backward curvature with respect to the optimal merge sequence $\boldsymbol{\mu}^\star$. Empirically the lower bound of the approximation is $\approx 0.37$. We provide a faster implementation of BPE which improves the runtime complexity from $\mathcal{O}\left(N M\right)$ to $\mathcal{O}\left(N \log M\right)$, where $N$ is the sequence length and $M$ is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.

Related papers

Variance-Reduced Fast Krasnoselkii-Mann Methods for Finite-Sum Root-Finding Problems [8.0153031008486]
We propose a new class of fast Krasnoselkii--Mann methods with variance reduction to solve a finite-sum co-coercive equation $Gx = 0$. Our algorithm is single-loop and leverages a new family of unbiased variance-reduced estimators specifically designed for a wider class of root-finding algorithms. numerical experiments validate our algorithms and demonstrate their promising performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2024-06-04T15:23:29Z)
Memory-Constrained Algorithms for Convex Optimization via Recursive Cutting-Planes [23.94542304111204]
First class of algorithms that provides a positive trade-off between gradient descent and cutting-plane methods in any regime with $epsilonleq 1/sqrt d$. In the regime $epsilon leq d-Omega(d)$, our algorithm with $p=d$ achieves the information-theoretic optimal memory usage and improves the oracle-complexity of gradient descent.
arXiv Detail & Related papers (2023-06-16T17:00:51Z)
An Optimal Stochastic Algorithm for Decentralized Nonconvex Finite-sum Optimization [25.21457349137344]
We show a proof to show DEAREST requires at most $mathcal O(+sqrtmnLvarepsilon-2)$ first-order oracle (IFO) calls and $mathcal O(Lvarepsilon-2/sqrt1-lambda_W)$ communication rounds.
arXiv Detail & Related papers (2022-10-25T11:37:11Z)
Computational Complexity of Normalizing Constants for the Product of Determinantal Point Processes [12.640283469603357]
We study the computational complexity of computing the normalizing constant. We show that $sum_Sdet(bf A_S,S)p$ exactly for every (fixed) positive even integer $p$ is UP-hard and Mod$_3$P-hard.
arXiv Detail & Related papers (2021-11-28T14:08:25Z)
Fast Graph Sampling for Short Video Summarization using Gershgorin Disc Alignment [52.577757919003844]
We study the problem of efficiently summarizing a short video into several paragraphs, leveraging recent progress in fast graph sampling. Experimental results show that our algorithm achieves comparable video summarization as state-of-the-art methods, at a substantially reduced complexity.
arXiv Detail & Related papers (2021-10-21T18:43:00Z)
Bayesian Optimistic Optimisation with Exponentially Decaying Regret [58.02542541410322]
The current practical BO algorithms have regret bounds ranging from $mathcalO(fraclogNsqrtN)$ to $mathcal O(e-sqrtN)$, where $N$ is the number of evaluations. This paper explores the possibility of improving the regret bound in the noiseless setting by intertwining concepts from BO and tree-based optimistic optimisation. We propose the BOO algorithm, a first practical approach which can achieve an exponential regret bound with order $mathcal O(N-sqrt
arXiv Detail & Related papers (2021-05-10T13:07:44Z)
Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
We study online learning with bandit feedback (i.e. learner has access to only zeroth-order oracle) where cost/reward functions admit a "pseudo-1d" structure. We show a lower bound of $min(sqrtdT, T3/4)$ for the regret of any algorithm, where $T$ is the number of rounds. We propose a new algorithm sbcalg that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
arXiv Detail & Related papers (2021-02-15T08:16:51Z)
On Robust Optimal Transport: Computational Complexity, Low-rank Approximation, and Barycenter Computation [14.80695185915604]
We consider two robust versions of optimal transport, named $textitRobust Semi-constrained Optimal Transport$ (RSOT) and $textitRobust Unconstrained Optimal Transport$ (ROT) For both problems in the discrete settings, we propose Sinkhorn-based algorithms that produce $varepsilon$-approximations of RSOT and ROT in $widetildemathcalO(fracn2varepsilon)$ time.
arXiv Detail & Related papers (2021-02-13T03:55:52Z)
Streaming Complexity of SVMs [110.63976030971106]
We study the space complexity of solving the bias-regularized SVM problem in the streaming model. We show that for both problems, for dimensions of $frac1lambdaepsilon$, one can obtain streaming algorithms with spacely smaller than $frac1lambdaepsilon$.
arXiv Detail & Related papers (2020-07-07T17:10:00Z)
Fixed-Support Wasserstein Barycenters: Computational Hardness and Fast Algorithm [100.11971836788437]
We study the fixed-support Wasserstein barycenter problem (FS-WBP) We develop a provably fast textitdeterministic variant of the celebrated iterative Bregman projection (IBP) algorithm, named textscFastIBP.
arXiv Detail & Related papers (2020-02-12T03:40:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.