A Formal Perspective on Byte-Pair Encoding
        - URL: http://arxiv.org/abs/2306.16837v3
- Date: Mon, 2 Sep 2024 08:33:21 GMT
- Title: A Formal Perspective on Byte-Pair Encoding
- Authors: Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Tim Vieira, Mrinmaya Sachan, Ryan Cotterell, 
- Abstract summary: Byte-Pair imation (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method.
We provide a faster implementation of BPE which improves the runtime complexity from $mathcalOleft(N Mright)$ to $mathcalOleft(N log Mright)$, where $N$ is the sequence length and $M$ is the merge count.
- Score: 100.75374173565548
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{{\sigma(\boldsymbol{\mu}^\star)}}(1-e^{-{\sigma(\boldsymbol{\mu}^\star)}})$-approximation of an optimal merge sequence, where ${\sigma(\boldsymbol{\mu}^\star)}$ is the total backward curvature with respect to the optimal merge sequence $\boldsymbol{\mu}^\star$. Empirically the lower bound of the approximation is $\approx 0.37$.   We provide a faster implementation of BPE which improves the runtime complexity from $\mathcal{O}\left(N M\right)$ to $\mathcal{O}\left(N \log M\right)$, where $N$ is the sequence length and $M$ is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization. 
 
      
        Related papers
        - Variance-Reduced Fast Krasnoselkii-Mann Methods for Finite-Sum   Root-Finding Problems [8.0153031008486]
 We propose a new class of fast Krasnoselkii--Mann methods with variance reduction to solve a finite-sum co-coercive equation $Gx = 0$.
Our algorithm is single-loop and leverages a new family of unbiased variance-reduced estimators specifically designed for a wider class of root-finding algorithms.
 numerical experiments validate our algorithms and demonstrate their promising performance compared to state-of-the-art methods.
 arXiv  Detail & Related papers  (2024-06-04T15:23:29Z)
- Solving Dense Linear Systems Faster Than via Preconditioning [1.8854491183340518]
 We show that our algorithm has an $tilde O(n2)$ when $k=O(n0.729)$.
In particular, our algorithm has an $tilde O(n2)$ when $k=O(n0.729)$.
Our main algorithm can be viewed as a randomized block coordinate descent method.
 arXiv  Detail & Related papers  (2023-12-14T12:53:34Z)
- Faster Stochastic Algorithms for Minimax Optimization under
  Polyak--{\L}ojasiewicz Conditions [12.459354707528819]
 We propose SPIDER-GDA for solving the finite-sum problem of the form $min_x max_y f(x,y)triqangle frac1n sum_i=1n f_i(x,y)$.
We prove SPIDER-GDA could find an $epsilon$-optimal solution within $mathcal Oleft((n + sqrtn,kappa_xkappa_y2)log (1/epsilon)
 arXiv  Detail & Related papers  (2023-07-29T02:26:31Z)
- Memory-Constrained Algorithms for Convex Optimization via Recursive
  Cutting-Planes [23.94542304111204]
 First class of algorithms that provides a positive trade-off between gradient descent and cutting-plane methods in any regime with $epsilonleq 1/sqrt d$.
In the regime $epsilon leq d-Omega(d)$, our algorithm with $p=d$ achieves the information-theoretic optimal memory usage and improves the oracle-complexity of gradient descent.
 arXiv  Detail & Related papers  (2023-06-16T17:00:51Z)
- A One-Sample Decentralized Proximal Algorithm for Non-Convex Stochastic
  Composite Optimization [10.762749887051546]
 We propose two-time scale algorithms: ProxDAS-A and Proxcal$DASA-GT.
Unlike prior work, our algorithms achieve comparable complexity without requiring large batch sizes, more complex per-it operations, or stronger assumptions.
 arXiv  Detail & Related papers  (2023-02-20T05:16:18Z)
- An Optimal Stochastic Algorithm for Decentralized Nonconvex Finite-sum
  Optimization [25.21457349137344]
 We show a proof to show DEAREST requires at most $mathcal O(+sqrtmnLvarepsilon-2)$ first-order oracle (IFO) calls and $mathcal O(Lvarepsilon-2/sqrt1-lambda_W)$ communication rounds.
 arXiv  Detail & Related papers  (2022-10-25T11:37:11Z)
- Sketching Algorithms and Lower Bounds for Ridge Regression [65.0720777731368]
 We give a sketching-based iterative algorithm that computes $1+varepsilon$ approximate solutions for the ridge regression problem.
We also show that this algorithm can be used to give faster algorithms for kernel ridge regression.
 arXiv  Detail & Related papers  (2022-04-13T22:18:47Z)
- Computational Complexity of Normalizing Constants for the Product of
  Determinantal Point Processes [12.640283469603357]
 We study the computational complexity of computing the normalizing constant.
We show that $sum_Sdet(bf A_S,S)p$ exactly for every (fixed) positive even integer $p$ is UP-hard and Mod$_3$P-hard.
 arXiv  Detail & Related papers  (2021-11-28T14:08:25Z)
- Fast Graph Sampling for Short Video Summarization using Gershgorin Disc
  Alignment [52.577757919003844]
 We study the problem of efficiently summarizing a short video into several paragraphs, leveraging recent progress in fast graph sampling.
 Experimental results show that our algorithm achieves comparable video summarization as state-of-the-art methods, at a substantially reduced complexity.
 arXiv  Detail & Related papers  (2021-10-21T18:43:00Z)
- Bayesian Optimistic Optimisation with Exponentially Decaying Regret [58.02542541410322]
 The current practical BO algorithms have regret bounds ranging from $mathcalO(fraclogNsqrtN)$ to $mathcal O(e-sqrtN)$, where $N$ is the number of evaluations.
This paper explores the possibility of improving the regret bound in the noiseless setting by intertwining concepts from BO and tree-based optimistic optimisation.
We propose the BOO algorithm, a first practical approach which can achieve an exponential regret bound with order $mathcal O(N-sqrt
 arXiv  Detail & Related papers  (2021-05-10T13:07:44Z)
- Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
 We study online learning with bandit feedback (i.e. learner has access to only zeroth-order oracle) where cost/reward functions admit a "pseudo-1d" structure.
We show a lower bound of $min(sqrtdT, T3/4)$ for the regret of any algorithm, where $T$ is the number of rounds.
We propose a new algorithm sbcalg that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
 arXiv  Detail & Related papers  (2021-02-15T08:16:51Z)
- On Robust Optimal Transport: Computational Complexity, Low-rank
  Approximation, and Barycenter Computation [14.80695185915604]
 We consider two robust versions of optimal transport, named $textitRobust Semi-constrained Optimal Transport$ (RSOT) and $textitRobust Unconstrained Optimal Transport$ (ROT)
For both problems in the discrete settings, we propose Sinkhorn-based algorithms that produce $varepsilon$-approximations of RSOT and ROT in $widetildemathcalO(fracn2varepsilon)$ time.
 arXiv  Detail & Related papers  (2021-02-13T03:55:52Z)
- Streaming Complexity of SVMs [110.63976030971106]
 We study the space complexity of solving the bias-regularized SVM problem in the streaming model.
We show that for both problems, for dimensions of $frac1lambdaepsilon$, one can obtain streaming algorithms with spacely smaller than $frac1lambdaepsilon$.
 arXiv  Detail & Related papers  (2020-07-07T17:10:00Z)
- Fixed-Support Wasserstein Barycenters: Computational Hardness and Fast
  Algorithm [100.11971836788437]
 We study the fixed-support Wasserstein barycenter problem (FS-WBP)
We develop a provably fast textitdeterministic variant of the celebrated iterative Bregman projection (IBP) algorithm, named textscFastIBP.
 arXiv  Detail & Related papers  (2020-02-12T03:40:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     
           This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.