Token Alignment via Character Matching for Subword Completion
- URL: http://arxiv.org/abs/2403.08688v1
- Date: Wed, 13 Mar 2024 16:44:39 GMT
- Title: Token Alignment via Character Matching for Subword Completion
- Authors: Ben Athiwaratkun, Shiqi Wang, Mingyue Shang, Yuchen Tian, Zijian Wang,
Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Rob Kwiatowski, Ramesh
Nallapati, Bing Xiang
- Abstract summary: This paper examines a technique to alleviate the tokenization artifact on text completion in generative models.
The method, termed token alignment, involves backtracking to the last complete tokens and ensuring the model's generation aligns with the prompt.
- Score: 34.76794239097628
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative models, widely utilized in various applications, can often
struggle with prompts corresponding to partial tokens. This struggle stems from
tokenization, where partial tokens fall out of distribution during inference,
leading to incorrect or nonsensical outputs. This paper examines a technique to
alleviate the tokenization artifact on text completion in generative models,
maintaining performance even in regular non-subword cases. The method, termed
token alignment, involves backtracking to the last complete tokens and ensuring
the model's generation aligns with the prompt. This approach showcases marked
improvement across many partial token scenarios, including nuanced cases like
space-prefix and partial indentation, with only a minor time increase. The
technique and analysis detailed in this paper contribute to the continuous
advancement of generative models in handling partial inputs, bearing relevance
for applications like code completion and text autocompletion.
Related papers
- Tokenization as Finite-State Transduction [24.19959327497118]
We introduce a finite-state framework which can efficiently encode all possible tokenizations of a regular language.
We show that Byte-Pair.
Match (BPE) and MaxPiece (WordPiece) fit within this framework.
An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern.
arXiv Detail & Related papers (2024-10-21T07:10:07Z) - Empowering Character-level Text Infilling by Eliminating Sub-Tokens [34.37743927032878]
FIM-SE stands for Fill-In-the-Middle with both Starting and Ending character constraints.
We introduce FIM-SE, which stands for Fill-In-the-Middle with both Starting and Ending character constraints.
arXiv Detail & Related papers (2024-05-27T12:21:48Z) - SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [68.68025991850115]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP)
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z) - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely textithidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z) - Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic
Segmentation [59.37587762543934]
This paper studies the problem of weakly open-vocabulary semantic segmentation (WOVSS)
Existing methods suffer from a granularity inconsistency regarding the usage of group tokens.
We propose the prototypical guidance network (PGSeg) that incorporates multi-modal regularization.
arXiv Detail & Related papers (2023-10-29T13:18:00Z) - Linear-Time Modeling of Linguistic Structure: An Order-Theoretic
Perspective [97.57162770792182]
Tasks that model the relation between pairs of tokens in a string are a vital part of understanding natural language.
We show that these exhaustive comparisons can be avoided, and, moreover, the complexity can be reduced to linear by casting the relation between tokens as a partial order over the string.
Our method predicts real numbers for each token in a string in parallel and sorts the tokens accordingly, resulting in total orders of the tokens in the string.
arXiv Detail & Related papers (2023-05-24T11:47:35Z) - Attributable and Scalable Opinion Summarization [79.87892048285819]
We generate abstractive summaries by decoding frequent encodings, and extractive summaries by selecting the sentences assigned to the same frequent encodings.
Our method is attributable, because the model identifies sentences used to generate the summary as part of the summarization process.
It scales easily to many hundreds of input reviews, because aggregation is performed in the latent space rather than over long sequences of tokens.
arXiv Detail & Related papers (2023-05-19T11:30:37Z) - Improving Tokenisation by Alternative Treatment of Spaces [7.596737214110957]
We experiment with an alternative tokenisation approach where spaces are always treated as individual tokens.
We find that our modified algorithms lead to improved performance on downstream NLP tasks.
arXiv Detail & Related papers (2022-04-08T13:22:30Z) - Counterfactual Multi-Token Fairness in Text Classification [0.0]
The concept of Counterfactual Generation has been extended to multi-token support valid over all forms of texts and documents.
We define the method of generating counterfactuals by perturbing multiple sensitive tokens as Counterfactual Multi-token Generation.
arXiv Detail & Related papers (2022-02-08T11:30:19Z) - Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems.
We generate document representations that capture both text and metadata artifacts in a task manner.
Our solution also incorporates metadata explicitly rather than just augmenting them with text.
arXiv Detail & Related papers (2020-10-23T21:52:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.