Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning
- URL: http://arxiv.org/abs/2411.17679v1
- Date: Tue, 26 Nov 2024 18:44:39 GMT
- Title: Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning
- Authors: Zhu Xu, Zhiqiang Zhao, Zihan Zhang, Yuchi Liu, Quanwei Shen, Fei Liu, Yu Kuang,
- Abstract summary: Token Internal Position Awareness (TIPA) is a novel approach that enhances LLMs' understanding of internal token structures.
TIPA enables models to effectively learn and generalize character positions and internal structures.
- Abstract: Tokenization techniques such as Byte-Pair Encoding (BPE) and Byte-Level BPE (BBPE) have significantly improved the computational efficiency and vocabulary representation stability of large language models (LLMs) by segmenting text into tokens. However, this segmentation often obscures the internal character structures and sequences within tokens, preventing models from fully learning these intricate details during training. Consequently, LLMs struggle to comprehend the character compositions and positional relationships within tokens, especially when fine-tuned on downstream tasks with limited data. In this paper, we introduce Token Internal Position Awareness (TIPA), a novel approach that enhances LLMs' understanding of internal token structures by training them on reverse character prediction tasks using the tokenizer's own vocabulary. This method enables models to effectively learn and generalize character positions and internal structures. Experimental results demonstrate that LLMs trained with TIPA outperform baseline models in predicting character positions at the token level. Furthermore, when applied to the downstream task of Chinese Spelling Correction (CSC), TIPA not only accelerates model convergence but also significantly improves task performance.
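The reverse character prediction task described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `reverse_char_positions` and `build_examples`, the toy vocabulary, and the JSON target format (1-based positions listed from last to first) are all assumptions made for this example. Real BPE/BBPE vocabularies also contain byte-level and special-marker tokens, which this sketch does not handle.

```python
import json
from typing import Dict, List


def reverse_char_positions(token: str) -> Dict[int, str]:
    """Map 1-based character positions to characters, ordered last to first.

    Assumed target format for illustration only.
    """
    n = len(token)
    # reversed() walks the token from its last character to its first,
    # so position n is emitted first and position 1 last.
    return {n - i: ch for i, ch in enumerate(reversed(token))}


def build_examples(vocab: List[str]) -> List[Dict[str, str]]:
    """Turn each non-empty vocabulary entry into a (prompt, target) pair."""
    examples = []
    for token in vocab:
        if not token:
            continue  # skip empty entries
        target = json.dumps(reverse_char_positions(token), ensure_ascii=False)
        examples.append({"prompt": token, "target": target})
    return examples


if __name__ == "__main__":
    # Hypothetical toy vocabulary; a real run would iterate over the
    # tokenizer's own vocabulary instead.
    toy_vocab = ["cat", "小说", "ing"]
    for example in build_examples(toy_vocab):
        print(example)
```

Each pair asks the model to spell a token backwards with explicit positions, which is one plausible way to expose the token's length and internal character order during training.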
Related papers
- Enhancing LLM's Cognition via Structurization [41.13997892843677]
Large language models (LLMs) process input contexts through a causal and sequential perspective.
This paper presents a novel concept of context structurization.
Specifically, we transform the plain, unordered contextual sentences into well-ordered and hierarchically structurized elements.
arXiv Detail & Related papers (2024-07-23T12:33:58Z)
- Struct-X: Enhancing Large Language Models Reasoning with Structured Data [38.558614152006975]
Struct-X operates through five key phases: "read-model-fill-reflect-reason".
It encodes structured data into a topological space using graph embeddings.
It fills in missing entity information with knowledge retrieval modules.
The final phase involves constructing a topological network with selected tokens.
arXiv Detail & Related papers (2024-07-17T13:06:25Z)
- SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [93.94454894142413]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP).
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z)
- Identifying and Analyzing Task-Encoding Tokens in Large Language Models [55.03191279766383]
In this paper, we identify and analyze task-encoding tokens on whose representations the task performance depends.
We show that template and stopword tokens are the most prone to be task-encoding.
Our work sheds light on how large language models (LLMs) learn to perform a task from demonstrations, deepens our understanding of the varied roles different types of tokens play in LLMs, and provides insights for avoiding instability from improperly utilizing task-encoding tokens.
arXiv Detail & Related papers (2024-01-20T20:55:21Z)
- Instruction Position Matters in Sequence Generation with Large Language Models [67.87516654892343]
Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization.
We propose enhancing the instruction-following capability of LLMs by shifting the position of task instructions after the input sentences.
arXiv Detail & Related papers (2023-08-23T12:36:57Z)
- Prompting Language Models for Linguistic Structure [73.11488464916668]
We present a structured prompting approach for linguistic structured prediction tasks.
We evaluate this approach on part-of-speech tagging, named entity recognition, and sentence chunking.
We find that while PLMs contain significant prior knowledge of task labels due to task leakage into the pretraining corpus, structured prompting can also retrieve linguistic structure with arbitrary labels.
arXiv Detail & Related papers (2022-11-15T01:13:39Z)
- Autoregressive Structured Prediction with Language Models [73.11519625765301]
We describe an approach to model structures as sequences of actions in an autoregressive manner with PLMs.
Our approach achieves a new state of the art on all the structured prediction tasks we evaluated.
arXiv Detail & Related papers (2022-10-26T13:27:26Z)
- Knowledgeable Salient Span Mask for Enhancing Language Models as Knowledge Base [51.55027623439027]
We develop two solutions to help the model learn more knowledge from unstructured text in a fully self-supervised manner.
To the best of our knowledge, we are the first to explore fully self-supervised learning of knowledge in continual pre-training.
arXiv Detail & Related papers (2022-04-17T12:33:34Z)
- Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information [29.633735942273997]
XRayEmb is a method for retrofitting existing token-based models with character-level information.
We show that incorporating XRayEmb's learned vectors into sequences of pre-trained token embeddings helps performance on both autoregressive and masked pre-trained transformer architectures.
arXiv Detail & Related papers (2021-08-01T08:09:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.