Semformer: Transformer Language Models with Semantic Planning
- URL: http://arxiv.org/abs/2409.11143v1
- Date: Tue, 17 Sep 2024 12:54:34 GMT
- Title: Semformer: Transformer Language Models with Semantic Planning
- Authors: Yongjing Yin, Junran Ding, Kai Song, Yue Zhang
- Abstract summary: Next-token prediction serves as the dominant component in current neural language models.
We introduce Semformer, a novel method of training a Transformer language model that explicitly models the semantic planning of the response.
- Score: 18.750863564495006
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Next-token prediction serves as the dominant component in current neural language models. During the training phase, the model employs teacher forcing, which predicts tokens based on all preceding ground truth tokens. However, this approach has been found to create shortcuts, utilizing the revealed prefix to spuriously fit future tokens, potentially compromising the accuracy of the next-token predictor. In this paper, we introduce Semformer, a novel method of training a Transformer language model that explicitly models the semantic planning of the response. Specifically, we incorporate a sequence of planning tokens into the prefix, guiding the planning token representations to predict the latent semantic representations of the response, which are induced by an autoencoder. In a minimal planning task (i.e., graph path-finding), our model exhibits near-perfect performance and effectively mitigates shortcut learning, a feat that standard training methods and baseline models have been unable to accomplish. Furthermore, we pretrain Semformer from scratch with 125M parameters, demonstrating its efficacy through measures of perplexity, in-context learning, and fine-tuning on summarization tasks.
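The recipe described in the abstract can be pictured as a standard next-token objective plus an auxiliary regression loss on a few planning positions. The PyTorch sketch below illustrates that idea only; the number of planning tokens, latent width, loss weight, and the simplified autoencoder (treated here as a frozen target encoder whose first few states stand in for the response latents) are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
# Illustrative Semformer-style objective: next-token loss + planning-token regression
# onto autoencoder-induced latents of the response. All sizes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemformerSketch(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, n_plan=4, d_latent=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.plan_tokens = nn.Parameter(torch.randn(n_plan, d_model) * 0.02)
        layer = lambda: nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.lm_trunk = nn.TransformerEncoder(layer(), num_layers=2)   # causal LM trunk
        self.lm_head = nn.Linear(d_model, vocab_size)
        # Simplified response autoencoder encoder, treated as a frozen target network.
        self.ae_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.to_latent = nn.Linear(d_model, d_latent)
        self.plan_proj = nn.Linear(d_model, d_latent)
        self.n_plan = n_plan

    def forward(self, prefix_ids, response_ids, plan_weight=1.0):
        bsz = prefix_ids.size(0)
        prefix, response = self.embed(prefix_ids), self.embed(response_ids)
        plan = self.plan_tokens.unsqueeze(0).expand(bsz, -1, -1)
        # Planning tokens are inserted between the prefix and the response.
        x = torch.cat([prefix, plan, response], dim=1)
        seq_len = x.size(1)
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.lm_trunk(x, mask=causal)
        # (1) Standard next-token loss, computed on response positions only.
        resp_start = prefix_ids.size(1) + self.n_plan
        logits = self.lm_head(h[:, resp_start - 1:-1])  # states predicting each response token
        lm_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                  response_ids.reshape(-1))
        # (2) Planning loss: planning-token states regress onto latent summaries of the
        # response; here the first n_plan encoder states stand in for those latents.
        with torch.no_grad():
            latents = self.to_latent(self.ae_encoder(response)[:, : self.n_plan])
        plan_states = self.plan_proj(h[:, prefix_ids.size(1): resp_start])
        plan_loss = F.mse_loss(plan_states, latents)
        return lm_loss + plan_weight * plan_loss


# Toy usage: batch of 2, prefix length 10, response length 12 (must be >= n_plan).
model = SemformerSketch()
loss = model(torch.randint(0, 1000, (2, 10)), torch.randint(0, 1000, (2, 12)))
loss.backward()
```

In this sketch the autoencoder only supplies training targets; generation would proceed as usual from the prefix plus the appended planning tokens, with the regression term acting purely as a training-time regularizer.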
Related papers
- Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition [5.575078692353885]
We propose a new model for multi-token prediction in transformers, aiming to enhance sampling efficiency without compromising accuracy.
By generalizing this formulation to a rank-$r$ canonical probability decomposition, we develop an improved model that predicts multiple tokens simultaneously.
arXiv Detail & Related papers (2024-10-23T11:06:36Z)
- Bridging the Training-Inference Gap in LLMs by Leveraging Self-Generated Tokens [31.568675300434816]
Language models are often trained to maximize the likelihood of the next token given past tokens in the training dataset.
During inference time, they are utilized differently, generating text sequentially and auto-regressively by using previously generated tokens as input to predict the next one.
This paper proposes two simple approaches based on the model's own generation to address this discrepancy between training and inference time.
arXiv Detail & Related papers (2024-10-18T17:48:27Z)
- TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction.
In conjunction with TokenUnify, we have assembled a large-scale, ultra-high-resolution electron microscopy (EM) image dataset.
This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
arXiv Detail & Related papers (2024-05-27T05:45:51Z)
- Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
- Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models [43.7024573212373]
We adapt prompt-based few-shot learning to ELECTRA and show that it outperforms masked language models in a wide range of tasks.
Our method can be easily adapted to tasks involving multi-token predictions without extra computation overhead.
arXiv Detail & Related papers (2022-05-30T16:32:30Z)
- Few-shot Subgoal Planning with Language Models [58.11102061150875]
We show that language priors encoded in pre-trained language models allow us to infer fine-grained subgoal sequences.
In contrast to recent methods which make strong assumptions about subgoal supervision, our experiments show that language models can infer detailed subgoal sequences without any fine-tuning.
arXiv Detail & Related papers (2022-05-28T01:03:30Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: pre-trained models succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
- Latent Representation Prediction Networks [0.0]
We find this principle of learning representations unsatisfying.
We propose a new way of jointly learning this representation along with the prediction function.
Our approach is shown to be more sample-efficient than standard reinforcement learning methods.
arXiv Detail & Related papers (2020-09-20T14:26:03Z)
- Train No Evil: Selective Masking for Task-Guided Pre-Training [97.03615486457065]
We propose a three-stage framework by adding a task-guided pre-training stage with selective masking between general pre-training and fine-tuning.
We show that our method can achieve comparable or even better performance with less than 50% of the cost.
arXiv Detail & Related papers (2020-04-21T03:14:22Z)
- Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm which directly optimizes the model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.