Related papers: KunlunBaize: LLM with Multi-Scale Convolution and Multi-Token Prediction Under TransformerX Framework

URL: http://arxiv.org/abs/2503.04784v3
Date: Thu, 20 Mar 2025 03:04:01 GMT
Title: KunlunBaize: LLM with Multi-Scale Convolution and Multi-Token Prediction Under TransformerX Framework
Authors: Cheng Li, Jiexiong Liu, Yixuan Chen, Yanqin Jia, Zhepeng Li,
Abstract summary: Large language models face challenges such as low computational efficiency, gradient vanishing, and difficulties in capturing complex feature interactions. The proposed framework incorporates a learnable dense residual skip connection mechanism, a TransformerX module (a transformer-based component integrating multi-scale convolution and adaptive activation functions), and a multi-token prediction interaction module.
Score: 3.5887147125977457
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models have demonstrated remarkable performance across various tasks, yet they face challenges such as low computational efficiency, gradient vanishing, and difficulties in capturing complex feature interactions. To address these limitations, a novel framework has been proposed. This framework incorporates a learnable dense residual skip connection mechanism, a TransformerX module (a transformer-based component integrating multi-scale convolution and adaptive activation functions), and a multi-token prediction interaction module. The learnable dense residual connections enhance information flow and feature capture across layers. Within the TransformerX module, large convolutional kernels aggregate semantic information from extensive text segments, while smaller convolutions focus on local word order and syntactic structures. The adaptive activation function dynamically adjusts its parameters based on the semantic features of the input text, improving the model's ability to handle diverse semantic expressions and complex relationships. The multi-token prediction module boosts data utilization and accelerates inference by predicting multiple future tokens. These components significantly enhance the performance and efficiency of large language models.
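The two structural ideas named in the abstract, multi-scale convolution and learnable dense residual connections, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`conv1d_same`, `multi_scale_conv`, `dense_residual`), the use of one scalar feature per token, and the fixed mixing weights are all assumptions made for illustration. In a real model the kernels and weights would be learned parameters, and the sketch omits attention, the adaptive activation, and multi-token prediction.

```python
from typing import List, Sequence


def conv1d_same(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Zero-padded 1D cross-correlation that keeps the input length.

    Only odd kernel lengths are supported so the window is centered.
    """
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("kernel length must be odd")
    half = k // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            pos = i + j - half
            if 0 <= pos < len(x):  # positions outside the sequence count as zero
                acc += w * x[pos]
        out.append(acc)
    return out


def multi_scale_conv(
    x: Sequence[float],
    kernels: Sequence[Sequence[float]],
    mix: Sequence[float],
) -> List[float]:
    """Weighted sum of convolution branches with different kernel widths.

    A short kernel sees only neighboring tokens; a long kernel aggregates
    over a wider span. `mix` holds one (hypothetical) weight per branch.
    """
    if len(kernels) != len(mix):
        raise ValueError("need one mixing weight per kernel")
    branches = [conv1d_same(x, k) for k in kernels]
    return [sum(m * b[i] for m, b in zip(mix, branches)) for i in range(len(x))]


def dense_residual(
    layer_outputs: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Dense skip connection: weighted sum over all earlier layer outputs.

    `weights` stands in for the learnable per-layer coefficients.
    """
    if len(layer_outputs) != len(weights):
        raise ValueError("need one weight per layer output")
    n = len(layer_outputs[0])
    return [sum(w * h[i] for w, h in zip(weights, layer_outputs)) for i in range(n)]


if __name__ == "__main__":
    tokens = [1.0, 2.0, 3.0, 4.0, 5.0]   # one scalar feature per token
    small = [1 / 3] * 3                  # local window of 3 tokens
    large = [1 / 5] * 5                  # wider window of 5 tokens
    mixed = multi_scale_conv(tokens, [small, large], [0.5, 0.5])
    combined = dense_residual([tokens, mixed], [0.3, 0.7])
    print(mixed)
    print(combined)
```

The design point the sketch tries to show is that each branch sees the same input at a different receptive field, and the dense connection lets a later layer draw on every earlier output rather than only the immediately preceding one.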

Related papers

Explicit Grammar Semantic Feature Fusion for Robust Text Classification [0.0]
Natural Language Processing enables computers to understand human language by analysing and classifying text efficiently. Existing models capture features by learning from large corpora with transformer models, which are computationally intensive and unsuitable for resource-constrained environments. Our proposed study incorporates comprehensive grammatical rules alongside semantic information to build a robust, lightweight classification model.
arXiv Detail & Related papers (2026-02-24T10:25:29Z)
SEED: A Structural Encoder for Embedding-Driven Decoding in Time Series Prediction with LLMs [3.036179638516407]
We introduce SEED, a structural encoder for embedding-driven decoding, which integrates four stages, including a token-aware encoder for patch extraction, a projection module that aligns patches with language model embeddings, and a semantic reprogramming mechanism that maps patches to task-aware prototypes. This modular architecture decouples representation learning from inference, enabling efficient alignment between numerical patterns and semantic reasoning.
arXiv Detail & Related papers (2025-06-25T06:40:14Z)
ContextFormer: Redefining Efficiency in Semantic Segmentation [48.81126061219231]
Convolutional methods, although capturing local dependencies well, struggle with long-range relationships. Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands. We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation.
arXiv Detail & Related papers (2025-01-31T16:11:04Z)
Multimodal Latent Language Modeling with Next-Token Diffusion [111.93906046452125]
Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). We propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers.
arXiv Detail & Related papers (2024-12-11T18:57:32Z)
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models. Our framework surpasses existing methods by an average accuracy of 0.77% on the ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z)
Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages and both have been widely used for dense prediction in multi-task learning (MTL). We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction.
arXiv Detail & Related papers (2023-08-10T17:37:49Z)
Efficient Transformers with Dynamic Token Pooling [11.28381882347617]
We equip language models with a dynamic-pooling mechanism, which predicts segment boundaries in an autoregressive fashion. Results demonstrate that dynamic pooling, which jointly segments and models language, is both faster and more accurate than vanilla Transformers.
arXiv Detail & Related papers (2022-11-17T18:39:23Z)
Pre-Training a Graph Recurrent Network for Language Representation [34.4554387894105]
We consider a graph recurrent network for language model pre-training, which builds a graph structure for each sequence with local token-level communications. We find that our model can generate more diverse outputs with less contextualized feature redundancy than existing attention-based models.
arXiv Detail & Related papers (2022-09-08T14:12:15Z)
Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks [88.77951448313486]
We present a new approach for model acceleration by exploiting spatial sparsity in visual data. We propose a dynamic token sparsification framework to prune redundant tokens. We extend our method to hierarchical models including CNNs and hierarchical vision Transformers.
arXiv Detail & Related papers (2022-07-04T17:00:51Z)
GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture. We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions. We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning. VAEs tend to ignore latent variables with a strong auto-regressive decoder. We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.