Zero Token-Driven Deep Thinking in LLMs: Unlocking the Full Potential of Existing Parameters via Cyclic Refinement
- URL: http://arxiv.org/abs/2502.12214v1
- Date: Mon, 17 Feb 2025 04:37:22 GMT
- Title: Zero Token-Driven Deep Thinking in LLMs: Unlocking the Full Potential of Existing Parameters via Cyclic Refinement
- Authors: Guanghao Li, Wenhao Jiang, Li Shen, Ming Tang, Chun Yuan
- Abstract summary: We propose the Zero Token Transformer (ZTT), which features a head-tail decoupled parameter cycling method. We disentangle the first (head) and last (tail) layers from parameter cycling and iteratively refine only the intermediate layers. Our approach achieves superior performance under tight parameter budgets, effectively reduces computational overhead via early exits, and can be readily applied to fine-tune existing pre-trained models.
- Score: 43.548042892597536
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Resource limitations often constrain the parameter counts of Large Language Models (LLMs), hindering their performance. While existing methods employ parameter sharing to reuse the same parameter set under fixed budgets, such approaches typically force each layer to assume multiple roles with a predetermined number of iterations, restricting efficiency and adaptability. In this work, we propose the Zero Token Transformer (ZTT), which features a head-tail decoupled parameter cycling method. We disentangle the first (head) and last (tail) layers from parameter cycling and iteratively refine only the intermediate layers. Furthermore, we introduce a Zero-Token Mechanism, an internal architectural component rather than an input token, to guide layer-specific computation. At each cycle, the model retrieves a zero token (with trainable key values) from a Zero-Token Pool, integrating it alongside regular tokens in the attention mechanism. The corresponding attention scores not only reflect each layer's computational importance but also enable dynamic early exits without sacrificing overall model accuracy. Our approach achieves superior performance under tight parameter budgets, effectively reduces computational overhead via early exits, and can be readily applied to fine-tune existing pre-trained models for enhanced efficiency and adaptability.
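A minimal PyTorch sketch of the ideas described in the abstract, assuming a single shared middle block, a small pool of trainable zero tokens (one per cycle), and a norm-based proxy for the attention-score early-exit signal. All class names, dimensions, and thresholds here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ZeroTokenPool(nn.Module):
    """Trainable zero-token embeddings, one per refinement cycle (illustrative)."""
    def __init__(self, n_cycles: int, d_model: int):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_cycles, d_model) * 0.02)

    def forward(self, cycle: int, batch: int):
        # Retrieve the zero token for this cycle and broadcast it over the batch.
        return self.tokens[cycle].expand(batch, 1, -1)


class ZTTSketch(nn.Module):
    """Head and tail layers stay outside the cycle; one shared middle block is reused."""
    def __init__(self, d_model=512, n_heads=8, n_cycles=4, exit_threshold=0.5):
        super().__init__()
        self.head = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.shared_mid = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.tail = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.pool = ZeroTokenPool(n_cycles, d_model)
        self.n_cycles = n_cycles
        self.exit_threshold = exit_threshold

    def forward(self, x):                               # x: (batch, seq, d_model)
        x = self.head(x)                                # head: not parameter-cycled
        for c in range(self.n_cycles):
            zero_tok = self.pool(c, x.size(0))          # zero token for this cycle
            h = self.shared_mid(torch.cat([zero_tok, x], dim=1))
            zero_out, x = h[:, :1], h[:, 1:]            # split zero token from real tokens
            # The paper reads the early-exit signal off the zero token's attention scores;
            # as a stand-in, this sketch thresholds the zero token's output norm.
            if zero_out.norm(dim=-1).mean() < self.exit_threshold:
                break                                   # dynamic early exit
        return self.tail(x)                             # tail: not parameter-cycled
```

Only the shared middle block is traversed repeatedly, so the parameter budget stays close to a three-layer model while the effective depth grows with the number of cycles.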
Related papers
- PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning [54.99373314906667]
Self-supervised representation learning for point cloud has demonstrated effectiveness in improving pre-trained model performance across diverse tasks.
As pre-trained models grow in complexity, fully fine-tuning them for downstream applications demands substantial computational and storage resources.
We propose PointLoRA, a simple yet effective method that combines low-rank adaptation (LoRA) with multi-scale token selection to efficiently fine-tune point cloud models.
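PointLoRA's multi-scale token-selection module is not detailed in this summary; the sketch below only shows the generic low-rank adaptation (LoRA) layer it builds on, with the rank, scaling, and class name chosen arbitrarily.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen dense layer plus a trainable low-rank update: W + (alpha / r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pre-trained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Only A and B receive gradients, so the trainable count per adapted layer is r * (in_features + out_features).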
arXiv Detail & Related papers (2025-04-22T16:41:21Z)
- GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning [13.043393441785392]
We introduce a novel inference framework for fine-tuning adapters of on-device Large Language Models (LLMs).
At its core is the Group-Shared Exponents format, which efficiently represents model parameters in integer format using exponents shared within parameter groups.
We demonstrate that our approach achieves accuracy comparable to BF16-based fine-tuning while reducing memory usage by 1.85x.
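The exact Group-Shared Exponents encoding is not given in this summary; the snippet below is a rough sketch of the general idea — each group of parameters shares one power-of-two exponent and stores small integer mantissas — with the group size and bit width as assumptions.

```python
import torch

def group_shared_exponent_quantize(w: torch.Tensor, group_size: int = 64, bits: int = 4):
    """Quantize a weight tensor: each group of `group_size` values shares one
    power-of-two scale (the 'exponent'); values are stored as small integers."""
    flat = w.flatten()
    pad = (-flat.numel()) % group_size
    flat = torch.cat([flat, flat.new_zeros(pad)])          # pad to whole groups
    groups = flat.view(-1, group_size)
    max_abs = groups.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    qmax = 2 ** (bits - 1) - 1
    exponents = torch.ceil(torch.log2(max_abs / qmax))     # one shared exponent per group
    scale = 2.0 ** exponents
    ints = torch.clamp(torch.round(groups / scale), -qmax - 1, qmax).to(torch.int8)
    return ints, exponents

def group_shared_exponent_dequantize(ints, exponents, shape, numel):
    """Reconstruct an approximate float tensor from integer mantissas and shared exponents."""
    return (ints.float() * (2.0 ** exponents)).flatten()[:numel].view(shape)
```

Storing low-bit integer mantissas plus a single exponent per group is what yields the kind of memory reduction quoted above.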
arXiv Detail & Related papers (2025-02-18T14:54:55Z)
- Recurrent Diffusion for Large-Scale Parameter Generation [52.98888368644455]
We introduce Recurrent Diffusion for Large-Scale Parameter Generation (RPG), a novel framework that generates full sets of neural network parameters, up to hundreds of millions of weights, on a single GPU. RPG serves as a critical advance in AI generating AI, potentially enabling efficient weight generation at scales previously deemed infeasible.
arXiv Detail & Related papers (2025-01-20T16:46:26Z)
- ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by the Kronecker product to Aggregate Low Rank Experts. Thanks to this artful design, ALoRE introduces negligible extra parameters and can be effortlessly merged into the frozen backbone.
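ALoRE's full expert-aggregation design is not spelled out in this summary; the sketch below only illustrates the basic ingredient it names, a parameter-efficient weight update built from the Kronecker product of a small shared factor and a low-rank term. All names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class KroneckerAdapter(nn.Module):
    """Weight update kron(S, U @ V): a small shared factor S Kronecker-multiplied with a
    low-rank expert (illustrative of Kronecker-style PETL updates, not ALoRE itself)."""
    def __init__(self, d_in: int, d_out: int, n: int = 4, r: int = 4):
        super().__init__()
        assert d_in % n == 0 and d_out % n == 0
        self.S = nn.Parameter(torch.randn(n, n) * 0.01)          # shared hypercomplex factor
        self.U = nn.Parameter(torch.randn(d_out // n, r) * 0.01)  # low-rank expert factors
        self.V = nn.Parameter(torch.zeros(r, d_in // n))

    def delta_weight(self):
        # Full (d_out, d_in) update assembled from far fewer trainable parameters.
        return torch.kron(self.S, self.U @ self.V)

    def forward(self, x, base: nn.Linear):
        return base(x) + x @ self.delta_weight().T
```

Because the update is an explicit weight matrix, it can be added into the frozen backbone once training ends, leaving no extra inference cost.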
arXiv Detail & Related papers (2024-12-11T12:31:30Z)
- TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters [102.1116808722299]
We introduce TokenFormer, a scalable Transformer architecture.
By treating model parameters as tokens, we replace all the linear projections in Transformers with token-parameter attention layers.
Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs.
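A minimal sketch of the parameters-as-tokens idea, assuming input tokens act as queries over a learnable set of key/value "parameter tokens" and that plain softmax attention is used; TokenFormer's actual attention layer and its normalization details may differ.

```python
import torch
import torch.nn as nn

class TokenParameterAttention(nn.Module):
    """Replace y = W x with attention of input tokens (queries) over learnable
    parameter tokens (keys/values); capacity grows by appending more key/value pairs."""
    def __init__(self, d_in: int, d_out: int, n_param_tokens: int = 256):
        super().__init__()
        self.param_keys = nn.Parameter(torch.randn(n_param_tokens, d_in) * 0.02)
        self.param_values = nn.Parameter(torch.randn(n_param_tokens, d_out) * 0.02)

    def forward(self, x):                                   # x: (batch, seq, d_in)
        scores = x @ self.param_keys.T / (x.size(-1) ** 0.5)
        return torch.softmax(scores, dim=-1) @ self.param_values   # (batch, seq, d_out)

    @torch.no_grad()
    def grow(self, extra: int):
        """Append new, zero-initialized key/value parameter pairs; existing pairs are kept."""
        self.param_keys = nn.Parameter(torch.cat(
            [self.param_keys, self.param_keys.new_zeros(extra, self.param_keys.size(1))]))
        self.param_values = nn.Parameter(torch.cat(
            [self.param_values, self.param_values.new_zeros(extra, self.param_values.size(1))]))
```

Scaling then means calling grow() and continuing training, rather than re-instantiating a wider model.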
arXiv Detail & Related papers (2024-10-30T16:19:00Z)
- Enabling Efficient On-Device Fine-Tuning of LLMs Using Only Inference Engines [17.539008562641303]
Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers.
The next frontier is LLM personalization, where a foundation model can be fine-tuned with user- or task-specific data.
Fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands.
arXiv Detail & Related papers (2024-09-23T20:14:09Z)
- Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
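SIFT's precise selection rule is not described in this summary; the snippet below sketches the general gradient-based sparse fine-tuning idea — keep only the largest-magnitude gradient entries each step and zero out the rest — with the keep ratio as an arbitrary assumption.

```python
import torch

def sparse_gradient_step(model, optimizer, keep_ratio: float = 0.01):
    """After backward(), mask each parameter's gradient to its top-|g| entries,
    so only a small, sparse increment of the weights is updated."""
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad.flatten()
        k = max(1, int(keep_ratio * g.numel()))
        thresh = g.abs().topk(k).values[-1]                  # k-th largest |gradient|
        p.grad.mul_((p.grad.abs() >= thresh).to(p.grad.dtype))
    optimizer.step()
    optimizer.zero_grad()
```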
arXiv Detail & Related papers (2023-12-19T06:06:30Z)
- Parameter-Efficient Fine-Tuning without Introducing New Latency [7.631596468553607]
We introduce a novel adapter technique that directly applies the adapter to pre-trained parameters instead of the hidden representation.
Our proposed method attains a new state-of-the-art result in both performance and storage efficiency, storing only 0.03% of the parameters of full fine-tuning.
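The key point of this summary is that the adapter acts on the pre-trained weights rather than on hidden activations, so it can be merged away before inference. Below is a generic sketch of that weight-space re-parameterization, not the paper's exact adapter; the bottleneck size and GELU activation are assumptions.

```python
import torch
import torch.nn as nn

class ParameterAdapter(nn.Module):
    """Adapter applied to the frozen weight matrix itself (W' = W + up(act(down(W)))),
    so the adapted weight can be merged once and inference adds no new latency."""
    def __init__(self, base: nn.Linear, bottleneck: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad = False
        self.down = nn.Linear(base.in_features, bottleneck, bias=False)
        self.up = nn.Linear(bottleneck, base.in_features, bias=False)
        nn.init.zeros_(self.up.weight)                  # start as an identity update

    def adapted_weight(self):
        w = self.base.weight                            # (d_out, d_in): adapt each row of W
        return w + self.up(nn.functional.gelu(self.down(w)))

    def forward(self, x):
        return nn.functional.linear(x, self.adapted_weight(), self.base.bias)

    @torch.no_grad()
    def merge(self):
        """Fold the learned update into the base weight; afterwards inference uses
        the plain Linear with no adapter in the forward path."""
        self.base.weight.copy_(self.adapted_weight())
```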
arXiv Detail & Related papers (2023-05-26T08:44:42Z)
- Parameter-efficient Tuning of Large-scale Multimodal Foundation Model [68.24510810095802]
We propose Aurora, a graceful prompt framework for cross-modal transfer, to overcome these challenges.
Considering the redundancy in existing architectures, we first use mode approximation to generate 0.1M trainable parameters to implement the multimodal prompt tuning.
A thorough evaluation on six cross-modal benchmarks shows that it not only outperforms the state-of-the-art but even outperforms the full fine-tuning approach.
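"Mode approximation" is not defined in this summary; the sketch below shows one common reading — a CP-style factorization in which a weight update is assembled from a few per-mode vectors so the trainable budget stays tiny — and should be read purely as an assumption, not Aurora's actual module.

```python
import torch
import torch.nn as nn

class ModeApproxUpdate(nn.Module):
    """CP-style 'mode approximation': a (d_out x d_in) update is assembled from
    rank-r per-mode vectors, so trainable parameters scale as r * (d_out + d_in + 1)."""
    def __init__(self, d_out: int, d_in: int, rank: int = 4):
        super().__init__()
        self.u = nn.Parameter(torch.randn(rank, d_out) * 0.01)   # output-mode factors
        self.v = nn.Parameter(torch.randn(rank, d_in) * 0.01)    # input-mode factors
        self.lam = nn.Parameter(torch.zeros(rank))                # per-component weights

    def delta(self):
        # sum_r lam_r * outer(u_r, v_r)
        return torch.einsum("r,ro,ri->oi", self.lam, self.u, self.v)

    def forward(self, x, base: nn.Linear):
        return base(x) + x @ self.delta().T
```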
arXiv Detail & Related papers (2023-05-15T06:40:56Z)
- Parameter-Efficient Tuning with Special Token Adaptation [25.37998979962568]
PASTA achieves comparable performance to fine-tuning in natural language understanding tasks.
Our work demonstrates the pivotal role of special tokens in pretrained language models.
arXiv Detail & Related papers (2022-10-10T01:02:51Z)
- AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of Large-Scale Pre-Trained Language Models [19.640997611256168]
We propose AlphaTuning, consisting of post-training quantization of the pre-trained language model and fine-tuning only some parts of quantized parameters for a target task.
Specifically, AlphaTuning works by employing binary-coding quantization, which factorizes the full-precision parameters into binary parameters and a separate set of scaling factors.
We demonstrate that AlphaTuning, when applied to GPT-2 and OPT, performs competitively with full fine-tuning on a variety of downstream tasks while achieving >10x compression ratio under 4-bit quantization and >1,000x reduction in the number of trainable parameters.
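A rough sketch of the binary-coding quantization idea mentioned above: weights are factorized into {-1, +1} binary codes plus scaling factors, and only the scaling factors are trained. The single-bit code and per-row scaling used here are simplifying assumptions; AlphaTuning itself uses multi-bit codes under 4-bit quantization.

```python
import torch
import torch.nn as nn

class BCQLinear(nn.Module):
    """1-bit binary-coding quantization: W ≈ diag(alpha) @ codes, codes in {-1, +1}.
    The binary codes are frozen; only the scaling factors alpha are fine-tuned."""
    def __init__(self, base: nn.Linear):
        super().__init__()
        w = base.weight.data                                   # (d_out, d_in)
        # Frozen binary codes: the sign pattern of the pre-trained weights.
        self.register_buffer("codes",
                             torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w)))
        # Per-row scale; mean |w| is the least-squares optimal scale for one binary code.
        self.alpha = nn.Parameter(w.abs().mean(dim=1))
        self.bias = nn.Parameter(base.bias.data.clone()) if base.bias is not None else None

    def forward(self, x):
        w_hat = self.alpha.unsqueeze(1) * self.codes           # reconstruct quantized weight
        return nn.functional.linear(x, w_hat, self.bias)
```

Training only alpha (one value per output row here) is what drives the very large reduction in trainable parameters reported above.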
arXiv Detail & Related papers (2022-10-08T00:36:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.