Related papers: JTok: On Token Embedding as another Axis of Scaling Law via Joint Token Self-modulation

URL: http://arxiv.org/abs/2602.00800v1
Date: Sat, 31 Jan 2026 16:15:18 GMT
Title: JTok: On Token Embedding as another Axis of Scaling Law via Joint Token Self-modulation
Authors: Yebin Yang, Huaijin Wu, Fu Guo, Lin Yao, Xiaohan Qin, Jingzhi Wang, Debing Zhang, Junchi Yan
Abstract summary: We introduce Joint-Token (JTok) and Mixture of Joint-Token (JTok-M), which augment Transformer layers with modulation vectors retrieved from auxiliary embedding tables. These vectors modulate the backbone via lightweight, element-wise operations, incurring negligible FLOPs overhead. Our approach consistently reduces validation loss and significantly improves downstream task performance.
Score: 46.64215658042213
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLMs have traditionally scaled along dense dimensions, where performance is coupled with near-linear increases in computational cost. While MoE decouples capacity from compute, it introduces large memory overhead and hardware efficiency challenges. To overcome these, we propose token-indexed parameters as a novel, orthogonal scaling axis that decouple model capacity from FLOPs. Specifically, we introduce Joint-Token (JTok) and Mixture of Joint-Token (JTok-M), which augment Transformer layers with modulation vectors retrieved from auxiliary embedding tables. These vectors modulate the backbone via lightweight, element-wise operations, incurring negligible FLOPs overhead. Extensive experiments on both dense and MoE backbones, spanning from 650M (190M + 460M embedding) to 61B (17B + 44B embedding) total parameters, demonstrate that our approach consistently reduces validation loss and significantly improves downstream task performance (e.g., +4.1 on MMLU, +8.3 on ARC, +8.9 on CEval). Rigorous isoFLOPs analysis further confirms that JTok-M fundamentally shifts the quality-compute Pareto frontier, achieving comparable model quality with 35% less compute relative to vanilla MoE architectures, and we validate that token-indexed parameters exhibit a predictable power-law scaling behavior. Moreover, our efficient implementation ensures that the overhead introduced by JTok and JTok-M remains marginal.
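The mechanism described in the abstract (a per-token vector looked up from an auxiliary table and applied element-wise to the hidden state) can be illustrated with a minimal sketch. This is not the authors' implementation: the `(1 + m)` multiplicative form, the function names, and the toy sizes are all assumptions made for illustration, since the abstract only specifies "lightweight, element-wise operations". The FLOP helpers use the common convention of 2 FLOPs per multiply-accumulate to show why such a lookup adds many parameters but little compute.

```python
import random


def make_table(vocab_size, d_model, seed=0):
    """Hypothetical auxiliary embedding table: one modulation vector per token id."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(vocab_size)]


def modulate(hidden, token_ids, table):
    """Scale each position's hidden vector element-wise by (1 + m).

    m is the table row for that position's token id. The (1 + m) form is an
    assumption; the abstract does not state the exact operation.
    """
    out = []
    for h, tok in zip(hidden, token_ids):
        m = table[tok]
        out.append([x * (1.0 + g) for x, g in zip(h, m)])
    return out


def table_params(vocab_size, d_model):
    """Parameters added by one table: one d_model vector per vocabulary entry."""
    return vocab_size * d_model


def modulation_flops(seq_len, d_model):
    """One add and one multiply per hidden element."""
    return 2 * seq_len * d_model


def ffn_flops(seq_len, d_model, d_ff):
    """Two matrix multiplies in a plain feed-forward block, 2 FLOPs per MAC."""
    return 2 * 2 * seq_len * d_model * d_ff


if __name__ == "__main__":
    vocab_size, d_model, d_ff = 8, 4, 16
    table = make_table(vocab_size, d_model)
    token_ids = [3, 1, 3]
    hidden = [[1.0] * d_model for _ in token_ids]
    out = modulate(hidden, token_ids, table)
    # Positions 0 and 2 share token id 3, so they receive the same modulation.
    print(out[0] == out[2])
    print(table_params(vocab_size, d_model))
    print(modulation_flops(len(token_ids), d_model),
          ffn_flops(len(token_ids), d_model, d_ff))
```

In this toy setting the table holds `vocab_size * d_model` parameters, yet the per-token cost is only a row lookup plus `2 * d_model` arithmetic operations, which is small next to the feed-forward block's `4 * d_model * d_ff`. That gap is the sense in which token-indexed parameters can decouple capacity from FLOPs.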

Related papers

Breaking the Blocks: Continuous Low-Rank Decomposed Scaling for Unified LLM Quantization and Adaptation [46.34608916687127]
Low-Rank Decomposed Scaling (LoRDS) is a unified framework that rethinks quantization granularity through this low-rank decomposition. By "breaking the blocks" of spatial constraints, LoRDS establishes a seamless efficiency lifecycle. LoRDS consistently outperforms state-of-the-art baselines across various model families in both quantization and downstream fine-tuning tasks.
arXiv Detail & Related papers (2026-01-30T08:46:02Z)
EdgeFlex-Transformer: Transformer Inference for Edge Devices [2.1130318406254074]
We propose a lightweight yet effective multi-stage optimization pipeline designed to compress and accelerate Vision Transformers (ViTs). Our methodology combines activation profiling, memory-aware pruning, selective mixed-precision execution, and activation-aware quantization (AWQ) to reduce the model's memory footprint without requiring costly retraining or task-specific fine-tuning. Experiments on CIFAR-10 demonstrate that the fully optimized model achieves a 76% reduction in peak memory usage and over 6x lower latency, while retaining or even improving accuracy compared to the original FP32 baseline.
arXiv Detail & Related papers (2025-12-17T21:45:12Z)
Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models [25.608085561102566]
We introduce Leverage Efficiency (EL), a metric quantifying the computational advantage of an MoE model over a dense equivalent. EL is driven by the expert activation ratio and the total compute budget, both following predictable power laws. We integrate these discoveries into a unified scaling law that accurately predicts the EL of an MoE architecture based on its configuration.
arXiv Detail & Related papers (2025-07-23T17:10:23Z)
EfficientLLM: Efficiency in Large Language Models [64.3537131208038]
Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale.
arXiv Detail & Related papers (2025-05-20T02:27:08Z)
QuIC: Quantum-Inspired Compound Adapters for Parameter Efficient Fine-Tuning [0.0]
Scaling full finetuning of large foundation models strains GPU memory and training time. We introduce Quantum-Inspired Compound Adapters (QuIC Adapters). QuIC adapters can effectively finetune a model using less than 0.02% memory footprint of the base model.
arXiv Detail & Related papers (2025-02-10T13:06:56Z)
SDPose: Tokenized Pose Estimation via Circulation-Guide Self-Distillation [53.675725490807615]
We introduce SDPose, a new self-distillation method for improving the performance of small transformer-based models. SDPose-T obtains 69.7% mAP with 4.4M parameters and 1.8 GFLOPs, while SDPose-S-V2 obtains 73.5% mAP on the MSCOCO validation dataset.
arXiv Detail & Related papers (2024-04-04T15:23:14Z)
AffineQuant: Affine Transformation Quantization for Large Language Models [58.45460102764]
Post-Training Quantization (PTQ) has emerged as a subject of considerable interest due to its compression efficiency and cost-effectiveness in the context of training. Existing PTQ methods for Large-scale Language Models (LLMs) limit the optimization scope to scaling transformations between pre- and post-quantization weights. In this paper, we advocate for direct optimization using equivalent affine transformations in PTQ (AffineQuant).
arXiv Detail & Related papers (2024-03-19T08:40:21Z)
Data-free Weight Compress and Denoise for Large Language Models [96.68582094536032]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices. We achieve a model pruning of 80% parameters while retaining 93.43% of the original performance without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z)
From PEFT to DEFT: Parameter Efficient Finetuning for Reducing Activation Density in Transformers [52.199303258423306]
We propose a novel density loss that encourages higher activation sparsity in pre-trained models. Our proposed method, DEFT, can consistently reduce activation density by up to 44.94% on RoBERTa$_\mathrm{Large}$ and by 53.19% (encoder density) and 90.60% (decoder density) on Flan-T5$_\mathrm{XXL}$.
arXiv Detail & Related papers (2024-02-02T21:25:46Z)
FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment. We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs. We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
arXiv Detail & Related papers (2023-08-16T23:57:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.