ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation
- URL: http://arxiv.org/abs/2601.21420v1
- Date: Thu, 29 Jan 2026 08:58:22 GMT
- Title: ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation
- Authors: Zihao Huang, Jundong Zhou, Xingwei Qu, Qiyang Min, Ge Zhang,
- Abstract summary: ConceptMoE dynamically merges semantically similar tokens into concept representations. A learnable chunk module identifies optimal boundaries by measuring inter-token similarity. ConceptMoE consistently outperforms standard MoE across language and vision-language tasks.
- Score: 12.503747711792679
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models allocate uniform computation across all tokens, ignoring that some sequences are trivially predictable while others require deep reasoning. We introduce ConceptMoE, which dynamically merges semantically similar tokens into concept representations, performing implicit token-level compute allocation. A learnable chunk module identifies optimal boundaries by measuring inter-token similarity, compressing sequences by a target ratio $R$ before they enter the compute-intensive concept model. Crucially, the MoE architecture enables controlled evaluation: we reallocate saved computation to match baseline activated FLOPs (excluding attention map computation) and total parameters, isolating genuine architectural benefits. Under these conditions, ConceptMoE consistently outperforms standard MoE across language and vision-language tasks, achieving +0.9 points on language pretraining, +2.3 points on long context understanding, and +0.6 points on multimodal benchmarks. When converting pretrained MoE during continual training with layer looping, gains reach +5.5 points, demonstrating practical applicability. Beyond performance, ConceptMoE reduces attention computation by up to $R^2\times$ and KV cache by $R\times$. At $R=2$, empirical measurements show prefill speedups reaching 175\% and decoding speedups up to 117\% on long sequences. The minimal architectural modifications enable straightforward integration into existing MoE, demonstrating that adaptive concept-level processing fundamentally improves both effectiveness and efficiency of large language models.
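To make the compression step concrete, the following is a minimal, hypothetical sketch of similarity-based chunking followed by mean-pooled merging, assuming token representations in a PyTorch tensor. The paper's chunk module is learnable; this stand-in simply places boundaries at the least-similar adjacent-token positions to hit the target ratio $R$, and the function name `chunk_and_merge` is our own.

```python
import torch
import torch.nn.functional as F

def chunk_and_merge(hidden, ratio=2):
    """Heuristic sketch of token-to-concept compression.

    hidden: (seq_len, d) token representations.
    ratio:  target compression ratio R; the output has ~seq_len / R concepts.
    Boundaries are placed where adjacent-token cosine similarity is lowest,
    and each resulting chunk is mean-pooled into one concept vector.
    (The paper's chunk module is trained; this is an illustrative stand-in.)
    """
    seq_len, d = hidden.shape
    num_concepts = max(1, seq_len // ratio)

    # Similarity between each token and its predecessor; low similarity
    # suggests a semantic break, i.e. a good chunk boundary.
    sim = F.cosine_similarity(hidden[1:], hidden[:-1], dim=-1)      # (seq_len - 1,)

    # Keep the (num_concepts - 1) least-similar positions as boundaries.
    num_boundaries = num_concepts - 1
    boundary_pos = torch.sort(torch.topk(-sim, num_boundaries).indices).values + 1

    # Mean-pool each chunk into a single concept representation.
    starts = torch.cat([torch.tensor([0]), boundary_pos])
    ends = torch.cat([boundary_pos, torch.tensor([seq_len])])
    concepts = torch.stack([hidden[s:e].mean(dim=0) for s, e in zip(starts, ends)])
    return concepts                                                  # (~seq_len/R, d)

# Example: 16 tokens compressed at R=2 -> 8 concepts. Attention over concepts
# then costs roughly 1/R^2 of the original, and the KV cache shrinks by ~R.
tokens = torch.randn(16, 64)
print(chunk_and_merge(tokens, ratio=2).shape)   # torch.Size([8, 64])
```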
Related papers
- AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth [23.442686851761298]
AdaPonderLM is a self-supervised recurrent language model that learns token-wise early exiting during pretraining. AdaPonderLM reduces inference compute by about 10% while maintaining comparable language modeling perplexity and competitive downstream accuracy.
arXiv Detail & Related papers (2026-03-02T14:28:16Z) - Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space [56.37266873329401]
Large Language Models (LLMs) apply uniform computation to all tokens, despite language exhibiting highly non-uniform information density. We propose $\textbf{Dynamic Large Concept Models (DLCM)}$, a hierarchical language modeling framework that learns semantic boundaries from latent representations and shifts from tokens to a compressed concept space where reasoning is more efficient.
arXiv Detail & Related papers (2025-12-31T04:19:33Z) - Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation [61.67090981767583]
We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking. We also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to further decrease memory footprint.
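As an illustration of the recursion-with-routing idea described above (not the authors' implementation), the sketch below reuses one shared Transformer layer for up to a fixed number of steps and lets a hypothetical linear router pick each token's recursion depth. A real implementation would gather only the still-active tokens so that shallow tokens actually save FLOPs; here inactive tokens simply keep their state.

```python
import torch
import torch.nn as nn

class RecursiveBlockSketch(nn.Module):
    """Illustrative sketch of Mixture-of-Recursions-style adaptive depth:
    one shared block is reused up to `max_rec` times, and a lightweight
    router decides per token how many recursion steps it receives."""

    def __init__(self, d_model=64, max_rec=3):
        super().__init__()
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=4 * d_model, batch_first=True)
        self.router = nn.Linear(d_model, max_rec)    # scores for 1..max_rec steps
        self.max_rec = max_rec

    def forward(self, x):                             # x: (batch, seq, d_model)
        # Each token is assigned a recursion depth by the router
        # (argmax routing here; a trained router would be differentiable).
        depth = self.router(x).argmax(dim=-1) + 1     # (batch, seq), in 1..max_rec
        h = x
        for step in range(1, self.max_rec + 1):
            refined = self.shared_block(h)            # parameters shared across steps
            active = (depth >= step).unsqueeze(-1)    # tokens still being refined
            h = torch.where(active, refined, h)       # finished tokens keep their state
        return h

x = torch.randn(2, 10, 64)
print(RecursiveBlockSketch()(x).shape)                # torch.Size([2, 10, 64])
```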
arXiv Detail & Related papers (2025-07-14T17:49:00Z) - Saliency-driven Dynamic Token Pruning for Large Language Models [32.903622070917194]
We propose Saliency-driven Dynamic Token Pruning (SDTP). A lightweight saliency-driven prediction module is designed to estimate the importance score of each token from its hidden state. A ranking-based optimization strategy is proposed to minimize the ranking divergence between the saliency scores and the predicted importance scores.
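A rough sketch of the pruning mechanism described above, under the assumption that importance is predicted by a small MLP over hidden states; the class and function names here are hypothetical, and the paper's ranking-based training loss is not shown.

```python
import torch
import torch.nn as nn

class ImportancePredictorSketch(nn.Module):
    """Toy stand-in for a saliency-driven prediction module: a small MLP
    maps each token's hidden state to a scalar importance score."""
    def __init__(self, d_model=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model // 2),
                                 nn.ReLU(),
                                 nn.Linear(d_model // 2, 1))

    def forward(self, hidden):                      # (batch, seq, d_model)
        return self.mlp(hidden).squeeze(-1)         # (batch, seq) importance scores

def prune_tokens(hidden, scores, keep_ratio=0.5):
    """Keep the highest-scoring tokens, preserving their original order."""
    keep = max(1, int(hidden.size(1) * keep_ratio))
    idx = scores.topk(keep, dim=1).indices.sort(dim=1).values       # (batch, keep)
    return torch.gather(hidden, 1, idx.unsqueeze(-1).expand(-1, -1, hidden.size(-1)))

hidden = torch.randn(2, 12, 64)
scores = ImportancePredictorSketch()(hidden)
print(prune_tokens(hidden, scores, keep_ratio=0.5).shape)   # torch.Size([2, 6, 64])
```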
arXiv Detail & Related papers (2025-04-06T15:15:07Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes. Our results demonstrate up to a 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles [23.134664392314264]
Tokenization is associated with many poorly understood shortcomings in language models (LMs). This work studies how tokenization impacts model performance by analyzing and comparing models with their byte-level counterparts. We introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution.
arXiv Detail & Related papers (2024-10-11T23:30:42Z) - Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the context-length limitations of pre-trained large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, keeping memory usage efficient.
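A toy sketch of the hierarchical merging idea, assuming norm-based token reduction as a stand-in for HOMER's actual reduction technique; the function name and scoring rule are hypothetical.

```python
import torch

def homer_style_merge(hidden, chunk_size=8, keep_ratio=0.5):
    """Illustrative sketch of hierarchical context merging: split a long
    sequence into chunks, drop the lowest-norm tokens in each pair of
    neighbours (a stand-in for the paper's token-reduction step), then
    concatenate and repeat until one chunk remains."""
    chunks = list(hidden.split(chunk_size, dim=0))
    while len(chunks) > 1:
        merged = []
        # Merge neighbouring chunks pairwise, reducing tokens before each merge.
        for i in range(0, len(chunks) - 1, 2):
            pair = torch.cat([chunks[i], chunks[i + 1]], dim=0)
            keep = max(1, int(pair.size(0) * keep_ratio))
            idx = pair.norm(dim=-1).topk(keep).indices.sort().values
            merged.append(pair[idx])
        if len(chunks) % 2 == 1:            # carry an unpaired trailing chunk
            merged.append(chunks[-1])
        chunks = merged
    return chunks[0]

hidden = torch.randn(64, 32)                 # 64 tokens, dim 32
print(homer_style_merge(hidden).shape)       # stays ~chunk_size regardless of input length
```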
arXiv Detail & Related papers (2024-04-16T06:34:08Z) - Mixture-of-Depths: Dynamically allocating compute in transformer-based language models [8.774705201394916]
Transformer-based language models spread FLOPs uniformly across input sequences.
We show that transformers can learn to dynamically allocate FLOPs to specific positions in a sequence.
arXiv Detail & Related papers (2024-04-02T19:28:11Z) - AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation [104.0979785739202]
Mixture-of-Expert (MoE) models have obtained state-of-the-art performance in Neural Machine Translation (NMT) tasks.
Existing MoE models mostly consider a homogeneous design where the same number of experts of the same size are placed uniformly throughout the network.
We develop AutoMoE -- a framework for designing heterogeneous MoEs under computational constraints.
arXiv Detail & Related papers (2022-10-14T05:32:17Z) - Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $\times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)