Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space
- URL: http://arxiv.org/abs/2512.24617v2
- Date: Mon, 05 Jan 2026 05:44:29 GMT
- Title: Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space
- Authors: Xingwei Qu, Shaowen Wang, Zihao Huang, Kai Hua, Fan Yin, Rui-Jie Zhu, Jundong Zhou, Qiyang Min, Zihao Wang, Yizhi Li, Tianyu Zhang, He Xing, Zheng Zhang, Yuxuan Song, Tianyu Zheng, Zhiyuan Zeng, Chenghua Lin, Ge Zhang, Wenhao Huang
- Abstract summary: Large Language Models (LLMs) apply uniform computation to all tokens, despite language exhibiting highly non-uniform information density. We propose $\textbf{Dynamic Large Concept Models (DLCM)}$, a hierarchical language modeling framework that learns semantic boundaries from latent representations and shifts computation from tokens to a compressed concept space where reasoning is more efficient.
- Score: 56.37266873329401
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) apply uniform computation to all tokens, despite language exhibiting highly non-uniform information density. This token-uniform regime wastes capacity on locally predictable spans while under-allocating computation to semantically critical transitions. We propose $\textbf{Dynamic Large Concept Models (DLCM)}$, a hierarchical language modeling framework that learns semantic boundaries from latent representations and shifts computation from tokens to a compressed concept space where reasoning is more efficient. DLCM discovers variable-length concepts end-to-end without relying on predefined linguistic units. Hierarchical compression fundamentally changes scaling behavior. We introduce the first $\textbf{compression-aware scaling law}$, which disentangles token-level capacity, concept-level reasoning capacity, and compression ratio, enabling principled compute allocation under fixed FLOPs. To stably train this heterogeneous architecture, we further develop a $\textbf{decoupled $\mu$P parametrization}$ that supports zero-shot hyperparameter transfer across widths and compression regimes. At a practical setting ($R=4$, corresponding to an average of four tokens per concept), DLCM reallocates roughly one-third of inference compute into a higher-capacity reasoning backbone, achieving a $\textbf{+2.69\% average improvement}$ across 12 zero-shot benchmarks under matched inference FLOPs.
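To make the token-to-concept compression concrete, the following is a minimal PyTorch sketch of the mechanism the abstract describes: a learned head scores each token's hidden state as a possible concept boundary, and each resulting variable-length segment is pooled into a single concept vector. The module name, the hard top-k boundary selection, and the mean-pooling are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of dynamic token-to-concept
# compression: a linear head scores each token's hidden state as a boundary,
# roughly seq_len / R boundaries are kept, and each resulting segment is
# mean-pooled into one "concept" vector for a higher-capacity backbone.
import torch
import torch.nn as nn


class ConceptCompressor(nn.Module):
    def __init__(self, d_model: int, target_ratio: float = 4.0):
        super().__init__()
        self.boundary_head = nn.Linear(d_model, 1)  # scores a boundary at each token
        self.target_ratio = target_ratio            # R: average tokens per concept

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model) token-level representations
        batch, seq_len, d_model = hidden.shape
        scores = self.boundary_head(hidden).squeeze(-1)              # (batch, seq_len)
        k = max(1, int(seq_len / self.target_ratio))                 # ~seq_len / R boundaries
        topk = scores.topk(k, dim=-1).indices
        # Hard top-k selection (inference-style); training would need a
        # differentiable relaxation, which is omitted here.
        is_boundary = torch.zeros_like(scores).scatter_(1, topk, 1.0).long()
        # A boundary token closes its segment; segment ids grow after each boundary.
        segment_id = torch.cumsum(is_boundary, dim=-1) - is_boundary
        num_concepts = int(segment_id.max().item()) + 1
        # Mean-pool every token sharing a segment id into a single concept vector.
        concepts = hidden.new_zeros(batch, num_concepts, d_model)
        counts = hidden.new_zeros(batch, num_concepts, 1)
        concepts.scatter_add_(1, segment_id.unsqueeze(-1).expand(-1, -1, d_model), hidden)
        counts.scatter_add_(1, segment_id.unsqueeze(-1), hidden.new_ones(batch, seq_len, 1))
        return concepts / counts.clamp(min=1)                        # (batch, num_concepts, d_model)


# Usage: a 32-token sequence compresses to roughly 8 concepts at R = 4.
x = torch.randn(2, 32, 64)
print(ConceptCompressor(64)(x).shape)
```

At $R=4$ the concept sequence has roughly a quarter of the token positions, which is what lets DLCM move the saved compute into a wider reasoning backbone while keeping inference FLOPs matched.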
Related papers
- ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation [12.503747711792679]
ConceptMoE dynamically merges semantically similar tokens into concept representations.
A learnable chunk module identifies optimal boundaries by measuring inter-token similarity.
ConceptMoE consistently outperforms standard MoE across language and vision-language tasks.
arXiv Detail & Related papers (2026-01-29T08:58:22Z)
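A minimal sketch of the similarity-driven chunking summarized in the ConceptMoE entry above, assuming boundaries are placed wherever the cosine similarity between adjacent token representations drops below a threshold; the threshold and the mean-merge are illustrative choices, not the paper's chunk module.

```python
# Illustrative sketch (not ConceptMoE's actual chunk module): place a chunk
# boundary wherever cosine similarity between adjacent token representations
# drops below a threshold, then merge each chunk into one concept vector.
import torch
import torch.nn.functional as F


def chunk_by_similarity(hidden: torch.Tensor, threshold: float = 0.5) -> list[torch.Tensor]:
    # hidden: (seq_len, d_model) representations for a single sequence
    sims = F.cosine_similarity(hidden[:-1], hidden[1:], dim=-1)  # (seq_len - 1,)
    concepts, start = [], 0
    for i, s in enumerate(sims.tolist()):
        if s < threshold:                              # low similarity -> close the current chunk
            concepts.append(hidden[start:i + 1].mean(dim=0))
            start = i + 1
    concepts.append(hidden[start:].mean(dim=0))        # final chunk
    return concepts                                    # list of (d_model,) concept vectors


# Usage: 10 tokens compress to however many chunks the similarity pattern yields.
print(len(chunk_by_similarity(torch.randn(10, 16))))
```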
- Unified Scaling Laws for Compressed Representations [69.72517034565467]
We investigate whether a unified scaling framework can accurately predict model performance when training occurs over various compressed representations.
Our main finding is demonstrating both theoretically and empirically that there exists a simple "capacity" metric.
We extend our formulation to directly compare the accuracy potential of different compressed formats, and to derive better algorithms for training over sparse-quantized formats.
arXiv Detail & Related papers (2025-06-02T16:52:51Z)
- Bound by semanticity: universal laws governing the generalization-identification tradeoff [8.437463955457423]
We show that finite-resolution similarity is a fundamental emergent informational constraint, not merely a toy-model artifact.
These results provide an exact theory of the generalization-identification trade-off and clarify how semantic resolution shapes the representational capacity of deep networks and brains alike.
arXiv Detail & Related papers (2025-06-01T15:56:26Z)
- Compression Hacking: A Supplementary Perspective on Informatics Properties of Language Models from Geometric Distortion [56.12939353271623]
From a geometric standpoint, the word representation space of highly compressed LMs tends to degenerate into a highly anisotropic state.
We find this synchronicity is essentially the "Compression Hacking" in LM representations.
We propose three refined compression metrics by incorporating geometric distortion analysis and integrate them into a self-evaluation pipeline.
arXiv Detail & Related papers (2025-05-23T12:11:03Z)
- Saliency-driven Dynamic Token Pruning for Large Language Models [32.903622070917194]
We propose Saliency-driven Dynamic Token Pruning (SDTP).
A lightweight saliency-driven prediction module is designed to estimate the importance score of each token with its hidden state.
A ranking-based optimization strategy is proposed to minimize the ranking divergence of the saliency score and the predicted importance score.
arXiv Detail & Related papers (2025-04-06T15:15:07Z)
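The SDTP entry above describes a lightweight module that predicts per-token importance and prunes low-scoring tokens; the sketch below shows that idea with an assumed small MLP scorer and keep ratio (the ranking-divergence training objective mentioned in the summary is omitted).

```python
# Illustrative sketch of saliency-driven dynamic token pruning (not SDTP's
# released code): a small MLP predicts an importance score from each token's
# hidden state, and only the top-scoring tokens are kept for later layers.
import torch
import torch.nn as nn


class TokenPruner(nn.Module):
    def __init__(self, d_model: int, keep_ratio: float = 0.5):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(d_model, d_model // 4), nn.GELU(),
                                    nn.Linear(d_model // 4, 1))
        self.keep_ratio = keep_ratio

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, seq_len, d_model)
        scores = self.scorer(hidden).squeeze(-1)                     # (batch, seq_len)
        k = max(1, int(hidden.size(1) * self.keep_ratio))
        keep = scores.topk(k, dim=-1).indices.sort(dim=-1).values    # preserve token order
        pruned = torch.gather(hidden, 1, keep.unsqueeze(-1).expand(-1, -1, hidden.size(-1)))
        return pruned, scores                                        # pruned: (batch, k, d_model)


# Usage: halve a 16-token sequence.
pruned, _ = TokenPruner(32)(torch.randn(2, 16, 32))
print(pruned.shape)  # torch.Size([2, 8, 32])
```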
- UniF$^2$ace: A Unified Fine-grained Face Understanding and Generation Model [62.66515621965686]
We introduce a novel theoretical framework with a Dual Discrete Diffusion (D3Diff) loss, unifying masked generative models with discrete score matching diffusion.
This D3Diff significantly enhances the model's ability to synthesize high-fidelity facial details aligned with text input.
We construct UniF$^2$aceD-1M, a large-scale dataset comprising 130K fine-grained image-caption pairs and 1M visual question-answering pairs.
arXiv Detail & Related papers (2025-03-11T07:34:59Z)
- Scaling Embedding Layers in Language Models [61.939921364422936]
$SCONE$ is a new method for extending input embedding layers to enhance language model performance.
$SCONE$ retains the original vocabulary while introducing embeddings for a set of frequent n-grams.
These embeddings provide contextualized representation for each input token and are learned with a separate model during training.
$SCONE$ enables two new scaling strategies: increasing the number of n-gram embeddings and scaling the model used to learn them, both while maintaining fixed accelerator usage during inference.
arXiv Detail & Related papers (2025-02-03T18:59:32Z)
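The SCONE entry above describes augmenting the input embedding layer with embeddings for frequent n-grams while keeping the original vocabulary; below is a toy sketch under that reading, with a hand-built n-gram table standing in for the separately learned embedding model.

```python
# Illustrative sketch of the n-gram embedding idea (not the SCONE
# implementation): each token keeps its ordinary embedding, and if the n-gram
# ending at that token is in a frequent-n-gram table, a cached n-gram
# embedding is added on top. Table construction and the separate model that
# learns n-gram embeddings are omitted; all names here are assumptions.
import torch
import torch.nn as nn


class NGramAugmentedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, num_ngrams: int, d_model: int, n: int = 2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.ngram = nn.Embedding(num_ngrams + 1, d_model, padding_idx=0)  # row 0 = "no match"
        self.table = {}  # maps an n-gram tuple of token ids -> row in self.ngram
        self.n = n

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq_len) token ids
        out = self.tok(ids)
        ngram_ids = torch.zeros_like(ids)
        for b in range(ids.size(0)):
            for t in range(self.n - 1, ids.size(1)):
                key = tuple(ids[b, t - self.n + 1 : t + 1].tolist())
                ngram_ids[b, t] = self.table.get(key, 0)
        return out + self.ngram(ngram_ids)


# Usage: register one frequent bigram and embed a toy batch.
emb = NGramAugmentedEmbedding(vocab_size=100, num_ngrams=10, d_model=8)
emb.table[(5, 7)] = 1
print(emb(torch.tensor([[5, 7, 3, 5, 7]])).shape)  # torch.Size([1, 5, 8])
```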
- Mixture-of-Depths: Dynamically allocating compute in transformer-based language models [8.774705201394916]
Transformer-based language models spread FLOPs uniformly across input sequences.
We show that transformers can learn to dynamically allocate FLOPs to specific positions in a sequence.
arXiv Detail & Related papers (2024-04-02T19:28:11Z)
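For the Mixture-of-Depths entry above, a minimal sketch of per-block token routing: a router scores tokens, only a fixed fraction passes through the block's computation, and the rest take the residual skip path. The capacity fraction and the MLP block are assumptions for illustration, not the paper's architecture.

```python
# Illustrative sketch of per-block dynamic compute allocation: route only the
# top-scoring fraction of tokens through the block; the rest pass unchanged.
import torch
import torch.nn as nn


class RoutedBlock(nn.Module):
    def __init__(self, d_model: int, capacity: float = 0.25):
        super().__init__()
        self.router = nn.Linear(d_model, 1)
        self.block = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                   nn.Linear(4 * d_model, d_model))
        self.capacity = capacity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        scores = self.router(x).squeeze(-1)                          # (batch, seq_len)
        k = max(1, int(x.size(1) * self.capacity))
        idx = scores.topk(k, dim=-1).indices                         # tokens that get compute
        gathered = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        updated = gathered + self.block(gathered)                    # full compute for routed tokens
        out = x.clone()                                              # all other tokens: identity skip
        out.scatter_(1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)), updated)
        return out


# Usage: only ~25% of the 16 positions pass through the MLP.
print(RoutedBlock(32)(torch.randn(2, 16, 32)).shape)  # torch.Size([2, 16, 32])
```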
- Densely Connected $G$-invariant Deep Neural Networks with Signed Permutation Representations [6.200483285433661]
We introduce and investigate, for finite groups $G$, $G$-invariant deep neural network ($G$-DNN) architectures with ReLU activation.
The preactivations of the $G$-DNNs are able to transform by signed permutation representations (signed perm-reps) of $G$.
We show that there are far more admissible $G$-DNN architectures than those accessible with the "concatenated ReLU" activation function from the literature.
arXiv Detail & Related papers (2023-03-08T14:35:03Z)
- Unsupervised Semantic Segmentation by Distilling Feature Correspondences [94.73675308961944]
Unsupervised semantic segmentation aims to discover and localize semantically meaningful categories within image corpora without any form of annotation.
We present STEGO, a novel framework that distills unsupervised features into high-quality discrete semantic labels.
STEGO yields a significant improvement over the prior state of the art, on both the CocoStuff and Cityscapes challenges.
arXiv Detail & Related papers (2022-03-16T06:08:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.