Fugu-MT Paper Translation (Summary): ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation

Paper overview: ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation

arxiv url: http://arxiv.org/abs/2601.21420v1
Date: Thu, 29 Jan 2026 08:58:22 GMT
Status: Translation complete
Last updated in system: 2026-01-30 16:22:49.683547
Title: ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation
Authors: Zihao Huang, Jundong Zhou, Xingwei Qu, Qiyang Min, Ge Zhang,
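Abstract summary: ConceptMoE dynamically merges semantically similar tokens into concept representations. A learnable chunk module identifies optimal boundaries by measuring inter-token similarity. ConceptMoE consistently outperforms standard MoE on language and vision-language tasks.
Reference score (in-house attention metric): 12.503747711792679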
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models allocate uniform computation across all tokens, ignoring that some sequences are trivially predictable while others require deep reasoning. We introduce ConceptMoE, which dynamically merges semantically similar tokens into concept representations, performing implicit token-level compute allocation. A learnable chunk module identifies optimal boundaries by measuring inter-token similarity, compressing sequences by a target ratio $R$ before they enter the compute-intensive concept model. Crucially, the MoE architecture enables controlled evaluation: we reallocate saved computation to match baseline activated FLOPs (excluding attention map computation) and total parameters, isolating genuine architectural benefits. Under these conditions, ConceptMoE consistently outperforms standard MoE across language and vision-language tasks, achieving +0.9 points on language pretraining, +2.3 points on long context understanding, and +0.6 points on multimodal benchmarks. When converting pretrained MoE during continual training with layer looping, gains reach +5.5 points, demonstrating practical applicability. Beyond performance, ConceptMoE reduces attention computation by up to $R^2\times$ and KV cache by $R\times$. At $R=2$, empirical measurements show prefill speedups reaching 175% and decoding speedups up to 117% on long sequences. The minimal architectural modifications enable straightforward integration into existing MoE, demonstrating that adaptive concept-level processing fundamentally improves both effectiveness and efficiency of large language models.
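The chunking step described in the abstract (measure inter-token similarity, pick boundaries, compress by a target ratio $R$) can be illustrated with a minimal, non-learnable sketch. It assumes token hidden states arrive as a NumPy array, scores each adjacent pair by cosine similarity, opens a new concept at the ceil(N/R) - 1 least similar pairs, and mean-pools each chunk. The function name `chunk_by_similarity`, the hard top-k boundary rule, and mean pooling are all illustrative assumptions; the paper's chunk module is learnable, and its exact boundary and merge rules are not given in this summary.

```python
import math

import numpy as np


def chunk_by_similarity(hidden: np.ndarray, ratio: float):
    """Hypothetical sketch: merge runs of similar adjacent tokens into concepts.

    hidden: (num_tokens, dim) array of token hidden states.
    ratio:  target compression ratio R (num_tokens / num_concepts).
    Returns (concepts, boundaries); boundaries[i] is True when token i
    starts a new concept.
    """
    n, dim = hidden.shape
    # Target number of concepts, clamped to [1, n].
    num_concepts = min(n, max(1, math.ceil(n / ratio)))

    # Cosine similarity between each token and the token before it.
    unit = hidden / (np.linalg.norm(hidden, axis=1, keepdims=True) + 1e-8)
    sim = np.sum(unit[1:] * unit[:-1], axis=1)  # shape (n - 1,)

    # Assumed hard rule (not the paper's learnable module): open a new concept
    # at the (num_concepts - 1) least similar adjacent pairs; token 0 always
    # opens a concept.
    boundaries = np.zeros(n, dtype=bool)
    boundaries[0] = True
    if num_concepts > 1:
        boundaries[np.argsort(sim)[: num_concepts - 1] + 1] = True

    # Mean-pool the tokens that fall inside each chunk.
    chunk_id = np.cumsum(boundaries) - 1  # 0-based concept index per token
    concepts = np.zeros((num_concepts, dim))
    np.add.at(concepts, chunk_id, hidden)
    counts = np.bincount(chunk_id, minlength=num_concepts)[:, None]
    return concepts / counts, boundaries


# Two groups of similar tokens compress to two concepts at R = 2.
tokens = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
concepts, bounds = chunk_by_similarity(tokens, ratio=2)
print(bounds.tolist())  # [True, False, True, False]
print(concepts)
```

Choosing the least similar pairs makes the compression ratio exact, which keeps the sketch simple; a learned module would instead score boundaries with trained parameters and only approximately hit the target $R$.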
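The abstract's "up to $R^2\times$" attention and "$R\times$" KV-cache reductions follow from how each cost scales with sequence length: attention-map work grows with $L^2$, while the KV cache grows with $L$. The hypothetical helper below assumes the best case, where every attention layer operates on the compressed sequence of $L/R$ concepts. The function name, layer count, KV-head count, head dimension, and 2-byte values are illustrative assumptions, not the paper's configuration, and the sketch does not model the measured prefill and decoding speedups.

```python
def compression_savings(seq_len: int, ratio: float, num_layers: int,
                        num_kv_heads: int, head_dim: int,
                        bytes_per_value: int = 2) -> dict:
    """Idealized attention and KV-cache savings from compressing by ratio R.

    Best case: assumes every attention layer runs on seq_len / ratio concepts.
    """
    concept_len = seq_len / ratio
    # Attention-map work scales with the square of sequence length.
    attention_reduction = seq_len ** 2 / concept_len ** 2
    # The KV cache holds one key and one value vector per position,
    # per KV head, per layer.
    kv_bytes_per_position = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return {
        "attention_reduction": attention_reduction,
        "kv_bytes_tokens": seq_len * kv_bytes_per_position,
        "kv_bytes_concepts": concept_len * kv_bytes_per_position,
    }


# Hypothetical configuration, not taken from the paper.
costs = compression_savings(4096, ratio=2, num_layers=32, num_kv_heads=8, head_dim=128)
print(costs["attention_reduction"])  # 4.0
print(costs["kv_bytes_tokens"], costs["kv_bytes_concepts"])  # 536870912 268435456.0
```

At $R=2$ the attention-map work falls by $2^2 = 4\times$ and the KV cache halves, matching the scaling stated in the abstract; any layers that still see the full token sequence would reduce these factors.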
