Semantic Specialization in MoE Appears with Scale: A Study of DeepSeek R1 Expert Specialization
- URL: http://arxiv.org/abs/2502.10928v1
- Date: Sat, 15 Feb 2025 23:37:32 GMT
- Title: Semantic Specialization in MoE Appears with Scale: A Study of DeepSeek R1 Expert Specialization
- Authors: Matthew Lyle Olson, Neale Ratzlaff, Musashi Hinck, Man Luo, Sungduk Yu, Chendi Xue, Vasudev Lal
- Abstract summary: DeepSeek-R1, the largest open-source Mixture-of-Experts (MoE) model, has demonstrated reasoning capabilities comparable to proprietary frontier models. We investigate whether its routing mechanism exhibits greater semantic specialization than previous MoE models. We conclude that DeepSeek-R1's routing mechanism is more semantically aware and that it engages in structured cognitive processes.
- Score: 7.457737671087695
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: DeepSeek-R1, the largest open-source Mixture-of-Experts (MoE) model, has demonstrated reasoning capabilities comparable to proprietary frontier models. Prior research has explored expert routing in MoE models, but findings suggest that expert selection is often token-dependent rather than semantically driven. Given DeepSeek-R1's enhanced reasoning abilities, we investigate whether its routing mechanism exhibits greater semantic specialization than previous MoE models. To explore this, we conduct two key experiments: (1) a word sense disambiguation task, where we examine expert activation patterns for words with differing senses, and (2) a cognitive reasoning analysis, where we assess DeepSeek-R1's structured thought process in an interactive task setting of DiscoveryWorld. We conclude that DeepSeek-R1's routing mechanism is more semantically aware and that it engages in structured cognitive processes.
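To make the first experiment concrete, here is a minimal sketch (not the authors' code) of how expert-routing overlap for a polysemous word could be measured. It assumes the per-layer sets of top-k expert indices for the target token have already been recorded (e.g., via forward hooks on the router); the toy routing traces below are illustrative stand-ins for real hook output.

```python
from typing import Dict, Set

def jaccard(a: Set[int], b: Set[int]) -> float:
    """Set overlap in [0, 1]; 1.0 means identical expert selections."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def routing_overlap(run_a: Dict[int, Set[int]],
                    run_b: Dict[int, Set[int]]) -> Dict[int, float]:
    """Per-layer Jaccard overlap between the experts routed to the same
    surface token under two different contexts (senses)."""
    return {layer: jaccard(run_a[layer], run_b[layer]) for layer in run_a}

# Toy routing traces (top-k expert IDs per MoE layer) for "bank" in two senses.
bank_financial = {0: {3, 17, 42, 101}, 1: {7, 12, 64, 88}}
bank_river     = {0: {3, 9, 55, 200},  1: {7, 31, 77, 88}}

print(routing_overlap(bank_financial, bank_river))
# Low overlap across layers would suggest sense-dependent (semantic) routing;
# high overlap would suggest routing driven mainly by the surface token.
```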
Related papers
- BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models [55.2480439325792]
We propose methodology for automated comparison of language models that uses performance-aware contextual embeddings to find fine-grained features of text where one LM outperforms another. Our method, which we name BehaviorBox, extracts coherent features that demonstrate differences with respect to the ease of generation between two LMs. We apply BehaviorBox to compare models that vary in size, model family, and post-training, and enumerate insights into specific contexts that illustrate meaningful differences in performance which cannot be found by measures such as corpus-level perplexity alone.
arXiv Detail & Related papers (2025-06-02T19:44:06Z) - Unveiling Hidden Collaboration within Mixture-of-Experts in Large Language Models [5.211806751260724]
We propose a hierarchical sparse dictionary learning (HSDL) method that uncovers the collaboration patterns among experts.
We also introduce the Contribution-Aware Expert Pruning (CAEP) algorithm, which effectively prunes low-contribution experts.
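As a rough illustration of the collaboration-pattern idea, the sketch below runs plain (non-hierarchical) sparse dictionary learning over synthetic expert-activation vectors; it is a generic stand-in for the paper's HSDL method, using sklearn's standard implementation and planted co-activation groups rather than real routing data.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)

# Synthetic routing data with planted structure: tokens tend to co-activate
# experts from one of a few latent groups, plus one noise expert.
n_tokens, n_experts, n_groups = 300, 64, 8
groups = rng.permutation(n_experts).reshape(n_groups, -1)  # 8 experts per group
X = np.zeros((n_tokens, n_experts))
for row in X:
    g = groups[rng.integers(n_groups)]
    row[rng.choice(g, size=6, replace=False)] = 1.0
    row[rng.integers(n_experts)] = 1.0

# Learn a small dictionary of sparse "collaboration patterns": each atom is a
# group of experts that tend to be activated together.
dl = DictionaryLearning(n_components=16, transform_algorithm="lasso_lars",
                        transform_alpha=0.1, random_state=0)
codes = dl.fit_transform(X)   # (tokens x atoms) sparse usage
atoms = dl.components_        # (atoms x experts) collaboration patterns

for i, atom in enumerate(atoms[:3]):
    top = np.argsort(-np.abs(atom))[:5]
    print(f"atom {i}: most involved experts -> {top.tolist()}")
```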
arXiv Detail & Related papers (2025-04-16T04:06:15Z) - Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations [48.890534958441016]
We investigate domain specialization and expert redundancy in large-scale MoE models.
We propose a simple yet effective pruning framework, EASY-EP, to identify and retain only the most relevant experts.
Our method achieves comparable performance and $2.99\times$ throughput under the same memory budget as the full DeepSeek-R1, while using only half the experts.
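The general recipe can be sketched as follows, with the caveat that this is a simplified stand-in for EASY-EP rather than its actual scoring rule: route a handful of domain demonstrations through the model, tally how often each expert is selected, and retain only the top-scoring half.

```python
from collections import Counter
from typing import Iterable, List, Set

def score_experts(routing_traces: Iterable[List[int]]) -> Counter:
    """Count how often each expert is selected across all tokens of the
    few-shot domain demonstrations (one trace = expert IDs chosen per token)."""
    counts: Counter = Counter()
    for trace in routing_traces:
        counts.update(trace)
    return counts

def keep_top_half(counts: Counter, n_experts: int) -> Set[int]:
    """Keep the half of the experts most frequently used on the domain demos."""
    ranked = [e for e, _ in counts.most_common()]
    ranked += [e for e in range(n_experts) if e not in counts]  # never-used experts rank last
    return set(ranked[: n_experts // 2])

# Toy traces: each inner list is the experts routed for one token of a demo.
demos = [[3, 17, 3, 42], [17, 3, 9, 3], [42, 3, 17, 7]]
print(keep_top_half(score_experts(demos), n_experts=16))
```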
arXiv Detail & Related papers (2025-04-09T11:34:06Z) - DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning [31.805726635329595]
We investigate the impact and controllability of DeepSeek-R1's thought length, management of long or confusing contexts, cultural and safety concerns.
We show DeepSeek-R1 has a 'sweet spot' of reasoning, where extra inference time can impair model performance.
We also note strong safety vulnerabilities of DeepSeek-R1 compared to its non-reasoning counterpart.
arXiv Detail & Related papers (2025-04-02T00:36:08Z) - DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding [61.26026947423187]
Human experts excel at fine-grained visual discrimination by leveraging domain knowledge to refine perceptual features.
Current Multimodal Large Language Models (MLLMs) struggle to integrate reasoning into visual perception.
We propose DeepPerception, an MLLM enhanced with cognitive visual perception capabilities.
arXiv Detail & Related papers (2025-03-17T04:06:34Z) - Mixture of Tunable Experts - Behavior Modification of DeepSeek-R1 at Inference Time [1.1655046053160683]
We present a method that extends the Mixture-of-Experts architecture of Large Language Models (LLMs). MoTE enables meaningful and focused behavior changes in LLMs on-the-fly during inference time.
arXiv Detail & Related papers (2025-02-16T12:24:39Z) - DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [147.16121855209246]
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero is trained via large-scale reinforcement learning. DeepSeek-R1 incorporates multi-stage training and cold-start data before RL.
arXiv Detail & Related papers (2025-01-22T15:19:35Z) - GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning [51.677086019209554]
We propose a Generalized Structural Sparse Function to capture powerful relationships across modalities for pair-wise similarity learning.
The distance metric delicately encapsulates two formats of diagonal and block-diagonal terms.
Experiments on cross-modal and two extra uni-modal retrieval tasks have validated its superiority and flexibility.
arXiv Detail & Related papers (2024-10-20T03:45:50Z) - Attention Heads of Large Language Models: A Survey [10.136767972375639]
We aim to demystify the internal reasoning processes of Large Language Models (LLMs) by systematically exploring the roles and mechanisms of attention heads. We first introduce a novel four-stage framework inspired by the human thought process: Knowledge Recalling, In-Context Identification, Latent Reasoning, and Expression Preparation. We analyze the experimental methodologies used to discover these special heads, dividing them into two categories: Modeling-Free and Modeling-Required methods.
arXiv Detail & Related papers (2024-09-05T17:59:12Z) - Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency.
We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures.
The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
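A bare-bones version of the grouping step might look like the following: experts whose (flattened) weight vectors are highly cosine-similar are greedily clustered, and one representative per group is kept. This is an illustrative sketch, not the paper's algorithm, and the prototype experts below are synthetic.

```python
import numpy as np

def group_similar_experts(weights: np.ndarray, threshold: float = 0.9) -> list[list[int]]:
    """Greedily group experts whose flattened weights have cosine similarity
    above `threshold`; `weights` has shape (n_experts, n_params)."""
    normed = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    sim = normed @ normed.T
    groups, assigned = [], set()
    for i in range(len(weights)):
        if i in assigned:
            continue
        members = [j for j in range(len(weights))
                   if j not in assigned and sim[i, j] >= threshold]
        assigned.update(members)
        groups.append(members)
    return groups

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 128))                                      # 4 distinct "prototype" experts
experts = np.vstack([base, base + 0.01 * rng.normal(size=(4, 128))])  # plus 4 near-duplicates
groups = group_similar_experts(experts)
keep = [g[0] for g in groups]   # prune: keep one expert per group
print(groups, "-> keep experts", keep)
```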
arXiv Detail & Related papers (2024-07-12T17:25:02Z) - A Closer Look into Mixture-of-Experts in Large Language Models [26.503570706063634]
Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance.
MoE architecture could increase the model size without sacrificing computational efficiency.
We make an initial attempt to understand the inner workings of MoE-based large language models.
arXiv Detail & Related papers (2024-06-26T10:07:57Z) - Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast [58.98411447739218]
Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency.
We propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference.
Our method is conceptually simple and computationally lightweight, as it incurs minimal latency compared to greedy decoding.
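The following is a generic self-contrast decoding sketch, assuming two forward passes are available: one with the default (strong) routing and one with a deliberately weaker routing configuration such as the normally unchosen experts. How SCMoE actually forms the weak pass and combines the two distributions follows the paper and may differ from this illustration.

```python
import numpy as np

def self_contrast_logits(strong: np.ndarray, weak: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Sharpen the default (strong-routing) next-token logits against logits
    from a weaker routing configuration, contrastive-decoding style."""
    return strong + beta * (strong - weak)

def greedy(logits: np.ndarray) -> int:
    return int(np.argmax(logits))

# Toy next-token logits over a 6-word vocabulary from the two forward passes.
strong = np.array([2.0, 1.8, 0.5, 0.2, -1.0, -1.2])
weak   = np.array([2.1, 1.0, 0.6, 0.1, -0.9, -1.1])

print(greedy(strong))                                # 0 under plain greedy decoding
print(greedy(self_contrast_logits(strong, weak)))    # 1: the token the strong pass prefers
                                                     # *more than* the weak pass does
```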
arXiv Detail & Related papers (2024-05-23T12:45:29Z) - Generalization Error Analysis for Sparse Mixture-of-Experts: A Preliminary Study [65.11303133775857]
Mixture-of-Experts (MoE) computation amalgamates predictions from several specialized sub-models (referred to as experts).
Sparse MoE selectively engages only a limited number, or even just one expert, significantly reducing overhead while empirically preserving, and sometimes even enhancing, performance.
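For readers unfamiliar with the mechanism, a minimal top-k gating step for a single token is sketched below. This is generic sparse MoE routing, not tied to any particular model, with random weights standing in for trained ones.

```python
import numpy as np

def top_k_moe(x: np.ndarray, router_w: np.ndarray, expert_ws: np.ndarray, k: int = 2) -> np.ndarray:
    """Route one token through the top-k experts only.
    x: (d,), router_w: (n_experts, d), expert_ws: (n_experts, d, d)."""
    logits = router_w @ x                                     # routing score per expert
    top = np.argsort(-logits)[:k]                             # indices of the k best experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()   # renormalise over selected experts
    # Only the selected experts are evaluated: this is the source of the compute savings.
    return sum(g * (expert_ws[e] @ x) for g, e in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
y = top_k_moe(rng.normal(size=d), rng.normal(size=(n_experts, d)),
              rng.normal(size=(n_experts, d, d)), k=2)
print(y.shape)   # (8,)
```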
arXiv Detail & Related papers (2024-03-26T05:48:02Z) - Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer [59.43462055143123]
The Mixture of Experts (MoE) has emerged as a highly successful technique in deep learning.
In this study, we shed light on the homogeneous representation problem, wherein experts in the MoE fail to specialize and lack diversity.
We propose an alternating training strategy that encourages each expert to update in a direction orthogonal to the subspace spanned by other experts.
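A single illustrative update step is shown below: one expert's gradient is projected onto the orthogonal complement of the subspace spanned by the other experts' (flattened) weights before being applied. This sketches the orthogonality idea only, not the paper's full alternating training procedure; all tensors here are synthetic.

```python
import numpy as np

def orthogonal_step(grad: np.ndarray, other_experts: np.ndarray, lr: float = 1e-2) -> np.ndarray:
    """Project `grad` onto the orthogonal complement of the row space of
    `other_experts` (shape: (n_other, n_params)) and return the update."""
    q, _ = np.linalg.qr(other_experts.T)        # orthonormal basis of the others' subspace
    grad_perp = grad - q @ (q.T @ grad)         # remove the component inside that subspace
    return -lr * grad_perp

rng = np.random.default_rng(0)
others = rng.normal(size=(3, 32))               # three other experts, 32 params each
grad = rng.normal(size=32)
step = orthogonal_step(grad, others)
print(np.abs(others @ step).max())              # ~0: the update is orthogonal to every other expert
```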
arXiv Detail & Related papers (2023-10-15T07:20:28Z) - Dual Path Modeling for Semantic Matching by Perceiving Subtle Conflicts [14.563722352134949]
Transformer-based pre-trained models have achieved great improvements in semantic matching.
Existing models still suffer from insufficient ability to capture subtle differences.
We propose a novel Dual Path Modeling Framework to enhance the model's ability to perceive subtle differences.
arXiv Detail & Related papers (2023-02-24T09:29:55Z) - Using Deep Mixture-of-Experts to Detect Word Meaning Shift for TempoWiC [0.9543943371833467]
This paper describes the dma submission to the TempoWiC task, which achieves a macro-F1 score of 77.05%.
For further improvement, we integrate POS information and word semantic representation using a Mixture-of-Experts (MoE) approach.
arXiv Detail & Related papers (2022-11-07T11:28:34Z) - Probing Semantic Grounding in Language Models of Code with Representational Similarity Analysis [0.11470070927586018]
We propose using Representational Similarity Analysis to probe the semantic grounding in language models of code.
We probe representations from the CodeBERT model for semantic grounding by using the data from the IBM CodeNet dataset.
Our experiments with semantic perturbations in code reveal that CodeBERT is able to robustly distinguish between semantically correct and incorrect code.
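RSA itself is model-agnostic and can be stated in a few lines: build a representational dissimilarity matrix (RDM) for each representation space over the same set of items, then correlate the RDMs. A generic sketch with synthetic embeddings follows; it is not tied to CodeBERT or CodeNet.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(representations: np.ndarray) -> np.ndarray:
    """Condensed representational dissimilarity matrix (pairwise cosine distances)."""
    return pdist(representations, metric="cosine")

def rsa_score(reps_a: np.ndarray, reps_b: np.ndarray) -> float:
    """Second-order similarity: Spearman correlation between the two RDMs."""
    return float(spearmanr(rdm(reps_a), rdm(reps_b)).correlation)

rng = np.random.default_rng(0)
model_reps = rng.normal(size=(50, 256))                     # e.g. code embeddings for 50 programs
related    = model_reps + 0.1 * rng.normal(size=(50, 256))  # a space with similar geometry
unrelated  = rng.normal(size=(50, 256))

print(rsa_score(model_reps, related))    # high (close to 1)
print(rsa_score(model_reps, unrelated))  # near 0
```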
arXiv Detail & Related papers (2022-07-15T19:04:43Z) - Efficient Large Scale Language Modeling with Mixtures of Experts [61.45159383372181]
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation.
This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings.
arXiv Detail & Related papers (2021-12-20T17:05:11Z) - Taming Sparsely Activated Transformer with Stochastic Experts [76.0711573018493]
Sparsely activated models (SAMs) can easily scale to have outrageously large amounts of parameters without significant increase in computational cost.
In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts).
Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference.
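The core twist is simple enough to sketch: instead of a learned router, each input draws its expert at random. The snippet below shows only this random activation step (with one expert per input for brevity) and omits the rest of the training recipe; weights are synthetic.

```python
import numpy as np

def random_expert_forward(x: np.ndarray, expert_ws: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly activate one expert per input instead of using a learned router.
    x: (batch, d), expert_ws: (n_experts, d, d)."""
    chosen = rng.integers(len(expert_ws), size=len(x))     # one random expert per input
    return np.stack([expert_ws[e] @ xi for e, xi in zip(chosen, x)])

rng = np.random.default_rng(0)
batch, d, n_experts = 4, 8, 6
out = random_expert_forward(rng.normal(size=(batch, d)),
                            rng.normal(size=(n_experts, d, d)), rng)
print(out.shape)   # (4, 8)
```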
arXiv Detail & Related papers (2021-10-08T17:15:47Z) - A comprehensive comparative evaluation and analysis of Distributional Semantic Models [61.41800660636555]
We perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
The results show that the alleged superiority of predict-based models is more apparent than real, and surely not ubiquitous.
We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
arXiv Detail & Related papers (2021-05-20T15:18:06Z) - When Hearst Is not Enough: Improving Hypernymy Detection from Corpus with Distributional Models [59.46552488974247]
This paper addresses whether an is-a relationship exists between words (x, y) with the help of large textual corpora.
Recent studies suggest that pattern-based methods are superior when large-scale Hearst pairs are extracted and fed to the model, relieving the sparsity of unseen (x, y) pairs.
For the first time, this paper quantifies the non-negligible existence of those specific cases. We also demonstrate that distributional methods are ideal to make up for pattern-based ones in such cases.
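For context, the pattern-based approach boils down to matching lexico-syntactic templates such as "X such as Y" in raw text. A toy matcher for two classic Hearst patterns is shown below; it is illustrative only and not the extraction pipeline used in the paper.

```python
import re

# Two classic Hearst patterns: "X such as Y" and "Y and other X".
PATTERNS = [
    (re.compile(r"(\w+) such as (\w+)"), lambda m: (m.group(2), m.group(1))),   # -> (hyponym, hypernym)
    (re.compile(r"(\w+) and other (\w+)"), lambda m: (m.group(1), m.group(2))),
]

def extract_isa_pairs(text: str) -> list[tuple[str, str]]:
    """Return (x, y) pairs where the text asserts x is-a y."""
    pairs = []
    for pattern, to_pair in PATTERNS:
        pairs.extend(to_pair(m) for m in pattern.finditer(text))
    return pairs

print(extract_isa_pairs("Fruits such as apples are healthy; cellos and other instruments need care."))
# [('apples', 'Fruits'), ('cellos', 'instruments')]
```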
arXiv Detail & Related papers (2020-10-10T08:34:19Z)