Monet: Mixture of Monosemantic Experts for Transformers
- URL: http://arxiv.org/abs/2412.04139v2
- Date: Mon, 09 Dec 2024 07:49:01 GMT
- Title: Monet: Mixture of Monosemantic Experts for Transformers
- Authors: Jungwoo Park, Young Jin Ahn, Kee-Eung Kim, Jaewoo Kang
- Abstract summary: We introduce the Mixture of Monosemantic Experts for Transformers (Monet) architecture.
Monet incorporates sparse dictionary learning directly into end-to-end Mixture-of-Experts pretraining.
Our analyses demonstrate mutual exclusivity of knowledge across experts and showcase the parametric knowledge encapsulated within individual experts.
- Score: 33.8311330578753
- Abstract: Understanding the internal computations of large language models (LLMs) is crucial for aligning them with human values and preventing undesirable behaviors like toxic content generation. However, mechanistic interpretability is hindered by polysemanticity -- where individual neurons respond to multiple, unrelated concepts. While Sparse Autoencoders (SAEs) have attempted to disentangle these features through sparse dictionary learning, they have compromised LLM performance due to reliance on post-hoc reconstruction loss. To address this issue, we introduce the Mixture of Monosemantic Experts for Transformers (Monet) architecture, which incorporates sparse dictionary learning directly into end-to-end Mixture-of-Experts pretraining. Our novel expert decomposition method enables scaling the expert count to 262,144 per layer while total parameters scale proportionally to the square root of the number of experts. Our analyses demonstrate mutual exclusivity of knowledge across experts and showcase the parametric knowledge encapsulated within individual experts. Moreover, Monet allows knowledge manipulation over domains, languages, and toxicity mitigation without degrading general performance. Our pursuit of transparent LLMs highlights the potential of scaling expert counts to enhance mechanistic interpretability and directly resect the internal knowledge to fundamentally adjust model behavior. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Monet.
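The square-root scaling above comes from the way experts are composed rather than stored independently. As a minimal, hedged sketch (not Monet's actual decomposition, routing, or hyperparameters), the NumPy snippet below forms 262,144 addressable experts as (bottom, top) pairs drawn from two banks of 512 sub-experts each, so stored parameters grow with the sum of the bank sizes while the addressable expert count grows with their product.

```python
# Illustrative sketch only: virtual experts as compositions of two small banks.
# Bank sizes, hidden widths, and the plain two-layer expert form are assumptions.
import numpy as np

d_model, d_expert = 64, 16
n_side = 512                       # 512 "bottom" x 512 "top" sub-experts
n_virtual = n_side * n_side        # 262,144 addressable expert pairs

rng = np.random.default_rng(0)
W_bottom = rng.standard_normal((n_side, d_model, d_expert)) * 0.02
W_top = rng.standard_normal((n_side, d_expert, d_model)) * 0.02

def expert_forward(x, i, j):
    """Apply virtual expert (i, j): bottom projection i, ReLU, top projection j."""
    h = np.maximum(x @ W_bottom[i], 0.0)
    return h @ W_top[j]

y = expert_forward(rng.standard_normal(d_model), i=3, j=417)

params_if_independent = n_virtual * 2 * d_model * d_expert
params_with_banks = W_bottom.size + W_top.size
print(f"addressable experts: {n_virtual:,}")
print(f"params if experts were independent: {params_if_independent:,}")
print(f"params with composed banks:         {params_with_banks:,}")  # ~sqrt(N) scaling
```

Doubling the number of addressable experts to 2 x 262,144 would only require growing one bank from 512 to 1,024 sub-experts, which is the square-root behavior the abstract describes.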
Related papers
- Mixture of Tunable Experts - Behavior Modification of DeepSeek-R1 at Inference Time [1.1655046053160683]
We present the Mixture of Tunable Experts (MoTE), a method that extends the Mixture-of-Experts architecture of Large Language Models (LLMs).
MoTE enables meaningful and focused behavior changes in LLMs on the fly at inference time.
arXiv Detail & Related papers (2025-02-16T12:24:39Z) - More Experts Than Galaxies: Conditionally-overlapping Experts With Biologically-Inspired Fixed Routing [5.846028298833611]
Conditionally Overlapping Mixture of ExperTs (COMET) is a general deep learning method that induces a modular, sparse architecture with an exponential number of overlapping experts.
We demonstrate the effectiveness of COMET on a range of tasks, including image classification, language modeling, and regression.
arXiv Detail & Related papers (2024-10-10T14:58:18Z) - LLMs' Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements [59.71218039095155]
The task of reading comprehension (RC) provides a primary means of assessing language models' natural language understanding (NLU) capabilities.
If the context aligns with the models' internal knowledge, it is hard to discern whether the models' answers stem from context comprehension or from internal information.
To address this issue, we suggest using RC on imaginary data, based on fictitious facts and entities.
arXiv Detail & Related papers (2024-04-09T13:08:56Z) - Robust and Scalable Model Editing for Large Language Models [75.95623066605259]
We propose EREN (Edit models by REading Notes) to improve the scalability and robustness of LLM editing.
Unlike existing techniques, it can integrate knowledge from multiple edits, and correctly respond to syntactically similar but semantically unrelated inputs.
arXiv Detail & Related papers (2024-03-26T06:57:23Z) - Tuning-Free Accountable Intervention for LLM Deployment -- A Metacognitive Approach [55.613461060997004]
Large Language Models (LLMs) have catalyzed transformative advances across a spectrum of natural language processing tasks.
We propose an innovative metacognitive approach, dubbed CLEAR, to equip LLMs with capabilities for self-aware error identification and correction.
arXiv Detail & Related papers (2024-03-08T19:18:53Z) - Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy [84.11508381847929]
Sparsely activated Mixture-of-Experts (SMoE) has shown promise to scale up the learning capacity of neural networks.
We propose M-SMoE, which leverages routing statistics to guide expert merging.
Our MC-SMoE achieves up to an 80% reduction in memory and a 20% reduction in FLOPs, with virtually no loss in performance.
arXiv Detail & Related papers (2023-10-02T16:51:32Z) - Implicit Mixture of Interpretable Experts for Global and Local Interpretability [0.0]
We investigate the feasibility of using mixtures of interpretable experts (MoIE) to build interpretable image classifiers on MNIST10.
We find that a naively trained MoIE will learn to 'cheat', whereby the black-box router will solve the classification problem by itself.
We propose a novel implicit parameterization scheme that allows us to build mixtures of arbitrary numbers of experts.
arXiv Detail & Related papers (2022-12-01T12:54:42Z) - Interpretable Mixture of Experts [71.55701784196253]
Interpretable Mixture of Experts (IME) is an inherently-interpretable modeling framework.
IME is demonstrated to be more accurate than single interpretable models and to perform comparably in accuracy with existing state-of-the-art Deep Neural Networks (DNNs).
IME's explanations are compared to commonly used post-hoc explanation methods through a user study.
arXiv Detail & Related papers (2022-06-05T06:40:15Z) - Sparsely-gated Mixture-of-Expert Layers for CNN Interpretability [3.021134753248103]
Sparsely-gated Mixture-of-Experts (MoE) layers have been successfully applied to scale up large transformers.
In this work, we apply sparse MoE layers to CNNs for computer vision tasks and analyze the resulting effect on model interpretability.
arXiv Detail & Related papers (2022-04-22T09:40:23Z) - LAP: An Attention-Based Module for Concept Based Self-Interpretation and Knowledge Injection in Convolutional Neural Networks [2.8948274245812327]
We propose a new attention-based pooling layer, called Local Attention Pooling (LAP), that accomplishes self-interpretability.
LAP is easily pluggable into any convolutional neural network, even the already trained ones.
LAP offers interpretations that are more valid, human-understandable, and faithful to the model than commonly used white-box explainer methods.
arXiv Detail & Related papers (2022-01-27T21:10:20Z) - DEMix Layers: Disentangling Domains for Modular Language Modeling [92.57761975953453]
We introduce a new domain expert mixture (DEMix) layer that enables conditioning a language model (LM) on the domain of the input text.
A DEMix layer is a collection of expert feedforward networks, each specialized to a domain.
Experiments show that DEMix layers reduce test-time perplexity, increase training efficiency, and enable rapid adaptation with little overhead (see the sketch after this list).
arXiv Detail & Related papers (2021-08-11T05:15:33Z)
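The DEMix entry above describes routing by domain rather than by a learned gate. The sketch below, in the same illustrative NumPy style as the earlier snippet, shows the basic idea of a domain-keyed expert feedforward layer; the domain names, dimensions, and hard one-expert routing are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch only: one feedforward expert per domain, chosen by the
# domain label of the input text. Domains and shapes are assumptions.
import numpy as np

d_model, d_ff = 32, 64
domains = ["news", "reviews", "biomedical", "code"]

rng = np.random.default_rng(1)
experts = {
    name: (rng.standard_normal((d_model, d_ff)) * 0.02,
           rng.standard_normal((d_ff, d_model)) * 0.02)
    for name in domains
}

def demix_layer(x, domain):
    """Route the hidden state through the feedforward expert for `domain`."""
    w_in, w_out = experts[domain]
    return np.maximum(x @ w_in, 0.0) @ w_out

x = rng.standard_normal(d_model)
y = demix_layer(x, "biomedical")   # swapping the label adapts the layer to another domain
print(y.shape)                     # (32,)
```

Under this toy setup, adding a new domain amounts to adding one more expert pair to the dictionary, which is consistent with the rapid-adaptation claim in the summary above.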