Related papers: Mixture of Parrots: Experts improve memorization more than reasoning

Mixture of Parrots: Experts improve memorization more than reasoning

URL: http://arxiv.org/abs/2410.19034v1
Date: Thu, 24 Oct 2024 17:54:41 GMT
Title: Mixture of Parrots: Experts improve memorization more than reasoning
Authors: Samy Jelassi, Clara Mohri, David Brandfonbrener, Alex Gu, Nikhil Vyas, Nikhil Anand, David Alvarez-Melis, Yuanzhi Li, Sham M. Kakade, Eran Malach,
Abstract summary: We show that as we increase the number of experts, the memorization performance consistently increases while the reasoning capabilities saturate. We find that increasing the number of experts helps solve knowledge-intensive tasks, but fails to yield the same benefits for reasoning tasks.
Score: 72.445819694797
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The Mixture-of-Experts (MoE) architecture enables a significant increase in the total number of model parameters with minimal computational overhead. However, it is not clear what performance tradeoffs, if any, exist between MoEs and standard dense transformers. In this paper, we show that as we increase the number of experts (while fixing the number of active parameters), the memorization performance consistently increases while the reasoning capabilities saturate. We begin by analyzing the theoretical limitations of MoEs at reasoning. We prove that there exist graph problems that cannot be solved by any number of experts of a certain width; however, the same task can be easily solved by a dense model with a slightly larger width. On the other hand, we find that on memory-intensive tasks, MoEs can effectively leverage a small number of active parameters with a large number of experts to memorize the data. We empirically validate these findings on synthetic graph problems and memory-intensive closed book retrieval tasks. Lastly, we pre-train a series of MoEs and dense transformers and evaluate them on commonly used benchmarks in math and natural language. We find that increasing the number of experts helps solve knowledge-intensive tasks, but fails to yield the same benefits for reasoning tasks.

Related papers

Frontier LLMs Still Struggle with Simple Reasoning Tasks [53.497499123166804]
This work studies the performance of frontier language models on a broad set of "easy" reasoning problems.<n>We create a suite of procedurally generated simple reasoning tasks, including counting, first-order logic, proof trees, and travel planning.<n>We show that even state-of-the-art thinking models consistently fail on such problems and for similar reasons.
arXiv Detail & Related papers (2025-07-09T22:22:49Z)
When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks [11.656636716718175]
compression of large language models (LLMs) offers an effective solution to reduce cost of computational resources. We benchmark compressed DeepSeek-R1 models on four different reasoning datasets. We find that parameter count has a much greater impact on LRMs' knowledge than memorization.
arXiv Detail & Related papers (2025-04-02T05:17:46Z)
Mixture of Lookup Experts [63.787712153454464]
Mixture-of-Experts (MoE) activates only a subset of experts during inference. MoLE is a new MoE architecture that is efficient in both communication and VRAM usage.
arXiv Detail & Related papers (2025-03-20T02:31:57Z)
Complexity Experts are Task-Discriminative Learners for Any Image Restoration [80.46313715427928]
We introduce complexity experts" -- flexible expert blocks with varying computational complexity and receptive fields. This preference effectively drives task-specific allocation, assigning tasks to experts with the appropriate complexity. The proposed MoCE-IR model outperforms state-of-the-art methods, affirming its efficiency and practical applicability.
arXiv Detail & Related papers (2024-11-27T15:58:07Z)
Beyond Parameter Count: Implicit Bias in Soft Mixture of Experts [44.09546603624385]
We introduce a notion of expert specialization for Soft MoE. We show that when there are many small experts, the architecture is implicitly biased in a fashion that allows us to efficiently approximate the specialized expert subset.
arXiv Detail & Related papers (2024-09-02T00:39:00Z)
Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the output probabilities and the pretraining data frequency. This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
arXiv Detail & Related papers (2024-07-20T21:24:40Z)
Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency. We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures. The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
Multi-Task Dense Prediction via Mixture of Low-Rank Experts [35.11968315125389]
We present a novel decoder-focused method for multi-task dense prediction, called Mixture-of-Low-Rank-Experts (MLoRE) To model the global task relationships, MLoRE adds a generic convolution path to the original MoE structure, where each task feature can go through this path for explicit parameter sharing. Our experiments show that our MLoRE achieves superior performance compared to previous state-of-the-art methods on all metrics.
arXiv Detail & Related papers (2024-03-26T14:40:17Z)
LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking [37.24438285812178]
LORS allows stacked modules to share the majority of parameters, requiring a much smaller number of unique ones per module to match or even surpass the performance of using entirely distinct ones. We validate our method by applying it to the stacked decoders of a query-based object detector, and conduct extensive experiments on the widely used MS COCO dataset.
arXiv Detail & Related papers (2024-03-07T08:10:59Z)
GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks. One essential and frequently occurring evidence is that when the math questions are slightly changed, LLMs can behave incorrectly. This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
Merging Experts into One: Improving Computational Efficiency of Mixture of Experts [71.44422347502409]
A sparse Mixture of Experts (MoE) can reduce the cost by activating a small subset of parameters. Can we retain the advantages of adding more experts without substantially increasing the computational costs? We propose a computation-efficient approach called textbftexttMerging Experts into One (MEO) which reduces the computation cost to that of a single expert.
arXiv Detail & Related papers (2023-10-15T13:28:42Z)
Faith and Fate: Limits of Transformers on Compositionality [109.79516190693415]
We investigate the limits of transformer large language models across three representative compositional tasks. These tasks require breaking problems down into sub-steps and synthesizing these steps into a precise answer. Our empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching.
arXiv Detail & Related papers (2023-05-29T23:24:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.