Yuan 2.0-M32: Mixture of Experts with Attention Router
- URL: http://arxiv.org/abs/2405.17976v2
- Date: Wed, 29 May 2024 07:19:58 GMT
- Title: Yuan 2.0-M32: Mixture of Experts with Attention Router
- Authors: Shaohua Wu, Jiangang Luo, Xi Chen, Lingjun Li, Xudong Zhao, Tong Yu, Chao Wang, Yue Wang, Fei Wang, Weixu Qiao, Houbo He, Zeru Zhang, Zeyu Sun, Junxiong Mao, Chong Shen,
- Abstract summary: Yuan 2.0-M32 demonstrates competitive capability on coding, math, and various domains of expertise, with only 3.7B active parameters of 40B in total.
Yuan 2.0-M32 surpasses Llama3-70B on the MATH and ARC-Challenge benchmarks, with accuracy of 55.89 and 95.8 respectively.
- Score: 30.8849836244273
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Yuan 2.0-M32, with a base architecture similar to Yuan-2.0 2B, uses a mixture-of-experts architecture with 32 experts, of which 2 are active. A new router network, the Attention Router, is proposed and adopted for more efficient selection of experts, which improves accuracy compared to a model with a classical router network. Yuan 2.0-M32 is trained from scratch on 2000B tokens, and its training compute is only 9.25% of that of a dense model at the same parameter scale. Yuan 2.0-M32 demonstrates competitive capability on coding, math, and various domains of expertise, with only 3.7B active parameters out of 40B in total and 7.4 GFLOPs of forward computation per token, both of which are only 1/19 of Llama3-70B. Yuan 2.0-M32 surpasses Llama3-70B on the MATH and ARC-Challenge benchmarks, with accuracy of 55.89 and 95.8 respectively. The models and source code of Yuan 2.0-M32 are released on GitHub.
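The abstract names the Attention Router but does not spell out its internals here. As a rough illustration of the routing idea (each token activates 2 of 32 experts via an attention-style score against learnable expert representations), here is a minimal PyTorch sketch. The class name `AttentionStyleRouter` and all hyperparameters are illustrative assumptions; the actual Yuan 2.0-M32 router is defined in the released code and may differ in detail.

```python
# Minimal sketch of top-2 expert routing with an attention-style scorer.
# Illustration of the general idea only; not the exact Yuan 2.0-M32 router.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionStyleRouter(nn.Module):
    """Toy top-2 router: scores each token against learnable expert keys."""
    def __init__(self, d_model: int, n_experts: int = 32, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.query_proj = nn.Linear(d_model, d_model)
        # One learnable key vector per expert; token hidden states act as queries.
        self.expert_keys = nn.Parameter(torch.randn(n_experts, d_model) * d_model ** -0.5)

    def forward(self, hidden: torch.Tensor):
        # hidden: (num_tokens, d_model)
        q = self.query_proj(hidden)            # (T, d_model)
        logits = q @ self.expert_keys.t()      # (T, n_experts) attention-style scores
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)
        gates = F.softmax(top_vals, dim=-1)    # mixing weights over the selected experts
        return top_idx, gates

# Route 4 tokens of width 16 to 2 of 32 experts.
router = AttentionStyleRouter(d_model=16)
expert_ids, weights = router(torch.randn(4, 16))
print(expert_ids.shape, weights.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```

With only 2 of the 32 experts active per token, only the selected experts' feed-forward blocks are evaluated, which is how the model runs with 3.7B active parameters out of 40B in total.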
Related papers
- Malware Classification from Memory Dumps Using Machine Learning, Transformers, and Large Language Models [1.038088229789127]
This study investigates the performance of various classification models for a malware classification task using different feature sets and data configurations.
XGB achieved the highest accuracy of 87.42% using the Top 45 Features, outperforming all other models.
Deep learning models underperformed, with RNN achieving 66.71% accuracy and Transformers reaching 71.59%.
arXiv Detail & Related papers (2025-03-04T00:24:21Z)
- Preference Optimization for Reasoning with Pseudo Feedback [100.62603571434167]
We introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions as an evaluation against associated test cases.
We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks (a toy illustration of test-case-based pseudo feedback appears after this list).
arXiv Detail & Related papers (2024-11-25T12:44:02Z)
- Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent [84.84355125916994]
Hunyuan-Large is the largest open-source Transformer-based mixture of experts model.
We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks.
Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature.
arXiv Detail & Related papers (2024-11-04T16:56:26Z)
- GRIN: GRadient-INformed MoE [132.87651078514122]
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing.
We introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing.
Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data.
arXiv Detail & Related papers (2024-09-18T17:00:20Z)
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone [289.9290405258526]
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens.
It achieves 69% on MMLU and 8.38 on MT-bench, despite being small enough to be deployed on a phone.
We introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision.
arXiv Detail & Related papers (2024-04-22T14:32:33Z)
- Mixtral of Experts [57.411379935325435]
Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) language model.
Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks.
We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks.
arXiv Detail & Related papers (2024-01-08T18:47:34Z)
- Exploiting Sparsity in Pruned Neural Networks to Optimize Large Model Training [1.5301777464637454]
We propose a novel approach that exploits sparse subnetworks to optimize the memory utilization and communication in two popular algorithms for parallel deep learning.
We integrate our approach into AxoNN, a highly scalable framework for parallel deep learning, and demonstrate the reduction in communication time and memory utilization.
arXiv Detail & Related papers (2023-02-10T04:22:25Z)
- Robust Segmentation Models using an Uncertainty Slice Sampling Based Annotation Workflow [5.051373749267151]
We propose an uncertainty slice sampling (USS) strategy for semantic segmentation of 3D medical volumes.
We demonstrate the efficiency of USS on a liver segmentation task using multi-site data.
arXiv Detail & Related papers (2021-09-30T06:56:11Z)
- Utilizing Ensemble Learning for Performance and Power Modeling and Improvement of Parallel Cancer Deep Learning CANDLE Benchmarks [0.0]
In this paper, we utilize ensemble learning to combine linear, nonlinear, and tree-/rule-based machine learning methods.
We use the datasets collected for two parallel cancer deep learning CANDLE benchmarks, NT3 and P1B2.
We achieve up to 61.15% performance improvement and up to 62.58% energy saving for P1B2 and up to 55.81% performance improvement and up to 52.60% energy saving for NT3 on up to 24,576 cores.
arXiv Detail & Related papers (2020-11-12T21:18:20Z)
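As a follow-up to the pseudo-feedback entry above: the idea of labeling candidate solutions by evaluating them against associated test cases can be illustrated with a toy sketch. The helpers `pass_rate` and `make_preference_pair` and the pass-rate ranking rule are assumptions for illustration, not the paper's actual pipeline.

```python
# Toy sketch: derive a preference pair from candidate solutions by scoring
# them against test cases. Illustrative only; the actual pipeline may differ.
from typing import Callable, List, Tuple

def pass_rate(candidate: Callable[[int], int], tests: List[Tuple[int, int]]) -> float:
    """Fraction of (input, expected_output) test cases the candidate passes."""
    passed = 0
    for x, expected in tests:
        try:
            if candidate(x) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate simply fails that test
    return passed / len(tests)

def make_preference_pair(candidates, tests):
    """Return (chosen, rejected) by pass rate; ties yield no pair."""
    ranked = sorted(candidates, key=lambda c: pass_rate(c, tests), reverse=True)
    best, worst = ranked[0], ranked[-1]
    if pass_rate(best, tests) == pass_rate(worst, tests):
        return None
    return best, worst

# Example: two candidate implementations of "square a number".
tests = [(2, 4), (3, 9), (-1, 1)]
good = lambda x: x * x
bad = lambda x: x + x
print(make_preference_pair([bad, good], tests))  # (good, bad)
```

Pairs produced this way can then be fed to a standard preference-optimization objective in place of human-labeled feedback, which is the sense in which the feedback is "pseudo".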
This list is automatically generated from the titles and abstracts of the papers on this site.