THOR-MoE: Hierarchical Task-Guided and Context-Responsive Routing for Neural Machine Translation
- URL: http://arxiv.org/abs/2505.14173v1
- Date: Tue, 20 May 2025 10:27:19 GMT
- Title: THOR-MoE: Hierarchical Task-Guided and Context-Responsive Routing for Neural Machine Translation
- Authors: Yunlong Liang, Fandong Meng, Jie Zhou
- Abstract summary: We propose THOR-MoE to arm the MoE with hierarchical task-guided and context-responsive routing policies. THOR-MoE operates as a plug-and-play module compatible with existing Top-$k$~\cite{shazeer2017} and Top-$p$~\cite{huang-etal-2024-harder} routing schemes. For instance, compared with vanilla Top-$p$~\cite{huang-etal-2024-harder} routing, the context-aware manner achieves an average improvement of 0.75 BLEU with less than 22% activated parameters.
- Score: 80.25152370613186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The sparse Mixture-of-Experts (MoE) has achieved significant progress for neural machine translation (NMT). However, current MoE solutions have two limitations that may lead to sub-optimal performance: 1) they feed task knowledge of NMT (\emph{e.g.}, domain/linguistics-specific knowledge) directly into the MoE, although such knowledge is generally unavailable in practical applications, and they neglect the naturally grouped domain/linguistic properties; 2) expert selection depends only on the localized token representation, without considering the context that captures the state of each token from a global view. To address these limitations, we propose THOR-MoE, which arms the MoE with hierarchical task-guided and context-responsive routing policies. Specifically, it 1) first predicts the domain/language label and then extracts a mixed domain/language representation to allocate task-level experts in a hierarchical manner; 2) injects context information to enhance token routing within the pre-selected set of task-level experts, which helps each token be accurately routed to more specialized and suitable experts. Extensive experiments on multi-domain translation and multilingual translation benchmarks with different architectures consistently demonstrate the superior performance of THOR-MoE. Additionally, THOR-MoE operates as a plug-and-play module compatible with existing Top-$k$~\cite{shazeer2017} and Top-$p$~\cite{huang-etal-2024-harder} routing schemes, ensuring broad applicability across diverse MoE architectures. For instance, compared with vanilla Top-$p$~\cite{huang-etal-2024-harder} routing, the context-aware manner achieves an average improvement of 0.75 BLEU with less than 22\% activated parameters on multi-domain translation tasks.
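The context-responsive token routing described in the abstract can be sketched roughly as follows. This is an illustrative sketch, not the paper's implementation: the mean-pooled context vector, the `context_weight` mixing coefficient, and the router shapes are all assumptions made here for illustration, and the paper's actual routing additionally operates within a pre-selected task-level expert set.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def context_aware_topk_routing(tokens, w_router, k=2, context_weight=0.5):
    """Route each token to its top-k experts, mixing a global context
    vector (mean-pooled over the sequence) into the router input.

    tokens:   (seq_len, d_model) token representations
    w_router: (d_model, n_experts) router projection
    Returns expert indices (seq_len, k) and renormalized gate weights.
    """
    context = tokens.mean(axis=0, keepdims=True)      # global view of the sequence
    routed_input = tokens + context_weight * context  # inject context per token
    probs = softmax(routed_input @ w_router)          # (seq_len, n_experts)
    topk_idx = np.argsort(-probs, axis=-1)[:, :k]     # top-k expert ids per token
    topk_gates = np.take_along_axis(probs, topk_idx, axis=-1)
    topk_gates = topk_gates / topk_gates.sum(axis=-1, keepdims=True)
    return topk_idx, topk_gates
```

With `context_weight=0`, this degenerates to the vanilla Top-$k$ router that scores each token in isolation, which is the baseline behavior the abstract contrasts against.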
Related papers
- Mixture-of-Experts Meets In-Context Reinforcement Learning [29.866936147753368]
In this paper, we introduce T2MIR (Token- and Task-wise MoE for In-context RL), an innovative framework that introduces architectural advances of mixture-of-experts (MoE) into transformer-based decision models. Comprehensive experiments show that T2MIR significantly facilitates in-context learning capacity and outperforms various types of baselines.
arXiv Detail & Related papers (2025-06-05T06:29:14Z) - Beyond Vanilla Fine-Tuning: Leveraging Multistage, Multilingual, and Domain-Specific Methods for Low-Resource Machine Translation [1.9639956888747314]
This paper contributes to artificial intelligence by proposing two approaches for adapting large language models (msLLMs). As an application in engineering, these methods are implemented in NMT systems for Sinhala, Tamil, and English (six language pairs) in domain-specific, extremely low-resource settings. Our experiments reveal that these approaches enhance translation performance by an average of +1.47 bilingual evaluation understudy (BLEU) score compared to the standard single-stage fine-tuning baseline.
arXiv Detail & Related papers (2025-03-28T16:30:28Z) - IP-MOT: Instance Prompt Learning for Cross-Domain Multi-Object Tracking [13.977088329815933]
Multi-Object Tracking (MOT) aims to associate multiple objects across video frames.
Most existing approaches train and track within a single domain, resulting in a lack of cross-domain generalizability.
We develop IP-MOT, an end-to-end transformer model for MOT that operates without concrete textual descriptions.
arXiv Detail & Related papers (2024-10-30T14:24:56Z) - Glider: Global and Local Instruction-Driven Expert Router [83.785832410832]
"Model MoErging" methods prioritize generalization to unseen tasks at the expense of performance on held-in tasks.
We propose Global and Local Instruction Driven Expert Router (GLIDER) that integrates a multi-scale routing mechanism.
GLIDER achieves substantially improved held-in performance while maintaining strong generalization on held-out tasks.
arXiv Detail & Related papers (2024-10-09T17:59:14Z) - Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning [55.107329995417786]
Large language models (LLMs) have demonstrated impressive general understanding and generation abilities.
We establish a benchmark for multi-domain translation, featuring 25 German$\Leftrightarrow$English and 22 Chinese$\Leftrightarrow$English test sets.
We propose a domain Chain of Thought (CoT) fine-tuning technique that utilizes the intrinsic multi-domain intelligence of LLMs to improve translation performance.
arXiv Detail & Related papers (2024-10-03T16:15:04Z) - Dynamic Language Group-Based MoE: Enhancing Code-Switching Speech Recognition with Hierarchical Routing [8.36121848069236]
Mixture of Experts (MoE) is a promising approach for handling code-switching speech recognition (CS-ASR) tasks. This work proposes DLG-MoE, a Dynamic Language Group-based MoE, which can effectively handle the CS-ASR task. It supports different top-$k$ inference and streaming capabilities, and can also flexibly prune the model parameters to obtain a monolingual sub-model.
arXiv Detail & Related papers (2024-07-26T08:03:07Z) - One Prompt is not Enough: Automated Construction of a Mixture-of-Expert Prompts [110.94724216491753]
Large Language Models (LLMs) exhibit strong generalization capabilities when prompted with language instructions and in-context demos.
Various methods have been explored to automate instruction design, but they restrict the searched prompt to a single instruction.
We adopt the Mixture-of-Expert paradigm and divide the problem space into a set of sub-regions.
A two-phase process is developed to construct the specialized expert for each region.
A region-based joint search of an instruction per expert complements the demos assigned to it, yielding a synergistic effect.
arXiv Detail & Related papers (2024-06-28T23:05:08Z) - U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF [10.81723269312202]
Mixture-of-Experts (MoE) has been proposed as an energy-efficient path to larger and more capable language models.
We benchmark our proposed model on a large-scale inner-source dataset (160k hours).
arXiv Detail & Related papers (2024-04-25T08:34:21Z) - DAMO-NLP at SemEval-2023 Task 2: A Unified Retrieval-augmented System for Multilingual Named Entity Recognition [94.90258603217008]
The MultiCoNER II shared task aims to tackle multilingual named entity recognition (NER) in fine-grained and noisy scenarios.
Previous top systems in MultiCoNER I incorporate knowledge bases or gazetteers.
We propose a unified retrieval-augmented system (U-RaNER) for fine-grained multilingual NER.
arXiv Detail & Related papers (2023-05-05T16:59:26Z) - Non-Parametric Unsupervised Domain Adaptation for Neural Machine Translation [61.27321597981737]
$k$NN-MT has shown the promising capability of directly incorporating the pre-trained neural machine translation (NMT) model with domain-specific token-level $k$-nearest-neighbor retrieval.
We propose a novel framework that directly uses in-domain monolingual sentences in the target language to construct an effective datastore for $k$-nearest-neighbor retrieval.
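The token-level $k$-nearest-neighbor retrieval that $k$NN-MT builds on can be sketched roughly as follows; the squared-L2 distance, the softmax temperature, and the flat-array datastore layout here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def knn_token_distribution(query, keys, values, vocab_size, k=4, temperature=10.0):
    """Form a kNN next-token distribution from a (keys, values) datastore.

    query:  (d,) decoder hidden state at the current step
    keys:   (n, d) stored hidden states from in-domain data
    values: (n,) target-token ids aligned with keys
    Returns a probability distribution of shape (vocab_size,).
    """
    dists = ((keys - query) ** 2).sum(axis=-1)   # squared L2 distance to each entry
    nn = np.argsort(dists)[:k]                   # indices of the k nearest entries
    weights = np.exp(-dists[nn] / temperature)   # closer neighbors get more mass
    weights = weights / weights.sum()
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nn], weights)        # aggregate weights per token id
    return p_knn
```

In practice this kNN distribution is interpolated with the base NMT model's softmax output, so the retrieved in-domain datastore biases generation without retraining the model.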
arXiv Detail & Related papers (2021-09-14T11:50:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.