OrdMoE: Preference Alignment via Hierarchical Expert Group Ranking in Multimodal Mixture-of-Experts LLMs
- URL: http://arxiv.org/abs/2511.19023v1
- Date: Mon, 24 Nov 2025 11:59:31 GMT
- Title: OrdMoE: Preference Alignment via Hierarchical Expert Group Ranking in Multimodal Mixture-of-Experts LLMs
- Authors: Yuting Gao, Weihao Chen, Lan Wang, Ruihan Xu, Qingpei Guo, et al.
- Abstract summary: We propose OrdMoE, a novel preference alignment framework that bypasses the reliance on external human preferences. OrdMoE constructs an internal preference hierarchy by grouping experts into ranked tiers based on their per-token routing scores. This yields a zero-cost, self-supervised preference ordering over generated responses.
- Score: 22.92427011496289
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Preference learning has recently emerged as a pivotal strategy for post-training alignment of Multimodal Large Language Models (MLLMs). However, existing approaches predominantly rely on external human-annotated preference data, which is costly and labor-intensive to collect. In this work, we propose OrdMoE, a novel preference alignment framework that bypasses the reliance on external human preferences entirely by leveraging intrinsic signals within Mixture-of-Experts (MoE) architectures. Specifically, we observe that the router's expert selection scores implicitly encode a quality-aware ranking of responses (i.e., higher-scoring experts consistently generate higher-quality outputs). Building on this insight, OrdMoE constructs an internal preference hierarchy by grouping experts into ranked tiers based on their per-token routing scores and activating each tier separately to produce a sequence of responses with increasing quality. This yields a zero-cost, self-supervised preference ordering over generated responses, which can be directly optimized using standard preference learning objectives. Extensive experiments across multiple multimodal benchmarks demonstrate that OrdMoE significantly enhances both alignment and overall performance of multimodal Mixture-of-Experts LLMs, achieving competitive results without requiring any human-annotated preference data.
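The abstract describes the mechanism only at a high level, so the PyTorch sketch below is illustrative rather than the authors' implementation: ranking experts by their mean-pooled routing scores, splitting them into equal-size tiers, and training with a DPO-style loss over adjacent tiers are all assumptions here; only the overall structure (tier the experts by routing score, prefer outputs from higher tiers) comes from the abstract.

```python
# Minimal sketch of the OrdMoE idea from the abstract. Tier construction and
# the adjacent-pair DPO-style objective are illustrative assumptions, not the
# paper's released code.
import torch
import torch.nn.functional as F

def tiered_expert_masks(router_scores: torch.Tensor, num_tiers: int):
    """Split experts into ranked tiers by mean per-token routing score.

    router_scores: (num_tokens, num_experts) gate probabilities for one
    response. Returns boolean expert masks, highest-scoring tier first.
    """
    mean_score = router_scores.mean(dim=0)        # (num_experts,)
    order = mean_score.argsort(descending=True)   # best experts first
    masks = []
    for tier in order.chunk(num_tiers):           # equal-size tiers (assumed)
        mask = torch.zeros(router_scores.shape[-1], dtype=torch.bool)
        mask[tier] = True
        masks.append(mask)
    return masks

def ordinal_preference_loss(policy_logps, ref_logps, beta=0.1):
    """DPO-style loss over responses ordered best (tier 0) to worst.

    policy_logps / ref_logps: 1-D tensors of sequence log-probs, where index
    i holds the response decoded with only tier-i experts active. Each
    adjacent pair (i, i+1) acts as a (chosen, rejected) preference pair, so
    the tier ordering itself supplies the supervision signal.
    """
    margin = beta * (policy_logps - ref_logps)    # per-tier implicit reward
    pair_logits = margin[:-1] - margin[1:]        # chosen minus rejected
    return -F.logsigmoid(pair_logits).mean()
```

With, say, three tiers, decoding once per tier yields three responses whose assumed quality decreases with tier rank; the two adjacent pairs then feed the loss above with no human preference labels involved.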
Related papers
- GEM: Generative Entropy-Guided Preference Modeling for Few-shot Alignment of LLMs [5.1816417820270075]
In this paper, we propose a generative entropy-guided preference modeling approach named GEM for large language models (LLMs). Instead of training a discriminative reward model on preference data, we directly train the LLM to internalize a closed-loop optimization architecture. Experiments on general benchmarks and domain-specific tasks demonstrate that our GEM achieves significant improvements with few-shot preference data.
arXiv Detail & Related papers (2025-11-17T06:04:47Z) - When Data is the Algorithm: A Systematic Study and Curation of Preference Optimization Datasets [29.94723846950853]
We present the first comprehensive, data-centric analysis of popular open-source DPO corpora. We leverage the Magpie framework to annotate each sample for task category, input quality, and preference reward. This enables a scalable, fine-grained inspection of preference quality across datasets, revealing structural and qualitative discrepancies in reward margins.
arXiv Detail & Related papers (2025-11-14T06:12:16Z) - Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization [56.97588709890706]
LongMab-PO is a novel framework that generates high-quality and diverse responses for long-context modeling tasks. Experimental results show that LongMab-PO significantly improves the diversity and quality of preference data pairs.
arXiv Detail & Related papers (2025-08-19T16:33:55Z) - Multi-Level Aware Preference Learning: Enhancing RLHF for Complex Multi-Instruction Tasks [81.44256822500257]
RLHF has emerged as a predominant approach for aligning artificial intelligence systems with human preferences. However, RLHF exhibits insufficient compliance capabilities when confronted with complex multi-instruction tasks. We propose a novel Multi-level Aware Preference Learning (MAPL) framework, capable of enhancing multi-instruction capabilities.
arXiv Detail & Related papers (2025-05-19T08:33:11Z) - In-context Ranking Preference Optimization [65.5489745857577]
We propose an In-context Ranking Preference Optimization (IRPO) framework to optimize large language models (LLMs) based on ranking lists constructed during inference. We show IRPO outperforms standard DPO approaches in ranking performance, highlighting its effectiveness in aligning LLMs with direct in-context ranking preferences.
arXiv Detail & Related papers (2025-04-21T23:06:12Z) - Aligning LLMs with Individual Preferences via Interaction [51.72200436159636]
We train large language models (LLMs) that can "interact to align". We develop a multi-turn preference dataset containing 3K+ multi-turn conversations in tree structures. For evaluation, we establish the ALOE benchmark, consisting of 100 carefully selected examples and well-designed metrics to measure the customized alignment performance during conversations.
arXiv Detail & Related papers (2024-10-04T17:48:29Z) - Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities.
Specifically, it features modality-specific encoders with connectors for a unified multimodal representation.
We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z) - LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion [33.73671362609599]
Our framework consists of two modules: PairRanker and GenFuser.
PairRanker employs a specialized pairwise comparison method to distinguish subtle differences between candidate outputs.
GenFuser aims to merge the top-ranked candidates, generating an improved output.
arXiv Detail & Related papers (2023-06-05T03:32:26Z)
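The LLM-Blender entry above is the most mechanistic of the related papers, so a toy sketch may help: the two-stage rank-then-fuse structure follows its abstract, while `compare(a, b)` is a placeholder standing in for PairRanker's learned pairwise comparator, and the fusion prompt is an assumed stand-in for GenFuser.

```python
# Toy sketch of pairwise-ranking ensembling in the LLM-Blender style.
# `compare` is a placeholder assumption for a learned pairwise comparator.
from itertools import combinations

def rank_candidates(candidates, compare):
    """Rank candidate outputs by round-robin pairwise wins."""
    wins = {i: 0 for i in range(len(candidates))}
    for i, j in combinations(range(len(candidates)), 2):
        winner = i if compare(candidates[i], candidates[j]) else j
        wins[winner] += 1
    return sorted(wins, key=wins.get, reverse=True)

def fusion_prompt(question, candidates, order, top_k=3):
    """Build a GenFuser-style prompt asking a model to merge the top-k."""
    picked = [candidates[i] for i in order[:top_k]]
    numbered = "\n".join(f"Candidate {n + 1}: {c}" for n, c in enumerate(picked))
    return (f"Question: {question}\n{numbered}\n"
            "Merge the candidates above into a single improved answer.")
```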