More Is Better: A MoE-Based Emotion Recognition Framework with Human Preference Alignment
- URL: http://arxiv.org/abs/2508.06036v1
- Date: Fri, 08 Aug 2025 05:44:26 GMT
- Title: More Is Better: A MoE-Based Emotion Recognition Framework with Human Preference Alignment
- Authors: Jun Xie, Yingjian Zhu, Feng Chen, Zhenghao Zhang, Xiaohui Fan, Hongzhu Yi, Xinming Wang, Chen Yu, Yue Bi, Zhaoran Zhao, Xiongjun Guan, Zhepeng Wang
- Abstract summary: We present our solution for the semi-supervised learning track (MER-SEMI) in MER2025. We propose a comprehensive framework, grounded in the principle that "more is better," to construct a robust Mixture of Experts (MoE) emotion recognition system. Our approach integrates a diverse range of input modalities as independent experts.
- Score: 24.56511209071154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present our solution for the semi-supervised learning track (MER-SEMI) in MER2025. We propose a comprehensive framework, grounded in the principle that "more is better," to construct a robust Mixture of Experts (MoE) emotion recognition system. Our approach integrates a diverse range of input modalities as independent experts, including novel signals such as knowledge from large Vision-Language Models (VLMs) and temporal Action Unit (AU) information. To effectively utilize unlabeled data, we introduce a consensus-based pseudo-labeling strategy, generating high-quality labels from the agreement between a baseline model and Gemini, which are then used in a two-stage training paradigm. Finally, we employ a multi-expert voting ensemble combined with a rule-based re-ranking process to correct prediction bias and better align the outputs with human preferences. Evaluated on the MER2025-SEMI challenge dataset, our method achieves an F1-score of 0.8772 on the test set, ranking 2nd in the track. Our code is available at https://github.com/zhuyjan/MER2025-MRAC25.
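As a rough illustration of the consensus pseudo-labeling and voting steps described in the abstract, the sketch below filters unlabeled samples by baseline/Gemini agreement and resolves experts by majority vote. All function names and the re-ranking hook are hypothetical assumptions, not the authors' released code; see the linked repository for the actual implementation.

```python
from collections import Counter

def consensus_pseudo_labels(baseline_preds, gemini_preds):
    """Keep an unlabeled sample only when the baseline model and Gemini
    agree on its label (a minimal reading of the consensus rule; any
    filtering thresholds in the real pipeline are not shown here)."""
    return {sid: lab for sid, lab in baseline_preds.items()
            if gemini_preds.get(sid) == lab}

def ensemble_predict(expert_labels, rerank=None):
    """Majority vote over per-expert labels for one sample, with an
    optional rule-based re-ranking hook for bias correction."""
    ranked = Counter(expert_labels).most_common()
    if rerank is not None:
        ranked = rerank(ranked)   # e.g. demote an over-predicted class
    return ranked[0][0]

# Hypothetical usage with three modality experts on one clip:
print(ensemble_predict(["happy", "happy", "neutral"]))  # -> happy
```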
Related papers
- Criteria-Based LLM Relevance Judgments [5.478764356647438]
Large Language Models (LLMs) provide a scalable solution by generating relevance labels directly through prompting.
We propose the Multi-Criteria framework for LLM-based relevance judgments, decomposing the notion of relevance into multiple criteria.
Our results demonstrate that Multi-Criteria judgments enhance the system ranking/leaderboard performance.
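The aggregation step this entry alludes to can be pictured, under assumptions, as a weighted mean over per-criterion grades; the criterion names, scale, and equal-weight default below are illustrative, not taken from the paper.

```python
def multi_criteria_relevance(grades, weights=None):
    """Aggregate per-criterion LLM grades (e.g. on a 0-3 scale) into one
    relevance score via a weighted mean; hypothetical sketch only."""
    weights = weights or {c: 1.0 for c in grades}
    return sum(weights[c] * g for c, g in grades.items()) / sum(weights.values())

# One query-document pair graded on three hypothetical criteria:
print(multi_criteria_relevance({"exactness": 3, "coverage": 2, "topicality": 3}))
```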
arXiv Detail & Related papers (2025-07-13T04:21:21Z) - MisoDICE: Multi-Agent Imitation from Unlabeled Mixed-Quality Demonstrations [5.4482836906033585]
We study offline imitation learning (IL) in cooperative multi-agent settings, where demonstrations have unlabeled mixed quality.
Our proposed solution is structured in two stages: trajectory labeling and multi-agent imitation learning.
We introduce MisoDICE, a novel multi-agent IL algorithm that leverages these labels to learn robust policies.
arXiv Detail & Related papers (2025-05-24T08:43:42Z) - Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning [59.56171041796373]
We harvest multi-modal instructional data in a robust and efficient manner.
We take interaction style as a diversity indicator and use a multi-modal rich styler to identify data instruction patterns.
Across 10+ experimental settings, validated by 14 multi-modal benchmarks, we demonstrate consistent improvements over random sampling, baseline strategies, and state-of-the-art selection methods.
arXiv Detail & Related papers (2025-03-17T17:11:22Z) - Ranked from Within: Ranking Large Multimodal Models Without Labels [73.96543593298426]
We show that uncertainty scores derived from softmax distributions provide a robust basis for ranking models across various tasks.
This facilitates the ranking of LMMs on unlabeled data, providing a practical approach for selecting models for diverse target domains without requiring manual annotation.
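A minimal sketch of one such uncertainty score, assuming mean softmax entropy as the ranking signal (the paper may use a different variant):

```python
import numpy as np

def mean_predictive_entropy(probs):
    """Mean Shannon entropy of per-sample softmax outputs
    (probs shape: [num_samples, num_classes])."""
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return float(entropy.mean())

# Rank two hypothetical models on the same unlabeled batch:
outputs = {"model_a": np.array([[0.9, 0.1], [0.8, 0.2]]),
           "model_b": np.array([[0.6, 0.4], [0.5, 0.5]])}
ranking = sorted(outputs, key=lambda m: mean_predictive_entropy(outputs[m]))
print(ranking)  # lower mean entropy first -> ['model_a', 'model_b']
```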
arXiv Detail & Related papers (2024-12-09T13:05:43Z) - Audio-Guided Fusion Techniques for Multimodal Emotion Analysis [2.7013910991626213]
We propose a solution for the semi-supervised learning track (MER-SEMI) in MER2024.
We fine-tuned video and text feature extractors, specifically CLIP-vit-large and Baichuan-13B, using labeled data.
We also propose an Audio-Guided Transformer (AGT) fusion mechanism, showing superior effectiveness in fusing both inter-channel and intra-channel information.
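A minimal cross-attention sketch in the spirit of AGT, with audio features acting as queries over the other modality; this is an assumption-laden reading of the mechanism, not the authors' module, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class AudioGuidedFusion(nn.Module):
    """Audio features query video/text features via cross-attention;
    the published AGT mechanism is more elaborate than this sketch."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, other):
        fused, _ = self.attn(audio, other, other)  # Q=audio, K=V=other
        return fused + audio                       # residual connection

audio = torch.randn(2, 10, 256)   # [batch, audio frames, dim]
video = torch.randn(2, 20, 256)   # [batch, video tokens, dim]
print(AudioGuidedFusion()(audio, video).shape)  # torch.Size([2, 10, 256])
```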
arXiv Detail & Related papers (2024-09-08T07:28:27Z) - Leveraging Contrastive Learning and Self-Training for Multimodal Emotion Recognition with Limited Labeled Samples [18.29910296652917]
We present our submission solutions for the Semi-Supervised Learning Sub-Challenge (MER2024-SEMI).
This challenge tackles the issue of limited annotated data in emotion recognition.
Our proposed method is validated to be effective on the MER2024-SEMI Challenge, achieving a weighted average F-score of 88.25% and ranking 6th on the leaderboard.
arXiv Detail & Related papers (2024-08-23T11:33:54Z) - SZTU-CMU at MER2024: Improving Emotion-LLaMA with Conv-Attention for Multimodal Emotion Recognition [65.19303535139453]
We present our winning approach for the MER-NOISE and MER-OV tracks of the MER2024 Challenge on multimodal emotion recognition.
Our system leverages the advanced emotional understanding capabilities of Emotion-LLaMA to generate high-quality annotations for unlabeled samples.
For the MER-OV track, our utilization of Emotion-LLaMA for open-vocabulary annotation yields an 8.52% improvement in average accuracy and recall compared to GPT-4V.
arXiv Detail & Related papers (2024-08-20T02:46:03Z) - SAM as the Guide: Mastering Pseudo-Label Refinement in Semi-Supervised Referring Expression Segmentation [66.92696817276288]
SemiRES is a semi-supervised framework that effectively leverages a combination of labeled and unlabeled data to perform RES.
SemiRES incorporates the Segment Anything Model (SAM), renowned for its precise boundary demarcation.
In instances where a precise mask cannot be matched from the available candidates, we develop the Pixel-Wise Adjustment (PWA) strategy.
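The mask-matching step can be pictured as IoU-based selection over SAM's candidate masks, with a fallback when no candidate is precise enough; the threshold value below is an assumption, not taken from the paper.

```python
import numpy as np

def refine_with_sam(pseudo_mask, sam_candidates, iou_thresh=0.5):
    """Adopt the SAM candidate with the highest IoU against the model's
    pseudo-mask; below the threshold, return None so the caller can
    fall back to pixel-wise adjustment instead."""
    def iou(a, b):
        union = np.logical_or(a, b).sum()
        return np.logical_and(a, b).sum() / union if union else 0.0

    best = max(sam_candidates, key=lambda m: iou(pseudo_mask, m))
    return best if iou(pseudo_mask, best) >= iou_thresh else None
```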
arXiv Detail & Related papers (2024-06-03T15:42:30Z) - A Confidence-based Partial Label Learning Model for Crowd-Annotated Named Entity Recognition [74.79785063365289]
Existing models for named entity recognition (NER) are mainly based on large-scale labeled datasets.
We propose a Confidence-based Partial Label Learning (CPLL) method to integrate the prior confidence (given by annotators) and posterior confidences (learned by models) for crowd-annotated NER.
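One way to picture integrating the two confidence sources is a simple convex blend over candidate labels, shown below purely as an illustrative assumption; CPLL's actual formulation is given in the paper.

```python
import torch

def blend_confidences(prior, posterior, alpha=0.5):
    """Convex combination of annotator (prior) and model (posterior)
    confidences over candidate labels, renormalized per token.
    The blending form and alpha are hypothetical."""
    mixed = alpha * prior + (1 - alpha) * posterior
    return mixed / mixed.sum(dim=-1, keepdim=True)

prior = torch.tensor([[0.6, 0.4, 0.0]])      # crowd annotator votes
posterior = torch.tensor([[0.3, 0.5, 0.2]])  # current model beliefs
print(blend_confidences(prior, posterior))   # tensor([[0.45, 0.45, 0.10]])
```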
arXiv Detail & Related papers (2023-05-21T15:31:23Z) - MER 2023: Multi-label Learning, Modality Robustness, and Semi-Supervised Learning [90.17500229142755]
The first Multimodal Emotion Recognition Challenge (MER 2023) was successfully held at ACM Multimedia.
This paper introduces the motivation behind this challenge, describes the benchmark dataset, and provides some statistics about participants.
We believe this high-quality dataset can become a new benchmark in multimodal emotion recognition, especially for the Chinese research community.
arXiv Detail & Related papers (2023-04-18T13:23:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.