Text-Routed Sparse Mixture-of-Experts Model with Explanation and Temporal Alignment for Multi-Modal Sentiment Analysis
- URL: http://arxiv.org/abs/2512.22741v1
- Date: Sun, 28 Dec 2025 01:58:30 GMT
- Title: Text-Routed Sparse Mixture-of-Experts Model with Explanation and Temporal Alignment for Multi-Modal Sentiment Analysis
- Authors: Dongning Rao, Yunbiao Zeng, Zhihua Jiang, Jujian Lv
- Abstract summary: This paper proposes the Text-routed sparse mixture-of-Experts model with eXplanation and Temporal alignment for MSA (TEXT). TEXT first augments explanations for MSA via Multi-modal Large Language Models (MLLMs), and then aligns the representations of audio and video through a novel temporality-oriented neural network block. TEXT achieves the best performance across four datasets among all tested models, including three recently proposed approaches and three MLLMs.
- Score: 1.7522684436505962
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human-interaction-involved applications underscore the need for Multi-modal Sentiment Analysis (MSA). Although many approaches have been proposed to address the subtle emotions in different modalities, the power of explanations and temporal alignment is still underexplored. Thus, this paper proposes the Text-routed sparse mixture-of-Experts model with eXplanation and Temporal alignment for MSA (TEXT). TEXT first augments explanations for MSA via Multi-modal Large Language Models (MLLMs), and then aligns the representations of audio and video through a novel temporality-oriented neural network block. TEXT aligns different modalities with explanations and introduces a new text-routed sparse mixture-of-experts with gate fusion. Our temporal alignment block merges the benefits of Mamba and temporal cross-attention. As a result, TEXT achieves the best performance across four datasets among all tested models, including three recently proposed approaches and three MLLMs. TEXT wins on at least four of the six metrics. For example, TEXT decreases the mean absolute error to 0.353 on the CH-SIMS dataset, a 13.5% reduction compared with recently proposed approaches.
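The abstract describes the temporal alignment block only at a high level (merging the benefits of Mamba and temporal cross-attention), so the PyTorch sketch below is an illustrative reading rather than the authors' implementation: each modality passes through a sequence model (a GRU stands in here for a real Mamba block, e.g. from the mamba-ssm package, to keep the sketch dependency-free), then audio and video attend to each other with temporal cross-attention. All module names, shapes, and the residual/LayerNorm placement are assumptions.

```python
import torch
import torch.nn as nn

class TemporalAlignBlock(nn.Module):
    """Hypothetical temporality-oriented alignment block: per-modality
    sequence modeling followed by bidirectional temporal cross-attention."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.audio_ssm = nn.GRU(d_model, d_model, batch_first=True)  # Mamba stand-in
        self.video_ssm = nn.GRU(d_model, d_model, batch_first=True)  # Mamba stand-in
        self.a2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, audio, video):
        # audio: (batch, Ta, d_model), video: (batch, Tv, d_model); Ta and Tv may differ
        a, _ = self.audio_ssm(audio)
        v, _ = self.video_ssm(video)
        # temporal cross-attention: audio steps attend over video steps and vice versa
        a_aligned, _ = self.a2v(query=a, key=v, value=v)
        v_aligned, _ = self.v2a(query=v, key=a, value=a)
        return self.norm_a(a + a_aligned), self.norm_v(v + v_aligned)
```

A real Mamba block would replace each GRU one-for-one, since both map (batch, time, d_model) to (batch, time, d_model).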
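Likewise, the paper does not spell out the text-routed sparse mixture-of-experts with gate fusion. A minimal sketch consistent with the name, assuming the router consumes only the text representation while learned sigmoid gates fuse the three modalities before the experts (the dimensions, top-k value, and gating form are all illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextRoutedSparseMoE(nn.Module):
    """Hypothetical sparse MoE: experts process gate-fused multimodal
    features, but routing decisions come from the text modality alone."""
    def __init__(self, d_model=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # routes on text only
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )
        # gate fusion: one sigmoid gate per modality, computed from all three
        self.gate = nn.Linear(3 * d_model, 3)

    def forward(self, text, audio, video):
        # text/audio/video: (batch, d_model) pooled utterance-level features
        g = torch.sigmoid(self.gate(torch.cat([text, audio, video], dim=-1)))
        fused = g[:, 0:1] * text + g[:, 1:2] * audio + g[:, 2:3] * video

        logits = self.router(text)                      # text-routed
        weights, idx = logits.topk(self.top_k, dim=-1)  # sparse: top-k experts
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(fused)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](fused[mask])
        return out

# illustrative usage on pooled features, e.g. the alignment block's outputs after pooling
moe = TextRoutedSparseMoE()
t, a, v = torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256)
out = moe(t, a, v)  # (4, 256) fused sentiment representation
```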
Related papers
- BRIDGE: Bootstrapping Text to Control Time-Series Generation via Multi-Agent Iterative Optimization and Diffusion Modeling [51.830134409330704]
Time-series Generation (TSG) is a prominent research area with broad applications in simulations, data augmentation, and counterfactual analysis. We argue that text can provide semantic insights, domain information, and instance-specific temporal patterns to guide and improve TSG. We introduce BRIDGE, a hybrid text-controlled TSG framework that integrates semantic prototypes with text descriptions to support domain-level guidance.
arXiv Detail & Related papers (2025-03-04T09:40:00Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples. We introduce a multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality. We propose a simple yet effective Test-time Adaptive Cross-modal (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z)
- AMPLE: Emotion-Aware Multimodal Fusion Prompt Learning for Fake News Detection [0.1499944454332829]
This paper introduces the Emotion-Aware Multimodal Fusion Prompt Learning (AMPLE) framework to address the above issue.
This framework extracts emotional elements from texts by leveraging sentiment analysis tools.
It then employs Multi-Head Cross-Attention (MCA) mechanisms and similarity-aware fusion methods to integrate multimodal data.
arXiv Detail & Related papers (2024-10-21T02:19:24Z)
- BlendX: Complex Multi-Intent Detection with Blended Patterns [4.852816974803059]
We present BlendX, a suite of refined datasets featuring more diverse patterns than their predecessors.
For dataset construction, we utilize both rule-based heuristics and a generative tool, OpenAI's ChatGPT, augmented with a similarity-driven strategy for utterance selection.
Experiments on BlendX reveal that state-of-the-art MID models struggle with the challenges posed by the new datasets.
arXiv Detail & Related papers (2024-03-27T06:13:04Z)
- AToM: Amortized Text-to-Mesh using 2D Diffusion [107.02696990299032]
Amortized Text-to-Mesh (AToM) is a feed-forward framework optimized across multiple text prompts simultaneously.
AToM directly generates high-quality textured meshes in less than 1 second, with around a 10-fold reduction in training cost.
AToM significantly outperforms state-of-the-art amortized approaches with over 4 times higher accuracy.
arXiv Detail & Related papers (2024-02-01T18:59:56Z)
- Multi-Sentence Grounding for Long-term Instructional Video [63.27905419718045]
We aim to establish an automatic, scalable pipeline for denoising a large-scale instructional dataset.
We construct a high-quality video-text dataset with multiple descriptive steps supervision, named HowToStep.
arXiv Detail & Related papers (2023-12-21T17:28:09Z)
- Incorporating Probing Signals into Multimodal Machine Translation via Visual Question-Answering Pairs [45.41083125321069]
Multimodal machine translation (MMT) systems exhibit decreased sensitivity to visual information when text inputs are complete.
A novel approach is proposed to generate parallel Visual Question-Answering (VQA) style pairs from the source text.
An MMT-VQA multitask learning framework is introduced to incorporate explicit probing signals from the dataset into the MMT training process.
arXiv Detail & Related papers (2023-10-26T04:13:49Z)
- Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing [72.56219471145232]
We propose an ST/MT multi-tasking framework with hard parameter sharing.
Our method reduces the speech-text modality gap via a pre-processing stage.
We show that our framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU.
arXiv Detail & Related papers (2023-09-27T17:48:14Z)
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z)