MSE-Adapter: A Lightweight Plugin Endowing LLMs with the Capability to Perform Multimodal Sentiment Analysis and Emotion Recognition
- URL: http://arxiv.org/abs/2502.12478v1
- Date: Tue, 18 Feb 2025 03:06:29 GMT
- Title: MSE-Adapter: A Lightweight Plugin Endowing LLMs with the Capability to Perform Multimodal Sentiment Analysis and Emotion Recognition
- Authors: Yang Yang, Xunde Dong, Yupeng Qiang
- Abstract summary: Multimodal Sentiment Analysis (MSA) and Emotion Recognition in Conversations (ERC) methods based on pre-trained language models exhibit two primary limitations.
We propose the Multimodal Sentiment Analysis and Emotion Recognition Adapter (MSE-Adapter), a lightweight and adaptable plugin.
- Score: 3.2899630403451985
- License:
- Abstract: Current Multimodal Sentiment Analysis (MSA) and Emotion Recognition in Conversations (ERC) methods based on pre-trained language models exhibit two primary limitations: 1) Once trained for MSA and ERC tasks, these pre-trained language models lose their original generalized capabilities. 2) They demand considerable computational resources. As the size of pre-trained language models continues to grow, training larger multimodal sentiment analysis models using previous approaches could result in unnecessary computational cost. In response to this challenge, we propose the Multimodal Sentiment Analysis and Emotion Recognition Adapter (MSE-Adapter), a lightweight and adaptable plugin. This plugin enables a large language model (LLM) to carry out MSA or ERC tasks with minimal computational overhead (introducing only approximately 2.6M to 2.8M trainable parameters on top of the 6/7B models), while preserving the intrinsic capabilities of the LLM. In the MSE-Adapter, the Text-Guide-Mixer (TGM) module is introduced to establish explicit connections between non-textual and textual modalities through the Hadamard product. This allows non-textual modalities to better align with textual modalities at the feature level, promoting the generation of higher-quality pseudo tokens. Extensive experiments were conducted on four public English and Chinese datasets using consumer-grade GPUs and open-source LLMs (Qwen-1.8B, ChatGLM3-6B-base, and LLaMA2-7B) as the backbone. The results demonstrate the effectiveness of the proposed plugin. The code will be released on GitHub after a blind review.
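To make the TGM idea concrete, the sketch below illustrates how a text-guided Hadamard (element-wise) product could align pooled non-textual features with a text representation and map the fused result to pseudo tokens for a frozen LLM. This is a minimal, hypothetical PyTorch sketch: the feature dimensions, projection layers, pooling choices, and number of pseudo tokens are assumptions for illustration, not the authors' released implementation.

```python
# Illustrative sketch only: a TGM-style text-guided mixer. All dimensions,
# layer choices, and the pseudo-token mapping are assumptions, not the
# paper's code.
import torch
import torch.nn as nn


class TextGuideMixerSketch(nn.Module):
    def __init__(self, text_dim=4096, audio_dim=74, video_dim=35,
                 hidden_dim=128, num_pseudo_tokens=4):
        super().__init__()
        # Project each modality into a shared feature space.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        # Map fused features to pseudo tokens in the LLM embedding space.
        self.to_pseudo_tokens = nn.Linear(hidden_dim, num_pseudo_tokens * text_dim)
        self.num_pseudo_tokens = num_pseudo_tokens
        self.text_dim = text_dim

    def forward(self, text_feat, audio_feat, video_feat):
        # text_feat:  (batch, text_dim)   pooled text representation
        # audio_feat: (batch, audio_dim)  pooled acoustic features
        # video_feat: (batch, video_dim)  pooled visual features
        t = self.text_proj(text_feat)
        a = self.audio_proj(audio_feat)
        v = self.video_proj(video_feat)
        # Hadamard product: text features gate the non-textual features,
        # aligning them with the text at the feature level.
        fused = t * a + t * v
        pseudo = self.to_pseudo_tokens(fused)
        # (batch, num_pseudo_tokens, text_dim) pseudo tokens that could be
        # prepended to the frozen LLM's input embeddings.
        return pseudo.view(-1, self.num_pseudo_tokens, self.text_dim)
```

Because only the small projection layers are trained while the LLM stays frozen, a module of this shape stays in the low-millions of trainable parameters, consistent with the lightweight-plugin framing above.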
Related papers
- Boosting Text-To-Image Generation via Multilingual Prompting in Large Multimodal Models [43.16111789538798]
We build parallel multilingual prompts aimed at harnessing the multilingual capabilities of large multimodal models (LMMs).
Experiments on two LMMs across 3 benchmarks show that our method, PMT2I, achieves superior performance in general, compositional, and fine-grained assessments.
arXiv Detail & Related papers (2025-01-13T06:41:23Z) - Liquid: Language Models are Scalable Multi-modal Generators [112.71734051183726]
Liquid is an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation.
Unlike previous multimodal large language models (MLLMs), Liquid achieves this integration using a single large language model.
For the first time, Liquid uncovers a scaling law for the performance drop unavoidably brought by the unified training of visual and language tasks.
arXiv Detail & Related papers (2024-12-05T16:48:16Z) - RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, showing their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - Language Portability Strategies for Open-domain Dialogue with Pre-trained Language Models from High to Low Resource Languages [1.7436854281619139]
We propose a study of linguistic portability strategies of large pre-trained language models (PLMs) used for open-domain dialogue systems.
In particular, the target low-resource language (L_T) will be simulated with French, as it lacks task-specific resources.
arXiv Detail & Related papers (2024-07-01T14:20:54Z) - Prompt Highlighter: Interactive Control for Multi-Modal LLMs [50.830448437285355]
This study targets a critical aspect of multi-modal LLMs' (LLMs&VLMs) inference: explicit controllable text generation.
We introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans to interactively control the focus during generation.
We find that, during inference, guiding the models with highlighted tokens through the attention weights leads to more desired outputs.
arXiv Detail & Related papers (2023-12-07T13:53:29Z) - LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models [56.25156596019168]
This paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for large language models (LLMs).
Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
arXiv Detail & Related papers (2023-11-30T03:59:31Z) - Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics [32.123919380959485]
Multi-modal large language models (MLLMs) are trained based on large language models (LLMs).
While they excel in multi-modal tasks, the pure NLP abilities of MLLMs are often underestimated and left untested.
We show that visual instruction tuning, a prevailing strategy for transitioning LLMs into MLLMs, unexpectedly and interestingly helps models attain both improved truthfulness and ethical alignment.
arXiv Detail & Related papers (2023-09-13T17:57:21Z) - Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models [69.59125732317972]
We propose a simple yet effective Retrieving-to-Answer (R2A) framework for VideoQA.
R2A first retrieves a set of semantically similar texts from a generic text corpus using a pre-trained multi-modal model.
With both the question and the retrieved texts, an LLM can be directly used to yield a desired answer.
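A minimal sketch of this retrieve-then-answer pattern is shown below; `encode_video`, `encode_text`, and `generate` are hypothetical stand-ins for a pre-trained multi-modal encoder and a frozen LLM, not the R2A authors' API.

```python
# Illustrative pseudo-implementation of retrieve-then-answer video QA,
# assuming the caller supplies encoder and LLM callables.
import numpy as np

def retrieve_to_answer(video, question, corpus,
                       encode_video, encode_text, generate, k=5):
    # Embed the video and every corpus sentence with the same multi-modal encoder.
    v = encode_video(video)                          # shape (d,)
    t = np.stack([encode_text(s) for s in corpus])   # shape (n, d)
    # Cosine similarity between the video and each candidate text.
    sims = t @ v / (np.linalg.norm(t, axis=1) * np.linalg.norm(v) + 1e-8)
    retrieved = [corpus[i] for i in np.argsort(-sims)[:k]]
    # The frozen LLM answers directly from the question plus retrieved texts.
    prompt = "Context: " + " ".join(retrieved) + "\nQuestion: " + question + "\nAnswer:"
    return generate(prompt)
```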
arXiv Detail & Related papers (2023-06-15T20:56:20Z) - Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models [77.2078051555533]
We propose a novel and affordable solution for the effective vision-language (VL) adaptation of large language models (LLMs), called Mixture-of-Modality Adaptation (MMA).
Instead of using large neural networks to connect the image encoder and the LLM, MMA adopts lightweight modules, i.e., adapters.
MMA is also equipped with a routing algorithm to help LLMs achieve an automatic shift between single- and multi-modal instructions.
arXiv Detail & Related papers (2023-05-24T11:06:15Z) - Towards Versatile and Efficient Visual Knowledge Integration into Pre-trained Language Models with Cross-Modal Adapters [16.44174900423759]
We propose a new plug-and-play module, X-adapter, to leverage the aligned visual and textual knowledge learned in pre-trained vision-language models.
Our method can significantly improve the performance on object-color reasoning and natural language understanding tasks.
arXiv Detail & Related papers (2023-05-12T10:08:46Z)