Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better
- URL: http://arxiv.org/abs/2409.18971v1
- Date: Thu, 12 Sep 2024 05:05:34 GMT
- Title: Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better
- Authors: Mengying Ge, Mingyang Li, Dongkai Tang, Pengbo Li, Kuo Liu, Shuhao Deng, Songbai Pu, Long Liu, Yang Song, Tao Zhang,
- Abstract summary: We present our solutions for emotion recognition in the sub-challenges of Multimodal Emotion Recognition Challenge (MER2024)
To mitigate the modal competition issue between audio and text, we adopt an early fusion strategy.
Our model ranks textbf2nd in both MER2024-SEMI and MER2024-NOISE, validating our method's effectiveness.
- Score: 9.378013909890374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present our solutions for emotion recognition in the sub-challenges of Multimodal Emotion Recognition Challenge (MER2024). To mitigate the modal competition issue between audio and text, we adopt an early fusion strategy based on a large language model, where joint training of audio and text is conducted initially. And the joint Audio-Text modal feature will be late-fused with other unimodal features. In order to solve the problems of data insufficiency and class imbalance, We use multiple turns of multi-model voting for data mining. Moreover, to enhance the quality of audio features, we employ speech source separation to preprocess audios. Our model ranks \textbf{2nd} in both MER2024-SEMI and MER2024-NOISE, validating our method's effectiveness.
Related papers
- Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models [56.776580717999806]
Real-world applications often involve processing multiple audio streams simultaneously.
We propose the first multi-audio evaluation benchmark that consists of 20 datasets from 11 multi-audio tasks.
We propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios.
arXiv Detail & Related papers (2024-09-27T12:06:53Z) - Leveraging Contrastive Learning and Self-Training for Multimodal Emotion Recognition with Limited Labeled Samples [18.29910296652917]
We present our submission solutions for the Semi-Supervised Learning Sub-Challenge (MER2024-SEMI)
This challenge tackles the issue of limited annotated data in emotion recognition.
Our proposed method is validated to be effective on the MER2024-SEMI Challenge, achieving a weighted average F-score of 88.25% and ranking 6th on the leaderboard.
arXiv Detail & Related papers (2024-08-23T11:33:54Z) - Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition [1.0690007351232649]
We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent.
Experiment results demonstrate that our proposed methods outperform the baseline with relative reductions of 22.1$%$ and 17.2$%$ in character error rate (CER) across multi accent test datasets.
arXiv Detail & Related papers (2024-07-03T11:35:52Z) - AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z) - Cascaded Cross-Modal Transformer for Audio-Textual Classification [30.643750999989233]
We propose to harness the inherent value of multimodal representations by transcribing speech using automatic speech recognition (ASR) models.
We thus obtain an audio-textual (multimodal) representation for each data sample.
We were declared the winning solution in the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge.
arXiv Detail & Related papers (2024-01-15T10:18:08Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z) - Collaborative Reasoning on Multi-Modal Semantic Graphs for
Video-Grounded Dialogue Generation [53.87485260058957]
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
The primary challenges of this task lie in (1) the difficulty of integrating video data into pre-trained language models (PLMs)
We propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities.
arXiv Detail & Related papers (2022-10-22T14:45:29Z) - M2FNet: Multi-modal Fusion Network for Emotion Recognition in
Conversation [1.3864478040954673]
We propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from visual, audio, and text modality.
It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data.
The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data.
arXiv Detail & Related papers (2022-06-05T14:18:58Z) - Audio-Oriented Multimodal Machine Comprehension: Task, Dataset and Model [51.42415340921237]
We propose a Dynamic Inter- and Intra-modality Attention (DIIA) model to fuse the two modalities (audio and textual)
We further develop a Multimodal Knowledge Distillation (MKD) module to enable our multimodal MC model to accurately predict the answers based only on either the text or the audio.
arXiv Detail & Related papers (2021-07-04T08:35:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.