JMI at SemEval 2024 Task 3: Two-step approach for multimodal ECAC using in-context learning with GPT and instruction-tuned Llama models
- URL: http://arxiv.org/abs/2403.04798v2
- Date: Tue, 2 Apr 2024 14:52:37 GMT
- Title: JMI at SemEval 2024 Task 3: Two-step approach for multimodal ECAC using in-context learning with GPT and instruction-tuned Llama models
- Authors: Arefa, Mohammed Abbas Ansari, Chandni Saxena, Tanvir Ahmad
- Abstract summary: This paper presents our system development for SemEval-2024 Task 3: "The Competition of Multimodal Emotion Cause Analysis in Conversations".
Effectively capturing emotions in human conversations requires integrating multiple modalities such as text, audio, and video.
Our proposed approach addresses these challenges with a two-step framework.
- Score: 0.9736758288065405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents our system development for SemEval-2024 Task 3: "The Competition of Multimodal Emotion Cause Analysis in Conversations". Effectively capturing emotions in human conversations requires integrating multiple modalities such as text, audio, and video. However, the complexities of these diverse modalities pose challenges for developing an efficient multimodal emotion cause analysis (ECA) system. Our proposed approach addresses these challenges with a two-step framework, and we adopt two different approaches in our implementation. In Approach 1, we employ instruction tuning with two separate Llama 2 models for emotion and cause prediction. In Approach 2, we use GPT-4V for conversation-level video description and employ in-context learning with annotated conversations using GPT-3.5. Our system achieved rank 4, and ablation experiments demonstrate that our proposed solutions yield significant performance gains. All experimental code is available on GitHub.
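To make Approach 2 concrete, the following is a minimal sketch of how the in-context learning step could be wired together: a GPT-4V-produced video description and a few annotated conversations are packed into a single prompt for GPT-3.5. The helper names (`build_prompt`, `format_conversation`), prompt wording, and data layout are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of Approach 2 (in-context learning with GPT-3.5). Prompt layout,
# helper names, and data fields are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def format_conversation(conv):
    """Render one conversation (dict with an 'utterances' list) as numbered text."""
    return "\n".join(
        f"U{i + 1} [{u['speaker']}]: {u['utterance']}"
        for i, u in enumerate(conv["utterances"])
    )

def build_prompt(video_description, demos, target_conv):
    """Compose a few-shot prompt: task description, annotated demos, then the query."""
    parts = [
        "Task: for each utterance, predict its emotion and the indices of the "
        "utterances that cause that emotion.",
        f"Video description (from a vision-language model): {video_description}",
    ]
    for d in demos:  # annotated conversations used as in-context examples
        parts.append("Example conversation:\n" + format_conversation(d))
        parts.append("Annotated emotion-cause pairs:\n" + d["annotation"])
    parts.append("Now annotate this conversation:\n" + format_conversation(target_conv))
    return "\n\n".join(parts)

def predict_emotion_cause_pairs(video_description, demos, target_conv):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": build_prompt(video_description, demos, target_conv)}],
    )
    return response.choices[0].message.content
```

In practice the demonstrations would be drawn from the annotated training split, and the model's free-text output would be parsed back into emotion-cause pairs for scoring.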
Related papers
- What Is Missing in Multilingual Visual Reasoning and How to Fix It [64.47951359580556]
We evaluate NLP models' multilingual, multimodal capabilities by testing on a visual reasoning task.
Proprietary systems such as GPT-4V currently obtain the best performance on this task, while open models lag behind.
Our interventions achieve the best open performance on this task in a zero-shot setting, boosting open model LLaVA by 13.4%.
arXiv Detail & Related papers (2024-03-03T05:45:27Z) - InstructERC: Reforming Emotion Recognition in Conversation with Multi-task Retrieval-Augmented Large Language Models [9.611864685207056]
We propose a novel approach, InstructERC, to reformulate the emotion recognition task from a discriminative framework to a generative framework based on Large Language Models (LLMs).
InstructERC makes three significant contributions: (1) it introduces a simple yet effective retrieval template module, which helps the model explicitly integrate multi-granularity dialogue supervision information; (2) it adds two emotion alignment tasks, speaker identification and emotion prediction, to implicitly model dialogue role relationships and future emotional tendencies in conversations; and (3) it unifies emotion labels across benchmarks through the feeling wheel to fit real application scenarios.
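As a rough illustration of the generative reformulation with retrieved demonstrations, the sketch below selects a few labeled utterances by lexical overlap and folds them into an instruction-style prompt for an LLM; the toy retriever and template text are assumptions, not InstructERC's actual retrieval template module.

```python
# Schematic generative-ERC prompt with retrieved demonstrations (not the
# paper's exact template); the lexical-overlap retriever is a stand-in.
def overlap_score(a, b):
    """Crude Jaccard similarity over lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def retrieve_demos(query_utterance, labeled_pool, k=3):
    """Pick the k labeled utterances most similar to the query."""
    ranked = sorted(labeled_pool,
                    key=lambda ex: overlap_score(query_utterance, ex["utterance"]),
                    reverse=True)
    return ranked[:k]

def build_erc_prompt(context, query_utterance, labeled_pool, labels):
    """Instruction-style prompt: label set, demos, dialogue context, query."""
    demos = retrieve_demos(query_utterance, labeled_pool)
    lines = [f"Choose one emotion label from: {', '.join(labels)}."]
    lines += [f'Utterance: "{d["utterance"]}" -> Emotion: {d["emotion"]}' for d in demos]
    lines.append("Dialogue context:\n" + "\n".join(context))
    lines.append(f'Utterance: "{query_utterance}" -> Emotion:')
    return "\n".join(lines)
```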
arXiv Detail & Related papers (2023-09-21T09:22:07Z) - On Robustness in Multimodal Learning [75.03719000820388]
Multimodal learning is defined as learning over multiple input modalities such as video, audio, and text.
We present a multimodal robustness framework to provide a systematic analysis of common multimodal representation learning methods.
arXiv Detail & Related papers (2023-04-10T05:02:07Z) - Which One Are You Referring To? Multimodal Object Identification in Situated Dialogue [50.279206765971125]
We explore three methods to tackle the problem of interpreting multimodal inputs from conversational and situational contexts.
Our best method, scene-dialogue alignment, improves the performance by 20% F1-score compared to the SIMMC 2.1 baselines.
arXiv Detail & Related papers (2023-02-28T15:45:20Z) - Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation [53.87485260058957]
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
The primary challenges of this task lie in (1) the difficulty of integrating video data into pre-trained language models (PLMs)
We propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities.
arXiv Detail & Related papers (2022-10-22T14:45:29Z) - Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides [57.86931911522967]
We test the capabilities of machine learning models in multimodal understanding of educational content.
Our dataset contains aligned slides and spoken language, for 180+ hours of video and 9000+ slides, with 10 lecturers from various subjects.
We introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches.
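The multi-instance learning idea can be illustrated with a generic MIL-style contrastive loss, in which each slide is matched to a bag of candidate speech segments through its best-matching segment; the PyTorch sketch below is a schematic stand-in, not PolyViLT's actual training objective.

```python
# Generic MIL-style contrastive loss: bag-level supervision when the exact
# slide-to-segment pairing inside a bag is unknown. Illustrative only.
import torch
import torch.nn.functional as F

def mil_contrastive_loss(slide_emb, speech_emb, temperature=0.07):
    """slide_emb: (B, D); speech_emb: (B, N, D), N candidate segments per slide."""
    slide_emb = F.normalize(slide_emb, dim=-1)
    speech_emb = F.normalize(speech_emb, dim=-1)
    # Similarity of every slide to every segment of every bag: (B, B, N)
    sim = torch.einsum("bd,cnd->bcn", slide_emb, speech_emb) / temperature
    # MIL step: a slide matches a bag through its best-matching segment.
    bag_sim = sim.max(dim=-1).values                                # (B, B)
    targets = torch.arange(bag_sim.size(0), device=bag_sim.device)  # diagonal positives
    return F.cross_entropy(bag_sim, targets)
```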
arXiv Detail & Related papers (2022-08-17T05:30:18Z) - Two-Aspect Information Fusion Model For ABAW4 Multi-task Challenge [41.32053075381269]
The task of ABAW is to predict frame-level emotion descriptors from videos.
We propose a novel end-to-end architecture to achieve full integration of different types of information.
arXiv Detail & Related papers (2022-07-23T01:48:51Z) - Scene-Aware Prompt for Multi-modal Dialogue Understanding and Generation [20.693465164885325]
This paper introduces the schemes of Team LingJing's experiments in NLPCC-2022-Shared-Task-4 Multi-modal Dialogue Understanding and Generation (MDUG).
The MDUG task can be divided into two phases: multi-modal context understanding and response generation.
To fully leverage the visual information for both scene understanding and dialogue generation, we propose the scene-aware prompt for the MDUG task.
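A scene-aware prompt in this spirit can be as simple as prepending visual scene information to the dialogue history before generation; in the toy sketch below the scene signal is a caption plus object tags, which are illustrative assumptions rather than Team LingJing's actual prompt format.

```python
# Toy scene-aware prompt: prepend scene information derived from the video
# to the dialogue history. Field names are illustrative assumptions.
def scene_aware_prompt(scene_caption, object_tags, dialogue_history):
    scene_block = f"Scene: {scene_caption}\nVisible objects: {', '.join(object_tags)}"
    history_block = "\n".join(f"{turn['speaker']}: {turn['text']}" for turn in dialogue_history)
    return f"{scene_block}\n\nDialogue so far:\n{history_block}\nAssistant:"
```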
arXiv Detail & Related papers (2022-07-05T05:54:20Z) - Multi-Task Learning for Situated Multi-Domain End-to-End Dialogue Systems [21.55075825370981]
We leverage multi-task learning techniques to train a GPT-2 based model on a more challenging dataset.
Our method achieves better performance on all sub-tasks, across domains, compared to task and domain-specific models.
arXiv Detail & Related papers (2021-10-11T12:36:30Z) - A Unified Pre-training Framework for Conversational AI [25.514505462661763]
PLATO-2 is trained via two-stage curriculum learning, in which a coarse-grained model is first trained to fit the simplified one-to-one mapping relationship.
PLATO-2 obtained 1st place in all three tasks, verifying its effectiveness as a unified framework for various dialogue systems.
arXiv Detail & Related papers (2021-05-06T07:27:11Z) - Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue tasks as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
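One common way to realize such a sequence-to-sequence formulation is to project video features into the language model's embedding space and prepend them to the dialogue tokens; the sketch below shows that pattern with GPT-2 from Hugging Face Transformers, as a generic illustration rather than the paper's exact architecture.

```python
# Generic pattern: prepend projected video features to text embeddings and
# fine-tune a pretrained LM end to end. Feature dimensions are assumptions.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
video_proj = nn.Linear(2048, model.config.n_embd)  # e.g. pooled 2048-d clip features

def video_dialogue_loss(video_feats, dialogue_text):
    """video_feats: (T, 2048) clip-level features; dialogue_text: history + response."""
    tok = tokenizer(dialogue_text, return_tensors="pt")
    text_emb = model.transformer.wte(tok.input_ids)        # (1, L, D)
    vid_emb = video_proj(video_feats).unsqueeze(0)         # (1, T, D)
    inputs_embeds = torch.cat([vid_emb, text_emb], dim=1)  # video first, then text
    # Mask the video positions with -100 so they are ignored by the LM loss.
    labels = torch.cat(
        [torch.full((1, vid_emb.size(1)), -100, dtype=torch.long), tok.input_ids],
        dim=1,
    )
    return model(inputs_embeds=inputs_embeds, labels=labels).loss
```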
arXiv Detail & Related papers (2020-06-27T08:24:26Z)