A Baseline Multimodal Approach to Emotion Recognition in Conversations
- URL: http://arxiv.org/abs/2602.00914v1
- Date: Sat, 31 Jan 2026 21:54:18 GMT
- Title: A Baseline Multimodal Approach to Emotion Recognition in Conversations
- Authors: Víctor Yeste, Rodrigo Rivas-Arévalo
- Abstract summary: We present a lightweight multimodal baseline for emotion recognition in conversations using the SemEval-2024 Task 3 dataset built from the sitcom Friends. The goal of this report is not to propose a novel state-of-the-art method, but to document an accessible reference implementation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a lightweight multimodal baseline for emotion recognition in conversations using the SemEval-2024 Task 3 dataset built from the sitcom Friends. The goal of this report is not to propose a novel state-of-the-art method, but to document an accessible reference implementation that combines (i) a transformer-based text classifier and (ii) a self-supervised speech representation model, with a simple late-fusion ensemble. We report the baseline setup and empirical results obtained under a limited training protocol, highlighting when multimodal fusion improves over unimodal models. This preprint is provided for transparency and to support future, more rigorous comparisons.
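The late-fusion ensemble described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the emotion label set, the equal default weighting, and all function names are assumptions made for the example. Each unimodal model (text classifier, speech model) is represented only by its output logits; the fusion step averages the resulting probability distributions and takes the argmax.

```python
# Minimal late-fusion sketch (hypothetical details; the paper's exact models,
# label set, and fusion weights are not specified here). Each unimodal model
# yields a probability distribution over emotion labels; the ensemble takes
# a weighted average and predicts the argmax.
import math

EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(text_probs, speech_probs, text_weight=0.5):
    """Weighted average of the two unimodal probability distributions."""
    w = text_weight
    return [w * t + (1 - w) * s for t, s in zip(text_probs, speech_probs)]

def predict(text_logits, speech_logits, text_weight=0.5):
    """Fuse the two modalities and return the most probable emotion label."""
    fused = late_fusion(softmax(text_logits), softmax(speech_logits), text_weight)
    return EMOTIONS[max(range(len(fused)), key=fused.__getitem__)]

# Example: the text model leans toward "joy", the speech model toward
# "surprise"; with equal weights the stronger evidence wins.
text_logits = [0.1, 0.0, 0.0, 2.5, 0.5, 0.0, 1.0]
speech_logits = [0.0, 0.0, 0.2, 1.0, 0.3, 0.0, 2.0]
print(predict(text_logits, speech_logits))  # → joy
```

In practice the two distributions would come from the transformer text classifier and the self-supervised speech model; the `text_weight` hyperparameter could be tuned on a validation split to decide when fusion helps over either unimodal model alone.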
Related papers
- ChatUMM: Robust Context Tracking for Conversational Interleaved Generation [44.19929499646892]
Unified multimodal models (UMMs) have achieved remarkable progress yet remain constrained by a single-turn interaction paradigm. We present ChatUMM, a conversational unified model that excels at robust context tracking to sustain interleaved multimodal generation. ChatUMM derives its capabilities from an interleaved multi-turn training strategy that models serialized text-image streams as a continuous conversational flow.
arXiv Detail & Related papers (2026-02-06T07:11:50Z)
- Query-Kontext: An Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I). We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal "kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z)
- PresentAgent: Multimodal Agent for Presentation Video Generation [30.274831875701217]
We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. To achieve this integration, PresentAgent employs a modular pipeline that segments the input document, plans and renders slide-style visual frames. Given the complexity of evaluating such multimodal outputs, we introduce PresentEval, a unified assessment framework powered by Vision-Language Models.
arXiv Detail & Related papers (2025-07-05T13:24:15Z)
- FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens [56.752362642658504]
We present FuseLIP, an alternative architecture for multimodal embedding. We propose a single transformer model which operates on an extended vocabulary of text and image tokens. We show that FuseLIP outperforms other approaches in multimodal embedding tasks such as VQA and text-guided image transformation retrieval.
arXiv Detail & Related papers (2025-06-03T17:27:12Z)
- A-MESS: Anchor based Multimodal Embedding with Semantic Synchronization for Multimodal Intent Recognition [3.4568313440884837]
We present the Anchor-based Multimodal Embedding with Semantic Synchronization (A-MESS) framework. We first design an Anchor-based Multimodal Embedding (A-ME) module that employs an anchor-based embedding fusion mechanism to integrate multimodal inputs. We develop a Semantic Synchronization (SS) strategy with the Triplet Contrastive Learning pipeline, which optimizes the process by synchronizing multimodal representations with label descriptions.
arXiv Detail & Related papers (2025-03-25T09:09:30Z)
- IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification [60.38841251693781]
We propose a novel framework to generate robust multi-modal object ReIDs. Our framework uses Modal Prefixes and InverseNet to integrate multi-modal information with semantic guidance from inverted text. Experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2025-03-13T13:00:31Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion [20.016192628108158]
Visual grounding is a common vision task that involves grounding descriptive sentences to the corresponding regions of an image.
Most existing methods use independent image-text encoding and apply complex hand-crafted modules or encoder-decoder architectures for modal interaction and query reasoning, a paradigm that utilizes only limited downstream data to fit the multi-modal feature fusion.
In this paper, we present a simple yet robust transformer-based framework, SimVG, for visual grounding.
arXiv Detail & Related papers (2024-09-26T04:36:19Z)
- Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework to exploit knowledge of multimodal models without manually annotating text.
A Transformer-based visual-text aggregation module is further designed to incorporate cross-modal temporal complementary information. Experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
arXiv Detail & Related papers (2023-10-16T07:08:39Z)
- Linguistic Structure Guided Context Modeling for Referring Image Segmentation [61.701577239317785]
We propose a "gather-propagate-distribute" scheme to model multimodal context by cross-modal interaction.
Our LSCM module builds a Dependency Parsing Tree Word Graph (DPT-WG) which guides all the words to include valid multimodal context of the sentence.
arXiv Detail & Related papers (2020-10-01T16:03:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.