M&M: Multimodal-Multitask Model Integrating Audiovisual Cues in Cognitive Load Assessment
- URL: http://arxiv.org/abs/2403.09451v1
- Date: Thu, 14 Mar 2024 14:49:40 GMT
- Title: M&M: Multimodal-Multitask Model Integrating Audiovisual Cues in Cognitive Load Assessment
- Authors: Long Nguyen-Phuoc, Renald Gaboriau, Dimitri Delacroix, Laurent Navarro,
- Abstract summary: This paper introduces the M&M model, a novel multimodal-multitask learning framework, applied to the AVCAffe dataset for cognitive load assessment.
M&M uniquely integrates audiovisual cues through a dual-pathway architecture, featuring specialized streams for audio and video inputs.
A key innovation lies in its cross-modality multihead attention mechanism, fusing the different modalities for synchronized multitasking.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper introduces the M&M model, a novel multimodal-multitask learning framework, applied to the AVCAffe dataset for cognitive load assessment (CLA). M&M uniquely integrates audiovisual cues through a dual-pathway architecture, featuring specialized streams for audio and video inputs. A key innovation lies in its cross-modality multihead attention mechanism, fusing the different modalities for synchronized multitasking. Another notable feature is the model's three specialized branches, each tailored to a specific cognitive load label, enabling nuanced, task-specific analysis. While it shows modest performance compared to AVCAffe's single-task baseline, M&M demonstrates a promising framework for integrated multimodal processing. This work paves the way for future enhancements in multimodal-multitask learning systems, emphasizing the fusion of diverse data types for complex task handling.
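Read literally, the abstract implies two modality streams, a cross-attention fusion stage, and three parallel heads trained jointly. The PyTorch sketch below is a minimal, hedged rendering of that structure: the projection sizes, mean-pooling over time, symmetric audio-video cross-attention, and the classification heads are illustrative assumptions rather than the authors' published configuration, and the AVCAffe label names are only examples.

```python
# Minimal sketch of a dual-pathway, cross-attention, multitask architecture in the
# spirit of M&M. Layer sizes, pooling, and head design are assumptions for illustration.
import torch
import torch.nn as nn


class CrossModalFusionMultitask(nn.Module):
    def __init__(self, audio_dim=128, video_dim=512, d_model=256, n_heads=4, n_classes=3):
        super().__init__()
        # Specialized streams: project each modality into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        # Cross-modality multihead attention: audio tokens attend to video tokens
        # and vice versa, giving two fused streams.
        self.audio_to_video = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Three task-specific branches, one per cognitive load label
        # (e.g. three of AVCAffe's self-reported cognitive load scores).
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU(),
                           nn.Linear(d_model, n_classes))
             for _ in range(3)]
        )

    def forward(self, audio_feats, video_feats):
        # audio_feats: (B, Ta, audio_dim), video_feats: (B, Tv, video_dim)
        a = self.audio_proj(audio_feats)
        v = self.video_proj(video_feats)
        a_fused, _ = self.audio_to_video(query=a, key=v, value=v)  # audio attends to video
        v_fused, _ = self.video_to_audio(query=v, key=a, value=a)  # video attends to audio
        # Pool over time and concatenate the two fused representations.
        joint = torch.cat([a_fused.mean(dim=1), v_fused.mean(dim=1)], dim=-1)
        # One prediction per cognitive load label, trained jointly (multitask).
        return [branch(joint) for branch in self.branches]


# Example forward pass with random features standing in for audio/video embeddings.
model = CrossModalFusionMultitask()
logits = model(torch.randn(2, 50, 128), torch.randn(2, 16, 512))
print([t.shape for t in logits])  # three (2, n_classes) outputs, one per label
```

In a real pipeline the random tensors would be replaced by frame-level video embeddings and audio features from pre-trained encoders, and the three heads would be supervised with AVCAffe's cognitive load labels under a summed multitask loss.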
Related papers
- S3: A Simple Strong Sample-effective Multimodal Dialog System [61.31055673156622]
We present a conceptually simple yet powerful baseline for the multimodal dialog task, an S3 model, that achieves near state-of-the-art results.
The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector.
arXiv Detail & Related papers (2024-06-26T12:45:43Z)
- The Power of Noise: Toward a Unified Multi-modal Knowledge Graph Representation Framework [46.69058301083775]
A Multi-Modal Knowledge Graph (MMKG) representation learning framework is crucial for integrating structured knowledge into multi-modal Large Language Models (LLMs) at scale.
We propose a novel SNAG method that utilizes a Transformer-based architecture equipped with modality-level noise masking.
Our approach achieves SOTA performance across a total of ten datasets, demonstrating its robustness and versatility.
arXiv Detail & Related papers (2024-03-11T15:48:43Z)
- Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment [11.897888221717245]
This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment.
Our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks.
arXiv Detail & Related papers (2024-03-11T01:07:36Z)
- Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives [56.2139730920855]
We present a systematic analysis of MM-VUFMs specifically designed for road scenes.
Our objective is to provide a comprehensive overview of common practices, covering task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques.
We provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
arXiv Detail & Related papers (2024-02-05T12:47:09Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition [81.2011058113579]
We argue that both the feature multimodality and conversational contextualization should be properly modeled simultaneously during the feature disentanglement and fusion steps.
We propose a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism (CRM) for multimodal and context integration.
Our system achieves new state-of-the-art performance consistently.
arXiv Detail & Related papers (2023-08-08T18:11:27Z)
- D$^2$TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization [113.72253589338472]
The many-to-many multimodal summarization (M$^3$S) task aims to generate summaries in any language with document inputs in any language and the corresponding image sequence.
We propose a dual knowledge distillation and target-oriented vision modeling framework for the M$3$S task.
arXiv Detail & Related papers (2023-05-22T06:47:35Z)
- OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models [72.8156832931841]
Generalist models are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model.
We release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction.
arXiv Detail & Related papers (2022-12-08T17:07:09Z)
- Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis [19.07020276666615]
We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously.
We also design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to aid prediction and to learn richer sentiment-related interaction information; a generic sketch of the instance-based case follows this list.
arXiv Detail & Related papers (2022-10-26T08:24:15Z)
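The MMCL entry above mentions an instance-based contrastive objective between modalities. As a rough illustration of how such a term is commonly implemented (not MMCL's exact loss), here is a generic, symmetric InfoNCE sketch over paired uni-modal embeddings; the temperature value and the cosine-similarity formulation are assumptions.

```python
# Generic symmetric InfoNCE over paired uni-modal embeddings; an illustrative
# sketch of instance-based cross-modal contrastive learning, not MMCL's exact objective.
import torch
import torch.nn.functional as F


def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # z_a, z_b: (B, D) embeddings of the same B instances from two modalities.
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                    # (B, B) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)  # matching pairs on the diagonal
    # Each modality must retrieve its paired instance among the in-batch negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Example with random stand-in embeddings for the two modalities.
loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
```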
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.