MMVA: Multimodal Matching Based on Valence and Arousal across Images, Music, and Musical Captions
- URL: http://arxiv.org/abs/2501.01094v1
- Date: Thu, 02 Jan 2025 06:36:09 GMT
- Title: MMVA: Multimodal Matching Based on Valence and Arousal across Images, Music, and Musical Captions
- Authors: Suhwan Choi, Kyu Won Kim, Myungjoo Kang
- Abstract summary: We introduce Multimodal Matching based on Valence and Arousal (MMVA), a tri-modal encoder framework designed to capture emotional content across images, music, and musical captions. We employ multimodal matching scores based on the continuous valence (emotional positivity) and arousal (emotional intensity) values.
- Score: 7.733519760614755
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Multimodal Matching based on Valence and Arousal (MMVA), a tri-modal encoder framework designed to capture emotional content across images, music, and musical captions. To support this framework, we expand the Image-Music-Emotion-Matching-Net (IMEMNet) dataset, creating IMEMNet-C, which includes 24,756 images and 25,944 music clips with corresponding musical captions. We employ multimodal matching scores based on the continuous valence (emotional positivity) and arousal (emotional intensity) values. This continuous matching score allows for random sampling of image-music pairs during training by computing similarity scores from the valence-arousal values across different modalities. Consequently, the proposed approach achieves state-of-the-art performance in valence-arousal prediction tasks. Furthermore, the framework demonstrates its efficacy in various zero-shot tasks, highlighting the potential of valence and arousal predictions in downstream applications.
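The abstract only sketches how the continuous matching score is used; the snippet below is a minimal, hypothetical illustration of the idea, where the [0, 1] VA scaling, the distance-to-similarity mapping, and the softmax sampling temperature are assumptions rather than details taken from the paper.

```python
import numpy as np

def va_matching_score(va_a, va_b):
    """Similarity between two samples in valence-arousal space.

    Assumes VA values are scaled to [0, 1]; the maximum distance in the
    unit square is sqrt(2), so the normalized distance maps to [0, 1].
    """
    dist = np.linalg.norm(np.asarray(va_a) - np.asarray(va_b))
    return 1.0 - dist / np.sqrt(2.0)

def sample_music_for_image(rng, image_va, music_va_list, temperature=0.1):
    """Randomly draw a music clip for a given image, biased toward clips
    whose VA values are close to the image's (soft, continuous matching)."""
    scores = np.array([va_matching_score(image_va, m) for m in music_va_list])
    probs = np.exp(scores / temperature)
    probs /= probs.sum()
    return rng.choice(len(music_va_list), p=probs)

rng = np.random.default_rng(0)
image_va = (0.8, 0.6)                                 # positive, fairly energetic image
music_va = [(0.75, 0.65), (0.2, 0.3), (0.5, 0.9)]     # candidate clips
idx = sample_music_for_image(rng, image_va, music_va)
print(idx, va_matching_score(image_va, music_va[idx]))
```

Any monotone map from VA distance to similarity would serve the same purpose; the point is that continuous scores let arbitrary image-music pairs be drawn and weighted during training instead of relying on a fixed set of hard-labelled matches.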
Related papers
- MAVEN: Multi-modal Attention for Valence-Arousal Emotion Network [6.304608172789466]
MAVEN is a novel architecture for dynamic emotion recognition through dimensional modeling of affect.
Our approach employs modality-specific encoders to extract rich feature representations from synchronized video frames, audio segments, and transcripts.
MAVEN predicts emotions in a polar coordinate form, aligning with psychological models of the emotion circumplex.
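Predicting emotions "in a polar coordinate form" amounts to re-parameterizing the valence-arousal plane as an intensity (radius) and an angle on the emotion circumplex; a minimal sketch of that conversion, assuming valence and arousal in [-1, 1], might look like:

```python
import math

def va_to_polar(valence, arousal):
    """Convert Cartesian valence-arousal (assumed in [-1, 1]) to polar form:
    radius = overall emotional intensity, angle = position on the circumplex
    (0 rad = pure positive valence, pi/2 rad = pure high arousal)."""
    radius = math.hypot(valence, arousal)
    angle = math.atan2(arousal, valence)
    return radius, angle

print(va_to_polar(0.5, 0.5))   # ~(0.707, 0.785 rad), i.e. the "excited" region
```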
arXiv Detail & Related papers (2025-03-16T19:32:32Z) - Interactive Multimodal Fusion with Temporal Modeling [11.506800500772734]
Our approach integrates visual and audio information through a multimodal framework.
The visual branch uses a pre-trained ResNet model to extract features from facial images.
The audio branches employ pre-trained VGG models to extract VGGish and LogMel features from speech signals.
Our method achieves competitive performance on the Aff-Wild2 dataset, demonstrating effective multimodal fusion for VA estimation in-the-wild.
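As a rough, hypothetical sketch of this kind of visual-audio fusion for valence-arousal estimation (the embedding dimensions, the single fusion MLP, and the Tanh-bounded head below are assumptions, not the authors' architecture):

```python
import torch
import torch.nn as nn

class LateFusionVA(nn.Module):
    """Toy late-fusion regressor: concatenate precomputed visual (e.g. ResNet)
    and audio (e.g. VGGish/LogMel) embeddings and regress valence and arousal."""
    def __init__(self, visual_dim=512, audio_dim=128, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(visual_dim + audio_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),   # outputs: (valence, arousal)
            nn.Tanh(),              # keep predictions in [-1, 1]
        )

    def forward(self, visual_feat, audio_feat):
        return self.fuse(torch.cat([visual_feat, audio_feat], dim=-1))

model = LateFusionVA()
va = model(torch.randn(4, 512), torch.randn(4, 128))  # batch of 4 -> shape (4, 2)
print(va.shape)
```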
arXiv Detail & Related papers (2025-03-13T16:31:56Z) - JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts [8.463489896549161]
Video Action Detection (VAD) involves localizing and categorizing action instances in videos. We introduce a novel multi-modal VAD architecture called the Joint Actor-centric Visual, Audio, Language Encoder (JoVALE).
arXiv Detail & Related papers (2024-12-18T10:51:31Z) - Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z) - Late multimodal fusion for image and audio music transcription [0.0]
Multimodal image and audio music transcription comprises the challenge of effectively combining the information conveyed by the image and audio modalities.
We study four combination approaches to merge, for the first time, the hypotheses produced by end-to-end Optical Music Recognition (OMR) and Automatic Music Transcription (AMT) systems.
Two of the four strategies considered significantly improve on the corresponding unimodal standard recognition frameworks.
arXiv Detail & Related papers (2022-04-06T20:00:33Z) - Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z) - Emotion-Based End-to-End Matching Between Image and Music in Valence-Arousal Space [80.49156615923106]
Matching images and music with similar emotions can make the perceived emotions more vivid and intense.
Existing emotion-based image and music matching methods either employ limited categorical emotion states or train the matching model using an impractical multi-stage pipeline.
In this paper, we study end-to-end matching between image and music based on emotions in the continuous valence-arousal (VA) space.
arXiv Detail & Related papers (2020-08-22T20:12:23Z) - Score-informed Networks for Music Performance Assessment [64.12728872707446]
Deep neural network-based methods incorporating score information into music performance assessment (MPA) models have not yet been investigated.
We introduce three different models capable of score-informed performance assessment.
arXiv Detail & Related papers (2020-08-01T07:46:24Z) - Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z) - $M^3$T: Multi-Modal Continuous Valence-Arousal Estimation in the Wild [86.40973759048957]
This report describes a multi-modal multi-task ($M^3$T) approach underlying our submission to the valence-arousal estimation track of the Affective Behavior Analysis in-the-wild (ABAW) Challenge.
In the proposed $M^3$T framework, we fuse both visual features from videos and acoustic features from the audio tracks to estimate the valence and arousal.
We evaluate the $M^3$T framework on the validation set provided by ABAW; it significantly outperforms the baseline method.
arXiv Detail & Related papers (2020-02-07T18:53:13Z) - Multi-Modal Music Information Retrieval: Augmenting Audio-Analysis with Visual Computing for Improved Music Video Analysis [91.3755431537592]
This thesis combines audio-analysis with computer vision to approach Music Information Retrieval (MIR) tasks from a multi-modal perspective.
The main hypothesis of this work is based on the observation that certain expressive categories such as genre or theme can be recognized on the basis of the visual content alone.
The experiments are conducted for three MIR tasks: Artist Identification, Music Genre Classification, and Cross-Genre Classification.
arXiv Detail & Related papers (2020-02-01T17:57:14Z)