MMVA: Multimodal Matching Based on Valence and Arousal across Images, Music, and Musical Captions
- URL: http://arxiv.org/abs/2501.01094v1
- Date: Thu, 02 Jan 2025 06:36:09 GMT
- Title: MMVA: Multimodal Matching Based on Valence and Arousal across Images, Music, and Musical Captions
- Authors: Suhwan Choi, Kyu Won Kim, Myungjoo Kang
- Abstract summary: We introduce Multimodal Matching based on Valence and Arousal (MMVA)
MMVA is a tri-modal encoder framework designed to capture emotional content across images, music, and musical captions.
We employ multimodal matching scores based on the continuous valence (emotional positivity) and arousal (emotional intensity) values.
- Score: 7.733519760614755
- Abstract: We introduce Multimodal Matching based on Valence and Arousal (MMVA), a tri-modal encoder framework designed to capture emotional content across images, music, and musical captions. To support this framework, we expand the Image-Music-Emotion-Matching-Net (IMEMNet) dataset, creating IMEMNet-C, which includes 24,756 images and 25,944 music clips with corresponding musical captions. We employ multimodal matching scores based on the continuous valence (emotional positivity) and arousal (emotional intensity) values. This continuous matching score allows for random sampling of image-music pairs during training by computing similarity scores from the valence-arousal values across different modalities. Consequently, the proposed approach achieves state-of-the-art performance in valence-arousal prediction tasks. Furthermore, the framework demonstrates its efficacy in various zero-shot tasks, highlighting the potential of valence and arousal predictions in downstream applications.
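The abstract does not spell out the exact form of the matching score, but the core idea of pairing images and music by the similarity of their continuous valence-arousal annotations can be sketched as follows. The Gaussian kernel, the `sigma` bandwidth, and the function names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def va_matching_score(va_a, va_b, sigma=0.5):
    """Similarity of two (valence, arousal) pairs, in (0, 1].

    Assumption: a Gaussian kernel over the Euclidean distance in VA space
    stands in for the paper's unspecified continuous matching score.
    """
    diff = np.asarray(va_a, dtype=float) - np.asarray(va_b, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def sample_image_music_pair(image_va, music_va, rng=None):
    """Draw one image-music pair, weighting music clips by VA similarity.

    image_va, music_va: arrays of shape (N, 2) and (M, 2) holding
    (valence, arousal) annotations for each image / music clip.
    """
    rng = np.random.default_rng() if rng is None else rng
    i = int(rng.integers(len(image_va)))                     # pick an anchor image at random
    scores = np.array([va_matching_score(image_va[i], m) for m in music_va])
    probs = scores / scores.sum()                            # normalize scores into a distribution
    j = int(rng.choice(len(music_va), p=probs))              # sample a clip by VA similarity
    return i, j, float(scores[j])                            # indices plus their matching score
```

A larger `sigma` loosens the match and spreads probability mass over emotionally distant clips, while a smaller `sigma` concentrates sampling on near-identical valence-arousal values.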
Related papers
- JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts [8.463489896549161]
Video Action Detection (VAD) entails localizing and categorizing action instances within videos.
We introduce a novel multi-modal VAD architecture, referred to as the Joint Actor-centric Visual, Audio, Language (JoVALE)
JoVALE is the first VAD method to integrate audio and visual features with scene descriptive context sourced from large-capacity image captioning models.
arXiv Detail & Related papers (2024-12-18T10:51:31Z)
- MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers [18.72489078928417]
We propose a novel model that combines the audio-image and video modalities through an intuitive fusion approach.
Our empirical evaluations conducted on a benchmark action recognition dataset corroborate the model's remarkable performance.
arXiv Detail & Related papers (2023-08-01T11:00:25Z)
- Late multimodal fusion for image and audio music transcription [0.0]
Multimodal image and audio music transcription poses the challenge of effectively combining the information conveyed by the image and audio modalities.
We study four combination approaches in order to merge, for the first time, the hypotheses produced by end-to-end Optical Music Recognition (OMR) and Automatic Music Transcription (AMT) systems.
Two of the four strategies considered significantly improve the corresponding unimodal standard recognition frameworks.
arXiv Detail & Related papers (2022-04-06T20:00:33Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
- Emotion-Based End-to-End Matching Between Image and Music in Valence-Arousal Space [80.49156615923106]
Matching images and music with similar emotions might help to make emotion perceptions more vivid and stronger.
Existing emotion-based image and music matching methods either employ limited categorical emotion states or train the matching model using an impractical multi-stage pipeline.
In this paper, we study end-to-end matching between image and music based on emotions in the continuous valence-arousal (VA) space.
arXiv Detail & Related papers (2020-08-22T20:12:23Z)
- Score-informed Networks for Music Performance Assessment [64.12728872707446]
Deep neural network-based methods incorporating score information into MPA models have not yet been investigated.
We introduce three different models capable of score-informed performance assessment.
arXiv Detail & Related papers (2020-08-01T07:46:24Z)
- Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)
- $M^3$T: Multi-Modal Continuous Valence-Arousal Estimation in the Wild [86.40973759048957]
This report describes a multi-modal multi-task ($M^3$T) approach underlying our submission to the valence-arousal estimation track of the Affective Behavior Analysis in-the-wild (ABAW) Challenge.
In the proposed $M^3$T framework, we fuse both visual features from videos and acoustic features from the audio tracks to estimate the valence and arousal.
We evaluated the $M^3$T framework on the validation set provided by ABAW, where it significantly outperforms the baseline method.
arXiv Detail & Related papers (2020-02-07T18:53:13Z)
- Multi-Modal Music Information Retrieval: Augmenting Audio-Analysis with Visual Computing for Improved Music Video Analysis [91.3755431537592]
This thesis combines audio-analysis with computer vision to approach Music Information Retrieval (MIR) tasks from a multi-modal perspective.
The main hypothesis of this work is based on the observation that certain expressive categories such as genre or theme can be recognized on the basis of the visual content alone.
The experiments are conducted for three MIR tasks: Artist Identification, Music Genre Classification, and Cross-Genre Classification.
arXiv Detail & Related papers (2020-02-01T17:57:14Z)