GameVibe: A Multimodal Affective Game Corpus
- URL: http://arxiv.org/abs/2407.12787v1
- Date: Mon, 17 Jun 2024 10:52:52 GMT
- Title: GameVibe: A Multimodal Affective Game Corpus
- Authors: Matthew Barthet, Maria Kaselimi, Kosmas Pinitas, Konstantinos Makantasis, Antonios Liapis, Georgios N. Yannakakis
- Abstract summary: We present GameVibe, a novel affect corpus which consists of multimodal audiovisual stimuli.
The corpus consists of videos from a diverse set of publicly available gameplay sessions across 30 games.
- Score: 4.846739905880406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As online video and streaming platforms continue to grow, affective computing research has shifted towards more complex studies involving multiple modalities. However, there is still a lack of readily available datasets with high-quality audiovisual stimuli. In this paper, we present GameVibe, a novel affect corpus consisting of multimodal audiovisual stimuli, including in-game behavioural observations and third-person affect labels for viewer engagement. The corpus comprises videos from a diverse set of publicly available gameplay sessions across 30 games, with particular attention paid to ensuring high-quality stimuli and good audiovisual and gameplay diversity. Furthermore, we present an analysis of annotator reliability in terms of inter-annotator agreement.
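The abstract mentions an inter-annotator agreement analysis but does not name the statistic in this summary. As a rough illustration only (not the paper's method), the Python sketch below approximates agreement over hypothetical continuous engagement traces by the mean pairwise Spearman correlation between annotators.

```python
# Illustration only: agreement over continuous engagement traces,
# approximated by mean pairwise Spearman correlation. The actual measure
# used in the paper may differ.
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr


def mean_pairwise_agreement(traces: np.ndarray) -> float:
    """traces: (n_annotators, n_timesteps) engagement values for one video."""
    rhos = []
    for i, j in combinations(range(traces.shape[0]), 2):
        rho, _ = spearmanr(traces[i], traces[j])
        rhos.append(rho)
    return float(np.mean(rhos))


# Hypothetical data: 3 annotators, 100 time steps of engagement values.
rng = np.random.default_rng(0)
print(f"Mean pairwise Spearman rho: {mean_pairwise_agreement(rng.random((3, 100))):.3f}")
```

Rank correlation is only one possible proxy; an actual reliability analysis might instead use Krippendorff's alpha, Cohen's kappa, or a signed-agreement measure suited to continuous traces.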
Related papers
- Unsupervised Video Highlight Detection by Learning from Audio and Visual Recurrence [13.2968942989609]
We focus on unsupervised video highlight detection, eliminating the need for manual annotations.
Through a clustering technique, we identify pseudo-categories of videos and compute audio pseudo-highlight scores for each video.
We also compute visual pseudo-highlight scores for each video using visual features.
arXiv Detail & Related papers (2024-07-18T23:09:14Z)
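For the clustering-based pseudo-highlight idea summarised in the entry above, the exact scoring rule is not given here. The toy Python sketch below (an assumption, not the authors' implementation) clusters video-level audio features into pseudo-categories with k-means and scores each segment by cosine similarity to its video's category centroid.

```python
# Toy sketch (assumed scoring rule, not the paper's): cluster videos into
# pseudo-categories, then score each segment against its category centroid.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity


def audio_pseudo_highlight_scores(video_feats, segment_feats, n_categories=5):
    """video_feats: (n_videos, d) video-level audio features.
    segment_feats: list of (n_segments_i, d) arrays, one per video."""
    km = KMeans(n_clusters=n_categories, n_init=10, random_state=0).fit(video_feats)
    scores = []
    for vid_idx, segs in enumerate(segment_feats):
        # Centroid of the pseudo-category this video was assigned to.
        centroid = km.cluster_centers_[km.labels_[vid_idx]].reshape(1, -1)
        scores.append(cosine_similarity(segs, centroid).ravel())
    return scores


# Hypothetical example: 20 videos, 128-d features, 10 segments per video.
rng = np.random.default_rng(0)
videos = rng.random((20, 128))
segments = [rng.random((10, 128)) for _ in range(20)]
print(audio_pseudo_highlight_scores(videos, segments)[0])
```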
- SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos [77.55518265996312]
We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos.
Our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree.
arXiv Detail & Related papers (2024-04-08T05:19:28Z)
- Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning [3.6204417068568424]
We use dubbed versions of movies and television shows to augment cross-modal contrastive learning.
Our approach learns to represent alternate audio tracks, differing only in speech, similarly to the same video.
arXiv Detail & Related papers (2023-04-12T04:17:45Z)
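The counterfactual dubbed-audio idea above pairs one video with several audio tracks that differ only in speech. As a hedged illustration of the general objective (not the paper's code), the sketch below defines a multi-positive InfoNCE-style loss in which every audio track of a clip, original or dubbed, is a positive for that clip's video embedding.

```python
# Illustrative multi-positive contrastive loss (assumed formulation):
# all audio tracks of a clip are positives for its video embedding,
# while tracks from other clips in the batch serve as negatives.
import torch
import torch.nn.functional as F


def dubbed_audio_contrastive_loss(video_emb, audio_embs, temperature=0.07):
    """video_emb: (batch, d); audio_embs: (batch, n_tracks, d)."""
    b, n, d = audio_embs.shape
    v = F.normalize(video_emb, dim=-1)                      # (b, d)
    a = F.normalize(audio_embs.reshape(b * n, d), dim=-1)   # (b*n, d)
    logits = v @ a.t() / temperature                        # (b, b*n)
    # Positive mask: the n tracks belonging to the same clip index.
    pos = torch.zeros_like(logits)
    for i in range(b):
        pos[i, i * n:(i + 1) * n] = 1.0
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    return -(pos * log_prob).sum(dim=1).div(n).mean()


# Hypothetical shapes: batch of 4 clips, 3 audio tracks each, 256-d embeddings.
loss = dubbed_audio_contrastive_loss(torch.randn(4, 256), torch.randn(4, 3, 256))
print(loss.item())
```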
- CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective [30.995357472421404]
Video Saliency Prediction (VSP) imitates the selective attention mechanism of the human brain.
Most VSP methods exploit the semantic correlation between the visual and audio modalities but ignore the negative effects caused by temporal inconsistency between the two streams.
Inspired by the biological inconsistency-correction within multi-sensory information, a consistency-aware audio-visual saliency prediction network (CASP-Net) is proposed.
arXiv Detail & Related papers (2023-03-11T09:29:57Z)
- Predicting emotion from music videos: exploring the relative contribution of visual and auditory information to affective responses [0.0]
We present MusicVideos (MuVi), a novel dataset for affective multimedia content analysis.
The data were collected by presenting music videos to participants in three conditions: music, visual, and audiovisual.
arXiv Detail & Related papers (2022-02-19T07:36:43Z)
- APES: Audiovisual Person Search in Untrimmed Video [87.4124877066541]
We present the Audiovisual Person Search dataset (APES).
APES contains over 1.9K identities labeled along 36 hours of video.
A key property of APES is that it includes dense temporal annotations that link faces to speech segments of the same identity.
arXiv Detail & Related papers (2021-06-03T08:16:42Z)
- AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information and neglect the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)
- Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model [96.24038430433885]
We propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction methods.
arXiv Detail & Related papers (2021-03-29T09:09:39Z)
- Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervised information to train a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z)
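The co-attention mechanism named in the entry above is only sketched in the summary. The block below shows a generic cross-modal co-attention layer in PyTorch (an assumption about the general shape, not the paper's exact architecture): each modality's features attend over the other's and are combined residually.

```python
# Generic cross-modal co-attention block (assumed structure, illustrative only).
import torch
import torch.nn as nn


class CoAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, audio, visual):
        """audio: (batch, t_audio, dim); visual: (batch, t_visual, dim)."""
        # Audio queries attend over visual keys/values, and vice versa.
        audio_ctx, _ = self.a2v(query=audio, key=visual, value=visual)
        visual_ctx, _ = self.v2a(query=visual, key=audio, value=audio)
        return audio + audio_ctx, visual + visual_ctx


# Hypothetical shapes: 2 clips, 16 audio frames, 32 video frames, 128-d features.
block = CoAttention(dim=128)
a, v = block(torch.randn(2, 16, 128), torch.randn(2, 32, 128))
print(a.shape, v.shape)
```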
- Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing [48.87278703876147]
A new problem, named audio-visual video parsing, aims to parse a video into temporal event segments and label them as audible, visible, or both.
We propose a novel hybrid attention network to explore unimodal and cross-modal temporal contexts simultaneously.
Experimental results show that the challenging audio-visual video parsing can be achieved even with only video-level weak labels.
arXiv Detail & Related papers (2020-07-21T01:53:31Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)