Impact of annotation modality on label quality and model performance in
the automatic assessment of laughter in-the-wild
- URL: http://arxiv.org/abs/2211.00794v1
- Date: Wed, 2 Nov 2022 00:18:08 GMT
- Title: Impact of annotation modality on label quality and model performance in
the automatic assessment of laughter in-the-wild
- Authors: Jose Vargas-Quiros, Laura Cabrera-Quiros, Catharine Oertel, Hayley
Hung
- Abstract summary: It is unclear how perception and annotation of laughter differ when annotated from other modalities like video, via the body movements of laughter.
We ask whether annotations of laughter are congruent across modalities, and compare the effect that labeling modality has on machine learning model performance.
Our analysis of more than 4000 annotations acquired from 48 annotators revealed evidence of incongruity between modalities in the perception of laughter and its intensity.
- Score: 8.242747994568212
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Laughter is considered one of the most overt signals of joy. Laughter is
well-recognized as a multimodal phenomenon but is most commonly detected by
sensing the sound of laughter. It is unclear how perception and annotation of
laughter differ when annotated from other modalities like video, via the body
movements of laughter. In this paper we take a first step in this direction by
asking if and how well laughter can be annotated when only audio, only video
(containing full body movement information) or audiovisual modalities are
available to annotators. We ask whether annotations of laughter are congruent
across modalities, and compare the effect that labeling modality has on machine
learning model performance. We compare annotations and models for laughter
detection, intensity estimation, and segmentation, three tasks common in
previous studies of laughter. Our analysis of more than 4000 annotations
acquired from 48 annotators revealed evidence of incongruity between
modalities in the perception of laughter and its intensity. Further analysis
of annotations against consolidated audiovisual reference annotations revealed
that recall was on average lower for the video condition than for the audio
condition, but tended to increase with the intensity of the laughter samples.
Our machine learning experiments compared the performance of state-of-the-art
unimodal (audio-based, video-based and acceleration-based) and multi-modal
models for different combinations of input modalities, training label modality,
and testing label modality. Models with video and acceleration inputs had
similar performance regardless of training label modality, suggesting that it
may be entirely appropriate to train models for laughter detection from body
movements using video-acquired labels, despite their lower inter-rater
agreement.
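As a rough illustration of the recall analysis described above, the sketch below computes segment-level recall of one annotation condition against consolidated audiovisual reference segments using an overlap (IoU) criterion. The segment format, the 0.5 threshold, the toy numbers, and the function names are assumptions made for illustration, not the authors' actual pipeline.

```python
# Hypothetical sketch: segment-level recall of one annotation condition
# (e.g. video-only) against consolidated audiovisual reference segments.
# Segment format, IoU threshold, and toy data are illustrative assumptions.

def iou(a, b):
    """Intersection-over-union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def segment_recall(reference, annotated, min_iou=0.5):
    """Fraction of reference laughter segments matched by any annotation."""
    if not reference:
        return 0.0
    hits = sum(any(iou(r, a) >= min_iou for a in annotated) for r in reference)
    return hits / len(reference)

# Toy example: three reference laughs, two of which were also annotated in
# the video-only condition.
reference = [(1.0, 2.5), (10.0, 11.0), (30.0, 33.0)]
video_only = [(1.2, 2.4), (30.5, 32.0)]
print(segment_recall(reference, video_only))  # ~0.67 on these toy segments
```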
Related papers
- A New Perspective on Smiling and Laughter Detection: Intensity Levels
Matter [4.493507573183109]
We present a deep learning-based multimodal smile and laugh classification system.
We compare the use of audio and vision-based models as well as a fusion approach.
We show that, as expected, the fusion leads to a better generalization on unseen data.
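Score-level (late) fusion is one common way to combine audio and vision classifiers. The averaging below is only an illustrative assumption; the entry above does not specify which fusion scheme the authors use.

```python
# Illustrative late-fusion sketch (not the authors' method): average per-frame
# smile/laugh probabilities from hypothetical audio and video classifiers.
import numpy as np

def late_fusion(p_audio: np.ndarray, p_video: np.ndarray, w_audio: float = 0.5) -> np.ndarray:
    """Weighted average of per-frame probabilities from two unimodal models."""
    return w_audio * p_audio + (1.0 - w_audio) * p_video

# Toy per-frame probabilities from two hypothetical unimodal classifiers.
p_audio = np.array([0.1, 0.8, 0.9, 0.2])
p_video = np.array([0.2, 0.6, 0.7, 0.1])
print(late_fusion(p_audio, p_video))  # [0.15 0.7  0.8  0.15]
```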
arXiv Detail & Related papers (2024-03-04T15:15:57Z)
- Laughing Matters: Introducing Laughing-Face Generation using Diffusion Models [35.688696422879175]
We propose a novel model capable of generating realistic laughter sequences, given a still portrait and an audio clip containing laughter.
We train our model on a diverse set of laughter datasets and introduce an evaluation metric specifically designed for laughter.
Our model achieves state-of-the-art performance across all metrics, even when competing methods are re-trained for laughter generation.
arXiv Detail & Related papers (2023-05-15T17:59:57Z)
- LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
- Repetitive Activity Counting by Sight and Sound [110.36526333035907]
This paper strives for repetitive activity counting in videos.
Different from existing works, which all analyze the visual video content only, we incorporate for the first time the corresponding sound into the repetition counting process.
arXiv Detail & Related papers (2021-03-24T11:15:33Z)
- Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study where we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry most information, and including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z)
- Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervised information to train a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z)
- Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing [48.87278703876147]
A new problem, named audio-visual video parsing, aims to parse a video into temporal event segments and label them as audible, visible, or both.
We propose a novel hybrid attention network to explore unimodal and cross-modal temporal contexts simultaneously.
Experimental results show that the challenging audio-visual video parsing can be achieved even with only video-level weak labels.
arXiv Detail & Related papers (2020-07-21T01:53:31Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
- Curriculum Audiovisual Learning [113.20920928789867]
We present a flexible audiovisual model that introduces a soft-clustering module as the audio and visual content detector.
To ease the difficulty of audiovisual learning, we propose a novel learning strategy that trains the model from simple to complex scenes.
We show that our localization model significantly outperforms existing methods, and that it achieves comparable performance in sound separation without relying on external visual supervision.
arXiv Detail & Related papers (2020-01-26T07:08:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.