Efficient Labelling of Affective Video Datasets via Few-Shot &
Multi-Task Contrastive Learning
- URL: http://arxiv.org/abs/2308.02173v1
- Date: Fri, 4 Aug 2023 07:19:08 GMT
- Title: Efficient Labelling of Affective Video Datasets via Few-Shot &
Multi-Task Contrastive Learning
- Authors: Ravikiran Parameshwara, Ibrahim Radwan, Akshay Asthana, Iman
Abbasnejad, Ramanathan Subramanian and Roland Goecke
- Abstract summary: We propose Multi-Task Contrastive Learning for Affect Representation (MT-CLAR) for few-shot affect inference.
MT-CLAR combines multi-task learning with a Siamese network trained via contrastive learning to infer from a pair of expressive facial images.
We extend the image-based MT-CLAR framework for automated video labelling.
- Score: 5.235294751659532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Whilst deep learning techniques have achieved excellent emotion prediction,
they still require large amounts of labelled training data, which are (a)
onerous and tedious to compile, and (b) prone to errors and biases. We propose
Multi-Task Contrastive Learning for Affect Representation (\textbf{MT-CLAR})
for few-shot affect inference. MT-CLAR combines multi-task learning with a
Siamese network trained via contrastive learning to infer from a pair of
expressive facial images (a) the (dis)similarity between the facial
expressions, and (b) the difference in valence and arousal levels of the two
faces. We further extend the image-based MT-CLAR framework for automated video
labelling where, given one or a few labelled video frames (termed
\textit{support-set}), MT-CLAR labels the remainder of the video for valence
and arousal. Experiments are performed on the AFEW-VA dataset with multiple
support-set configurations; moreover, supervised learning on representations
learnt via MT-CLAR is used for valence, arousal and categorical emotion
prediction on the AffectNet and AFEW-VA datasets. The results show that valence
and arousal predictions via MT-CLAR are very comparable to the state-of-the-art
(SOTA), and we significantly outperform SOTA with a support-set $\approx$6\%
the size of the video dataset.
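As a rough illustration of the pairwise setup described in the abstract, the PyTorch sketch below pairs a shared encoder (Siamese branches) with a contrastive objective on expression (dis)similarity and a regression head for the valence/arousal difference between the two faces. The backbone (ResNet-18), embedding size, margin, loss weighting, and the names SiameseMTCLAR and mtclar_loss are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a Siamese multi-task model in the spirit of MT-CLAR.
# Backbone, head sizes, margin and loss weight are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models


class SiameseMTCLAR(nn.Module):
    """Shared encoder applied to a pair of face crops, with two tasks:
    (a) expression (dis)similarity via a contrastive loss on the embeddings,
    (b) regression of the valence/arousal difference between the two faces."""

    def __init__(self, emb_dim: int = 512):
        super().__init__()
        backbone = models.resnet18(weights=None)          # assumed backbone
        backbone.fc = nn.Linear(backbone.fc.in_features, emb_dim)
        self.encoder = backbone
        # Pairwise regression head: predicts [delta_valence, delta_arousal].
        self.delta_va = nn.Sequential(
            nn.Linear(2 * emb_dim, 256), nn.ReLU(), nn.Linear(256, 2)
        )

    def forward(self, img_a, img_b):
        za = F.normalize(self.encoder(img_a), dim=-1)
        zb = F.normalize(self.encoder(img_b), dim=-1)
        dva = self.delta_va(torch.cat([za, zb], dim=-1))
        return za, zb, dva


def mtclar_loss(za, zb, dva_pred, same_expr, dva_true, margin=1.0, w=1.0):
    """Contrastive loss on the embedding pair plus an L2 loss on the predicted
    valence/arousal difference (the loss weight `w` is an assumption)."""
    dist = (za - zb).pow(2).sum(dim=-1).sqrt()
    contrastive = same_expr * dist.pow(2) + \
        (1 - same_expr) * F.relu(margin - dist).pow(2)
    regression = F.mse_loss(dva_pred, dva_true)
    return contrastive.mean() + w * regression
```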
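The support-set labelling step could then look like the following sketch, which reuses the model above: each unlabelled video frame is paired with a labelled support frame and receives that frame's valence/arousal label plus the predicted difference. The nearest-support pairing rule and the sign convention of the predicted difference are assumptions, not the paper's exact procedure.

```python
# Sketch of support-set video labelling with the pairwise model sketched above.
# Nearest-support selection in embedding space is an assumption for illustration.
import torch
import torch.nn.functional as F


@torch.no_grad()
def label_video(model, frames, support_frames, support_va):
    """frames: (N, 3, H, W); support_frames: (K, 3, H, W); support_va: (K, 2)."""
    model.eval()
    z_frames = F.normalize(model.encoder(frames), dim=-1)
    z_support = F.normalize(model.encoder(support_frames), dim=-1)

    # Cosine similarity between every frame and every support frame.
    sim = z_frames @ z_support.t()            # (N, K)
    nearest = sim.argmax(dim=1).tolist()      # index of the closest support frame

    preds = torch.empty(frames.size(0), 2)
    for i, k in enumerate(nearest):
        # Predicted valence/arousal difference from the support face to the query face.
        _, _, dva = model(support_frames[k:k + 1], frames[i:i + 1])
        preds[i] = support_va[k] + dva.squeeze(0)   # propagate the support label
    return preds
```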
Related papers
- eMotions: A Large-Scale Dataset for Emotion Recognition in Short Videos [7.011656298079659]
The prevalence of short videos (SVs) makes emotion recognition in SVs necessary.
Given the lack of SV emotion data, we introduce a large-scale dataset named eMotions, comprising 27,996 videos.
We also present an end-to-end baseline, AV-CPNet, which employs a video transformer to better learn semantically relevant representations.
arXiv Detail & Related papers (2023-11-29T03:24:30Z)
- Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We propose a learning-based vision-language pre-training approach that revisits contrastive models such as CLIP, representing both modalities with finite discrete tokens.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z)
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
- (Un)likelihood Training for Interpretable Embedding [30.499562324921648]
Cross-modal representation learning has become a new normal for bridging the semantic gap between text and visual data.
We propose two novel training objectives, likelihood and unlikelihood functions, to unroll semantics behind embeddings.
Combining both training objectives, we propose a new encoder-decoder network that learns interpretable cross-modal representations for ad-hoc video search.
arXiv Detail & Related papers (2022-07-01T09:15:02Z)
- Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE).
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z)
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose a pre-training model, MEmoBERT, for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- Self-supervised Contrastive Learning of Multi-view Facial Expressions [9.949781365631557]
Facial expression recognition (FER) has emerged as an important component of human-computer interaction systems.
We propose Contrastive Learning of Multi-view facial Expressions (CL-MEx) to exploit facial images captured simultaneously from different angles towards FER.
arXiv Detail & Related papers (2021-08-15T11:23:34Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)