Self-Supervised learning with cross-modal transformers for emotion
recognition
- URL: http://arxiv.org/abs/2011.10652v1
- Date: Fri, 20 Nov 2020 21:38:34 GMT
- Title: Self-Supervised learning with cross-modal transformers for emotion
recognition
- Authors: Aparna Khare, Srinivas Parthasarathy, Shiva Sundaram
- Abstract summary: Self-supervised learning has shown improvements on tasks with limited labeled datasets in domains like speech and natural language.
In this work, we extend self-supervised training to multi-modal applications.
- Score: 20.973999078271483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emotion recognition is a challenging task due to limited availability of
in-the-wild labeled datasets. Self-supervised learning has shown improvements
on tasks with limited labeled datasets in domains like speech and natural
language. Models such as BERT learn to incorporate context in word embeddings,
which translates to improved performance in downstream tasks like question
answering. In this work, we extend self-supervised training to multi-modal
applications. We learn multi-modal representations using a transformer trained
on the masked language modeling task with audio, visual and text features. This
model is fine-tuned on the downstream task of emotion recognition. Our results
on the CMU-MOSEI dataset show that this pre-training technique can improve the
emotion recognition performance by up to 3% compared to the baseline.
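As a rough illustration of the pre-training scheme described above (not the authors' code), the sketch below projects audio, visual, and text features into a shared space, masks a fraction of time steps, and trains a transformer encoder to reconstruct the masked features. The feature dimensions (74-d audio, 35-d visual, 300-d text, roughly matching common CMU-MOSEI feature sets), masking ratio, and model sizes are illustrative assumptions.

```python
# Minimal sketch of masked multi-modal pre-training (an assumption-laden
# illustration, not the paper's implementation).
import torch
import torch.nn as nn


class CrossModalMaskedEncoder(nn.Module):
    def __init__(self, dims=None, d_model=256, n_heads=4, n_layers=4,
                 mask_ratio=0.15):
        super().__init__()
        # Illustrative per-modality feature sizes (roughly CMU-MOSEI-like).
        dims = dims or {"audio": 74, "visual": 35, "text": 300}
        self.proj = nn.ModuleDict({m: nn.Linear(d, d_model) for m, d in dims.items()})
        self.heads = nn.ModuleDict({m: nn.Linear(d_model, d) for m, d in dims.items()})
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        self.mask_ratio = mask_ratio

    def forward(self, feats):
        """feats: dict of (batch, time, dim) tensors, one entry per modality."""
        # Project each modality and concatenate along the time axis.
        tokens = torch.cat([self.proj[m](x) for m, x in feats.items()], dim=1)
        # Randomly mask a fraction of positions across all modalities.
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < self.mask_ratio
        masked = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        encoded = self.encoder(masked)
        # Reconstruct the original features at the masked positions only.
        losses, offset = [], 0
        for m, x in feats.items():
            t = x.shape[1]
            rec = self.heads[m](encoded[:, offset:offset + t])
            m_mask = mask[:, offset:offset + t]
            if m_mask.any():
                losses.append(((rec - x) ** 2)[m_mask].mean())
            offset += t
        return torch.stack(losses).mean()


model = CrossModalMaskedEncoder()
batch = {"audio": torch.randn(2, 50, 74),
         "visual": torch.randn(2, 50, 35),
         "text": torch.randn(2, 20, 300)}
loss = model(batch)  # pre-training loss to minimise
```

For the downstream task, the reconstruction heads would be dropped in favour of a small classification head over the pooled encoder output, fine-tuned on the emotion labels.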
Related papers
- Keypoint Action Tokens Enable In-Context Imitation Learning in Robotics [11.88216611522207]
We show that off-the-shelf text-based Transformers, with no additional training, can perform few-shot in-context visual imitation learning.
We achieve this by transforming visual observations into sequences of tokens that a text-pretrained Transformer can ingest and generate.
Despite being trained only on language, we show that these Transformers excel at translating tokenised visual keypoint observations into action trajectories.
arXiv Detail & Related papers (2024-03-28T17:04:00Z)
- Boosting Continuous Emotion Recognition with Self-Pretraining using Masked Autoencoders, Temporal Convolutional Networks, and Transformers [3.951847822557829]
We tackle the Valence-Arousal (VA) Estimation Challenge, Expression (Expr) Classification Challenge, and Action Unit (AU) Detection Challenge.
Our study advocates a novel approach aimed at refining continuous emotion recognition.
We achieve this by pre-training with Masked Autoencoders (MAE) on facial datasets, followed by fine-tuning on the Aff-Wild2 dataset annotated with expression (Expr) labels.
arXiv Detail & Related papers (2024-03-18T03:28:01Z)
- EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE [66.48689706116808]
Efficient Vision-languagE (EVE) is a unified multimodal Transformer pre-trained solely with one unified pre-training task.
EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts.
EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval.
arXiv Detail & Related papers (2023-08-23T07:36:30Z)
- Versatile audio-visual learning for emotion recognition [28.26077129002198]
This study proposes a versatile audio-visual learning (VAVL) framework for handling unimodal and multimodal systems.
We achieve this effective representation learning with audio-visual shared layers, residual connections over shared layers, and a unimodal reconstruction task.
Notably, VAVL attains a new state-of-the-art performance in the emotional prediction task on the MSP-IMPROV corpus.
arXiv Detail & Related papers (2023-05-12T03:13:37Z)
- XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding [73.24847320536813]
This study explores distilling visual information from pretrained multimodal transformers to pretrained language encoders.
Our framework is inspired by the success of cross-modal encoders in visual-language tasks, while the learning objective is altered to cater to the language-heavy characteristics of NLU.
arXiv Detail & Related papers (2022-04-15T03:44:00Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose a pre-training model, MEmoBERT, for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction (a text-only sketch of this idea appears after this list).
Our proposed MEmoBERT significantly enhances emotion recognition performance.
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- Multi-modal embeddings using multi-task learning for emotion recognition [20.973999078271483]
General embeddings like word2vec, GloVe and ELMo have shown a lot of success in natural language tasks.
We extend the work from natural language understanding to multi-modal architectures that use audio, visual and textual information for machine learning tasks.
arXiv Detail & Related papers (2020-09-10T17:33:16Z)
- Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)
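The prompt-based reformulation mentioned in the MEmoBERT entry above can be sketched, for the text modality only, with an off-the-shelf masked language model. This is a hypothetical illustration rather than the MEmoBERT setup (which also conditions on audio and visual features); the checkpoint, prompt template, and emotion word list are assumptions.

```python
# Text-only sketch of prompt-based emotion classification as masked word
# prediction (illustrative assumptions throughout; not the MEmoBERT model).
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Single-wordpiece emotion adjectives used as a simple verbalizer.
EMOTIONS = ["happy", "sad", "angry", "surprised", "scared"]


def classify(utterance: str) -> str:
    # Recast classification as filling the emotion word at the [MASK] slot.
    prompt = f"{utterance} The speaker feels {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    scores = logits[0, mask_pos, tokenizer.convert_tokens_to_ids(EMOTIONS)]
    return EMOTIONS[scores.argmax().item()]


print(classify("I can't believe we finally won the championship!"))
```

In the prompt-based paradigm the same masked-prediction objective would then be fine-tuned on labeled emotion data rather than used with a frozen masked language model.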