Multi-modal embeddings using multi-task learning for emotion recognition
- URL: http://arxiv.org/abs/2009.05019v1
- Date: Thu, 10 Sep 2020 17:33:16 GMT
- Title: Multi-modal embeddings using multi-task learning for emotion recognition
- Authors: Aparna Khare, Srinivas Parthasarathy, Shiva Sundaram
- Abstract summary: General embeddings like word2vec, GloVe and ELMo have shown a lot of success in natural language tasks.
We extend the work from natural language understanding to multi-modal architectures that use audio, visual and textual information for machine learning tasks.
- Score: 20.973999078271483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: General embeddings like word2vec, GloVe and ELMo have shown a lot of success
in natural language tasks. The embeddings are typically extracted from models
that are built on general tasks such as skip-gram models and natural language
generation. In this paper, we extend the work from natural language
understanding to multi-modal architectures that use audio, visual and textual
information for machine learning tasks. The embeddings in our network are
extracted using the encoder of a transformer model trained using multi-task
training. We use person identification and automatic speech recognition as the
tasks in our embedding generation framework. We tune and evaluate the
embeddings on the downstream task of emotion recognition and demonstrate that
on the CMU-MOSEI dataset, the embeddings can be used to improve over previous
state-of-the-art results.
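As a rough, hedged illustration of the pipeline described in the abstract, the minimal PyTorch sketch below wires audio, visual and textual features into a shared transformer encoder with two training heads (person identification and a simplified frame-level ASR-style head), then reuses the pooled embedding for a downstream emotion classifier. All feature dimensions, layer counts, class counts, the mean-pooling step and the per-frame ASR head are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch (not the authors' code): a transformer encoder over fused
# audio/visual/text features, trained with two heads (person ID + a simplified
# ASR-style token head); its pooled output is reused for emotion recognition.
import torch
import torch.nn as nn


class MultiModalEncoder(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=512, text_dim=300,
                 d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        # Project each modality into a shared model dimension, then concatenate
        # along the time axis before the transformer encoder.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, audio, visual, text):
        x = torch.cat([self.audio_proj(audio),
                       self.visual_proj(visual),
                       self.text_proj(text)], dim=1)   # (B, T_a+T_v+T_t, d)
        h = self.encoder(x)
        return h, h.mean(dim=1)                        # frame-level + pooled embedding


class MultiTaskModel(nn.Module):
    """Embedding-generation stage: shared encoder + person-ID and ASR heads."""
    def __init__(self, n_speakers=100, vocab_size=1000, d_model=256):
        super().__init__()
        self.encoder = MultiModalEncoder(d_model=d_model)
        self.person_head = nn.Linear(d_model, n_speakers)  # utterance-level speaker ID
        self.asr_head = nn.Linear(d_model, vocab_size)     # frame-level token posteriors

    def forward(self, audio, visual, text):
        frames, pooled = self.encoder(audio, visual, text)
        return self.person_head(pooled), self.asr_head(frames)


class EmotionClassifier(nn.Module):
    """Downstream stage: reuse the pretrained encoder, add a small emotion head."""
    def __init__(self, pretrained_encoder, n_emotions=6, d_model=256):
        super().__init__()
        self.encoder = pretrained_encoder                  # frozen or fine-tuned
        self.emotion_head = nn.Linear(d_model, n_emotions)

    def forward(self, audio, visual, text):
        _, pooled = self.encoder(audio, visual, text)
        return self.emotion_head(pooled)


# Toy forward pass with random features (batch of 2, short sequences).
pretrain = MultiTaskModel()
speaker_logits, token_logits = pretrain(torch.randn(2, 50, 40),
                                        torch.randn(2, 20, 512),
                                        torch.randn(2, 12, 300))
emotion = EmotionClassifier(pretrain.encoder)
print(emotion(torch.randn(2, 50, 40), torch.randn(2, 20, 512),
              torch.randn(2, 12, 300)).shape)              # torch.Size([2, 6])
```

In practice the pretrained encoder would be fine-tuned (or kept frozen) on CMU-MOSEI emotion labels rather than fed random features; the toy forward pass above only verifies that the pieces fit together.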
Related papers
- SpeechVerse: A Large-scale Generalizable Audio Language Model [38.67969337605572]
SpeechVerse is a robust multi-task training and curriculum learning framework.
It combines pre-trained speech and text foundation models via a small set of learnable parameters.
Our empirical experiments reveal that our multi-task SpeechVerse model outperforms conventional task-specific baselines on 9 of the 11 tasks.
arXiv Detail & Related papers (2024-05-14T03:33:31Z) - MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments [82.67236400004826]
We introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions.
The MEM module enables MEIA to generate executable action plans based on diverse requirements and the robot's capabilities.
arXiv Detail & Related papers (2024-02-01T02:43:20Z) - Object-Centric Instruction Augmentation for Robotic Manipulation [29.491990994901666]
We introduce the Object-Centric Instruction Augmentation (OCI) framework to augment highly semantic and information-dense language instruction with position cues.
We utilize a Multi-modal Large Language Model (MLLM) to weave knowledge of object locations into natural language instruction.
We demonstrate that robotic manipulator imitation policies trained with our enhanced instructions outperform those relying solely on traditional language instructions.
arXiv Detail & Related papers (2024-01-05T13:54:45Z) - InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices.
Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z) - TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering larger language models with multi-turn interleaved multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z) - PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z) - Multimodal Representation Learning With Text and Images [2.998895355715139]
This project leverages multimodal AI and matrix factorization techniques for representation learning on text and image data simultaneously.
The learnt representations are evaluated using downstream classification and regression tasks.
arXiv Detail & Related papers (2022-04-30T03:25:01Z) - Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training [120.91411454661741]
We present a pre-trainable Universal Encoder-Decoder Network (Uni-EDEN) to facilitate both vision-language perception and generation.
Uni-EDEN is a two-stream Transformer-based structure consisting of three modules: object and sentence encoders that separately learn the representations of each modality, plus a decoder for generation.
arXiv Detail & Related papers (2022-01-11T16:15:07Z) - Self-Supervised learning with cross-modal transformers for emotion recognition [20.973999078271483]
Self-supervised learning has shown improvements on tasks with limited labeled datasets in domains like speech and natural language.
In this work, we extend self-supervised training to multi-modal applications.
arXiv Detail & Related papers (2020-11-20T21:38:34Z) - Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided (including all content) and is not responsible for any consequences of its use.