TextMI: Textualize Multimodal Information for Integrating Non-verbal
Cues in Pre-trained Language Models
- URL: http://arxiv.org/abs/2303.15430v2
- Date: Wed, 29 Mar 2023 04:49:46 GMT
- Title: TextMI: Textualize Multimodal Information for Integrating Non-verbal
Cues in Pre-trained Language Models
- Authors: Md Kamrul Hasan, Md Saiful Islam, Sangwu Lee, Wasifur Rahman, Iftekhar
Naim, Mohammed Ibrahim Khan, Ehsan Hoque
- Abstract summary: We propose TextMI as a general, competitive baseline for multimodal behavioral analysis tasks.
Our approach significantly reduces model complexity, adds interpretability to the model's decision, and can be applied to a diverse set of tasks.
- Score: 5.668457303716451
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained large language models have recently achieved ground-breaking
performance in a wide variety of language understanding tasks. However, the
same model cannot be applied to multimodal behavior understanding tasks (e.g.,
video sentiment/humor detection) unless non-verbal features (e.g., acoustic and
visual) can be integrated with language. Jointly modeling multiple modalities
significantly increases the model complexity, and makes the training process
data-hungry. While an enormous amount of text data is available via the web,
collecting large-scale multimodal behavioral video datasets is extremely
expensive, both in terms of time and money. In this paper, we investigate
whether large language models alone can successfully incorporate non-verbal
information when it is presented in textual form. We present a way to
convert the acoustic and visual information into corresponding textual
descriptions and concatenate them with the spoken text. We feed this augmented
input to a pre-trained BERT model and fine-tune it on three downstream
multimodal tasks: sentiment, humor, and sarcasm detection. Our approach,
TextMI, significantly reduces model complexity, adds interpretability to the
model's decision, and can be applied to a diverse set of tasks while achieving
superior (multimodal sarcasm detection) or near SOTA (multimodal sentiment
analysis and multimodal humor detection) performance. We propose TextMI as a
general, competitive baseline for multimodal behavioral analysis tasks,
particularly in a low-resource setting.
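To make the recipe concrete, the following is a minimal sketch of the textualize-and-fine-tune idea using Hugging Face Transformers. The cue-to-text templates, the separator format, and the label set are illustrative assumptions rather than the paper's exact prompt design.

```python
# Minimal sketch: describe non-verbal cues in words, concatenate them with the
# transcript, and fine-tune a standard BERT classifier on the augmented text.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # e.g., sarcastic vs. not sarcastic
)

def textualize(transcript, visual_cues, acoustic_cues):
    """Concatenate the spoken words with textual descriptions of the
    non-verbal behavior (hypothetical template)."""
    visual = "The speaker is " + ", ".join(visual_cues) + "."
    acoustic = "The voice sounds " + ", ".join(acoustic_cues) + "."
    return f"{transcript} [SEP] {visual} {acoustic}"

example = textualize(
    "Oh great, another Monday meeting.",
    visual_cues=["rolling their eyes", "smiling slightly"],
    acoustic_cues=["flat", "slower than usual"],
)

inputs = tokenizer(example, return_tensors="pt", truncation=True)
labels = torch.tensor([1])                      # 1 = sarcastic, for illustration
outputs = model(**inputs, labels=labels)
outputs.loss.backward()                         # an optimizer step would follow
print(outputs.logits)
```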
Related papers
- Textualized and Feature-based Models for Compound Multimodal Emotion Recognition in the Wild [45.29814349246784]
Multimodal large language models (LLMs) rely on explicit non-verbal cues that may be translated from different non-textual modalities into text.
This paper compares the potential of text- and feature-based approaches for compound multimodal ER in videos.
arXiv Detail & Related papers (2024-07-17T18:01:25Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing [17.92378239787507]
We present a decoder-only Discrete Multimodal Language Model (DMLM).
DMLM can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision).
Our results show that DMLM benefits significantly, across multiple tasks and datasets, from a combination of supervised and unsupervised training.
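A hedged sketch of the discrete-multimodal idea described above: each modality is first mapped to discrete token ids in its own range (text via a subword tokenizer, speech via a quantizer/codec), and a single decoder-only language model is trained with next-token prediction over the concatenated stream. The vocabulary offsets and task tags below are assumptions, not the paper's configuration.

```python
# Illustrative id ranges: text subwords, then shifted speech codes, then task tags.
TEXT_VOCAB = 32_000                    # subword ids occupy [0, 32000)
SPEECH_CODES = 1_024                   # discrete acoustic codes, shifted past text ids
TASK_TAGS = {"<asr>": TEXT_VOCAB + SPEECH_CODES,
             "<t2s>": TEXT_VOCAB + SPEECH_CODES + 1}

def build_asr_sequence(speech_codes, text_ids):
    """Task tag, then speech codes (shifted into their own id range),
    then the target transcript: one stream for next-token prediction."""
    shifted = [TEXT_VOCAB + c for c in speech_codes]
    return [TASK_TAGS["<asr>"], *shifted, *text_ids]

# Toy example: five acoustic codes followed by a three-token transcript.
print(build_asr_sequence([17, 801, 5, 5, 942], [1045, 2572, 3407]))
```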
arXiv Detail & Related papers (2024-06-04T20:08:25Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
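A hedged sketch of how such annotation-free data generation could be set up: captions stand in for the images, and a text-only language model is prompted to write a multi-turn conversation that refers back to them. The prompt wording and the `generate_with_llm` placeholder are hypothetical and not TextBind's actual pipeline.

```python
# Build a prompt in which captions act as textual surrogates for the images.
def build_generation_prompt(captions):
    image_list = "\n".join(
        f"<image {i}>: {caption}" for i, caption in enumerate(captions)
    )
    return (
        "You are given descriptions of images a user might share.\n"
        f"{image_list}\n"
        "Write a natural multi-turn conversation between a user and an "
        "assistant in which the images are discussed, referring to them "
        "as <image 0>, <image 1>, etc."
    )

def generate_with_llm(prompt: str) -> str:
    # Placeholder: plug in any instruction-tuned language model here.
    raise NotImplementedError

prompt = build_generation_prompt([
    "A golden retriever catching a frisbee in a park.",
    "The same dog asleep on a couch.",
])
# conversation = generate_with_llm(prompt)
```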
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks that is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
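As a simplified illustration of joint contrastive-plus-generative training with shared weights (the batch construction and loss weighting are assumptions, not MaMMUT's exact objective), one can sum a CLIP-style contrastive loss and a captioning loss computed from the same model:

```python
import torch
import torch.nn.functional as F

def joint_loss(image_emb, text_emb, lm_logits, caption_ids, temperature=0.07):
    """Sum an image-text contrastive loss and a next-token captioning loss so
    both objectives update the same shared weights."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim = image_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(sim.size(0))
    contrastive = (F.cross_entropy(sim, targets)
                   + F.cross_entropy(sim.t(), targets)) / 2
    generative = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                                 caption_ids.reshape(-1))
    return contrastive + generative        # relative weighting is a tunable choice

# Toy shapes: batch 4, embedding dim 256, caption length 12, vocab 32000.
loss = joint_loss(torch.randn(4, 256), torch.randn(4, 256),
                  torch.randn(4, 12, 32_000), torch.randint(0, 32_000, (4, 12)))
print(loss.item())
```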
arXiv Detail & Related papers (2023-03-29T16:42:30Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
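A simplified sketch of the underlying mechanism: continuous sensor readings are projected by a learned layer into the language model's token-embedding space and interleaved with ordinary word embeddings. The dimensions and the prefix layout below are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, sensor_dim, vocab = 512, 64, 32_000     # illustrative sizes
project = nn.Linear(sensor_dim, d_model)         # trained end-to-end with the LM
word_embed = nn.Embedding(vocab, d_model)        # stands in for the LM's embedding table

text_ids = torch.randint(0, vocab, (1, 10))      # e.g., "pick up the green block ..."
sensor_obs = torch.randn(1, 4, sensor_dim)       # four continuous observations

text_emb = word_embed(text_ids)                  # (1, 10, d_model)
sensor_emb = project(sensor_obs)                 # (1, 4, d_model)
inputs_embeds = torch.cat([sensor_emb, text_emb], dim=1)   # multimodal "sentence"
# `inputs_embeds` would replace the token embeddings fed to the decoder, so the
# pre-trained LM consumes sensor observations alongside ordinary words.
print(inputs_embeds.shape)                       # torch.Size([1, 14, 512])
```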
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- Grafting Pre-trained Models for Multimodal Headline Generation [12.063053852096514]
Multimodal headline generation utilizes both video frames and transcripts to generate a natural-language title for the video.
Previous research on pre-trained language models and video-language models has achieved significant progress in related downstream tasks.
We propose a novel approach that grafts the video encoder from a pre-trained video-language model onto a generative pre-trained language model.
arXiv Detail & Related papers (2022-11-14T08:59:59Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is not a clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, namely CLIP image representations and the scaling of language models, do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)