Multimodal Conditionality for Natural Language Generation
- URL: http://arxiv.org/abs/2109.01229v1
- Date: Thu, 2 Sep 2021 22:06:07 GMT
- Title: Multimodal Conditionality for Natural Language Generation
- Authors: Michael Sollami and Aashish Jain
- Abstract summary: MAnTiS is a general approach for multimodal conditionality in transformer-based Natural Language Generation models.
We apply MAnTiS to the task of product description generation, conditioning a network on both product images and titles to generate descriptive text.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large scale pretrained language models have demonstrated state-of-the-art
performance in language understanding tasks. Their application has recently
expanded into multimodality learning, leading to improved representations
combining vision and language. However, progress in adapting language models
towards conditional Natural Language Generation (NLG) has been limited to a
single modality, generally text. We propose MAnTiS, Multimodal Adaptation for
Text Synthesis, a general approach for multimodal conditionality in
transformer-based NLG models. In this method, we pass inputs from each modality
through modality-specific encoders, project to textual token space, and finally
join to form a conditionality prefix. We fine-tune the pretrained language
model and encoders with the conditionality prefix guiding the generation. We
apply MAnTiS to the task of product description generation, conditioning a
network on both product images and titles to generate descriptive text. We
demonstrate that MAnTiS outperforms strong baseline approaches on standard NLG
scoring metrics. Furthermore, qualitative assessments demonstrate that MAnTiS
can generate human quality descriptions consistent with given multimodal
inputs.
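To make the conditioning scheme concrete, below is a minimal sketch of the prefix-conditioning idea the abstract describes: a modality-specific image encoder, a linear projection into the language model's token embedding space, and a conditionality prefix joined with the title tokens that guides generation. The specific module choices here (a ResNet-50 backbone, GPT-2 as the pretrained decoder, hidden size 768) and the loss masking are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import GPT2LMHeadModel


class MultimodalPrefixLM(nn.Module):
    """Conditions a pretrained LM on an image + title prefix (illustrative sketch)."""

    def __init__(self, d_model: int = 768):
        super().__init__()
        self.lm = GPT2LMHeadModel.from_pretrained("gpt2")
        # Modality-specific encoder for images (assumption: ResNet-50 backbone).
        backbone = resnet50(weights="IMAGENET1K_V2")
        self.image_encoder = nn.Sequential(*list(backbone.children())[:-1])
        # Projection from image features into the LM's token embedding space.
        self.image_proj = nn.Linear(2048, d_model)

    def forward(self, images, title_ids, desc_ids):
        # Encode the image and project it to one "visual token" per example.
        feats = self.image_encoder(images).flatten(1)       # (B, 2048)
        img_tokens = self.image_proj(feats).unsqueeze(1)    # (B, 1, d_model)
        # Embed title and description tokens with the LM's own embedding table.
        title_emb = self.lm.transformer.wte(title_ids)      # (B, T_title, d_model)
        desc_emb = self.lm.transformer.wte(desc_ids)        # (B, T_desc, d_model)
        # Join modalities into a conditionality prefix, then append the target text.
        inputs = torch.cat([img_tokens, title_emb, desc_emb], dim=1)
        # Mask the prefix so the LM loss is computed on description tokens only.
        prefix_len = img_tokens.size(1) + title_emb.size(1)
        pad = torch.full((desc_ids.size(0), prefix_len), -100, device=desc_ids.device)
        labels = torch.cat([pad, desc_ids], dim=1)
        return self.lm(inputs_embeds=inputs, labels=labels).loss
```

In this sketch only the description tokens contribute to the loss, so the image-plus-title prefix acts purely as conditioning context; at inference time the same prefix would be fed to the model and the description decoded auto-regressively.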
Related papers
- OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously.
Our main findings reveal that most OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts.
To address this gap, we curate an instruction tuning dataset of 84.5K training samples, OmniInstruct, for training OLMs to adapt to multimodal contexts.
arXiv Detail & Related papers (2024-09-23T17:59:05Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- Meta-Task Prompting Elicits Embeddings from Large Language Models [54.757445048329735]
We introduce a new unsupervised text embedding method, Meta-Task Prompting with Explicit One-Word Limitation.
We generate high-quality sentence embeddings from Large Language Models without the need for model fine-tuning.
Our findings suggest a new scaling law, offering a versatile and resource-efficient approach for embedding generation across diverse scenarios.
arXiv Detail & Related papers (2024-02-28T16:35:52Z)
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation [79.02357561313785]
We introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data.
VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective.
arXiv Detail & Related papers (2023-12-14T18:59:43Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- Multi-Granularity Prediction with Learnable Fusion for Scene Text Recognition [20.48454415635795]
Scene text recognition (STR) has been an active research topic in computer vision for years.
To tackle this tough problem, numerous innovative methods have been proposed, and incorporating linguistic knowledge into STR models has recently become a prominent trend.
In this work, we first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet functionally powerful vision STR model.
It already outperforms most previous state-of-the-art models for scene text recognition, including both pure vision models and language-augmented methods.
arXiv Detail & Related papers (2023-07-25T04:12:50Z)
- Benchmarking Large Language Model Capabilities for Conditional Generation [15.437176676169997]
We discuss how to adapt existing application-specific generation benchmarks to PLMs.
We show that PLMs differ in their applicability to different data regimes and their generalization to multiple languages.
arXiv Detail & Related papers (2023-06-29T08:59:40Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- Multi-Granularity Prediction for Scene Text Recognition [20.48454415635795]
Scene text recognition (STR) has been an active research topic in computer vision for years.
We first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet powerful vision STR model.
We propose a Multi-Granularity Prediction strategy to inject information from the language modality into the model in an implicit way.
The resultant algorithm (termed MGP-STR) is able to push the performance envelope of STR to an even higher level.
arXiv Detail & Related papers (2022-09-08T06:43:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.