DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via
Multi-Modal Causal Attention
- URL: http://arxiv.org/abs/2309.14327v3
- Date: Wed, 29 Nov 2023 07:52:18 GMT
- Title: DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via
Multi-Modal Causal Attention
- Authors: Zhewei Yao, Xiaoxia Wu, Conglong Li, Minjia Zhang, Heyang Qin,
Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He
- Abstract summary: DeepSpeed-VisualChat is designed to optimize Large Language Models (LLMs) by incorporating multi-modal capabilities.
Our framework is notable for (1) its open-source support for multi-round and multi-image dialogues, (2) introducing an innovative multi-modal causal attention mechanism, and (3) utilizing data blending techniques on existing datasets to assure seamless interactions.
- Score: 55.2825684201129
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most of the existing multi-modal models, hindered by their incapacity to
adeptly manage interleaved image-and-text inputs in multi-image, multi-round
dialogues, face substantial constraints in resource allocation for training and
data accessibility, impacting their adaptability and scalability across varied
interaction realms. To address this, we present the DeepSpeed-VisualChat
framework, designed to optimize Large Language Models (LLMs) by incorporating
multi-modal capabilities, with a focus on enhancing the proficiency of Large
Vision and Language Models in handling interleaved inputs. Our framework is
notable for (1) its open-source support for multi-round and multi-image
dialogues, (2) introducing an innovative multi-modal causal attention
mechanism, and (3) utilizing data blending techniques on existing datasets to
assure seamless interactions in multi-round, multi-image conversations.
Compared to existing frameworks, DeepSpeed-VisualChat shows superior
scalability up to 70B parameter language model size, representing a significant
advancement in multi-modal language models and setting a solid foundation for
future explorations.
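The key mechanism named in the title, multi-modal causal attention (MMCA), lets text tokens attend causally to all preceding tokens, both text and image, while image tokens attend only to tokens of their own image. The snippet below is a minimal sketch of how such an attention mask could be built; the helper name `build_mmca_mask`, the modality-id encoding, and the PyTorch boolean-mask convention are illustrative assumptions, not the framework's actual implementation.

```python
import torch

def build_mmca_mask(modality: torch.Tensor) -> torch.Tensor:
    """Sketch of a multi-modal causal attention mask (assumed encoding).

    `modality` labels each position of the interleaved sequence:
    0 for a text token, and a positive id (1, 2, ...) shared by all
    tokens of the same image. Returns a (seq_len, seq_len) boolean
    mask where True means "query i may attend to key j".
    """
    seq_len = modality.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

    is_text_query = (modality == 0).unsqueeze(1)                  # (L, 1)
    same_image = modality.unsqueeze(1) == modality.unsqueeze(0)   # (L, L)

    # Text queries: ordinary causal attention over all earlier tokens,
    # whether text or image. Image queries: attention restricted to
    # tokens of the same image.
    return torch.where(is_text_query, causal, causal & same_image)

# Toy interleaved sequence: image 1 (2 tokens), text (3 tokens),
# image 2 (2 tokens), text (2 tokens).
modality = torch.tensor([1, 1, 0, 0, 0, 2, 2, 0, 0])
mask = build_mmca_mask(modality)
# `mask` could then be passed as attn_mask to an attention implementation
# that treats False entries as disallowed positions.
```

Keeping the mask construction separate from the attention kernel makes it easy to compare a mask like this against standard causal or cross-attention masks on the same interleaved sequence.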
Related papers
- From Unimodal to Multimodal: Scaling up Projectors to Align Modalities [16.733970553781887]
We propose a novel approach that aligns vision and language modalities using only projection layers on pretrained, frozen unimodal encoders.
Our method exploits the high semantic similarity between embedding spaces of well-trained vision and language models.
It involves selecting semantically similar encoders in the latent space, curating a concept-rich dataset of image-caption pairs, and training simple projectors.
arXiv Detail & Related papers (2024-09-28T17:57:32Z)
- MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model [49.931663904599205]
MaVEn is an innovative framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning.
We show that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
arXiv Detail & Related papers (2024-08-22T11:57:16Z)
- mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models [71.40705814904898]
We introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding.
Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space.
arXiv Detail & Related papers (2024-08-09T03:25:42Z)
- DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation [46.085482021301516]
We propose DialogGen to align off-the-shelf MLLMs and text-to-image (T2I) models to build a Multi-modal Interactive Dialogue System.
It is composed of drawing prompt alignment, careful training data curation, and error correction.
Our experiments and user study demonstrate the effectiveness of DialogGen compared with other state-of-the-art models.
arXiv Detail & Related papers (2024-03-13T18:00:01Z)
- ChatterBox: Multi-round Multimodal Referring and Grounding [108.9673313949746]
We present a new benchmark and an efficient vision-language model for multi-round multimodal referring and grounding (MRG).
The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks.
Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively.
arXiv Detail & Related papers (2024-01-24T09:02:00Z)
- DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
arXiv Detail & Related papers (2024-01-02T07:40:12Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering larger language models with multi-turn interleaved multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- Can images help recognize entities? A study of the role of images for
Multimodal NER [20.574849371747685]
Multimodal named entity recognition (MNER) requires bridging the gap between language understanding and visual context.
While many multimodal neural techniques have been proposed to incorporate images into the MNER task, the model's ability to leverage multimodal interactions remains poorly understood.
arXiv Detail & Related papers (2020-10-23T23:41:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.