DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via
Multi-Modal Causal Attention
- URL: http://arxiv.org/abs/2309.14327v3
- Date: Wed, 29 Nov 2023 07:52:18 GMT
- Title: DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via
Multi-Modal Causal Attention
- Authors: Zhewei Yao, Xiaoxia Wu, Conglong Li, Minjia Zhang, Heyang Qin,
Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He
- Abstract summary: DeepSpeed-VisualChat is designed to optimize Large Language Models (LLMs) by incorporating multi-modal capabilities.
Our framework is notable for (1) its open-source support for multi-round and multi-image dialogues, (2) introducing an innovative multi-modal causal attention mechanism, and (3) utilizing data blending techniques on existing datasets to assure seamless interactions.
- Score: 55.2825684201129
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most of the existing multi-modal models, hindered by their incapacity to
adeptly manage interleaved image-and-text inputs in multi-image, multi-round
dialogues, face substantial constraints in resource allocation for training and
data accessibility, impacting their adaptability and scalability across varied
interaction realms. To address this, we present the DeepSpeed-VisualChat
framework, designed to optimize Large Language Models (LLMs) by incorporating
multi-modal capabilities, with a focus on enhancing the proficiency of Large
Vision and Language Models in handling interleaved inputs. Our framework is
notable for (1) its open-source support for multi-round and multi-image
dialogues, (2) introducing an innovative multi-modal causal attention
mechanism, and (3) utilizing data blending techniques on existing datasets to
assure seamless interactions in multi-round, multi-image conversations.
Compared to existing frameworks, DeepSpeed-VisualChat shows superior
scalability up to 70B parameter language model size, representing a significant
advancement in multi-modal language models and setting a solid foundation for
future explorations.
Related papers
- S3: A Simple Strong Sample-effective Multimodal Dialog System [61.31055673156622]
We present a conceptually simple yet powerful baseline for the multimodal dialog task, an S3 model, that achieves near state-of-the-art results.
The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector.
arXiv Detail & Related papers (2024-06-26T12:45:43Z)
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation [61.392147185793476]
We present a unified and versatile foundation model, namely, SEED-X.
SEED-X is able to model multi-granularity visual semantics for comprehension and generation tasks.
We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.
arXiv Detail & Related papers (2024-04-22T17:56:09Z)
- DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation [46.085482021301516]
We propose DialogGen to align off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System.
It is composed of drawing prompt alignment, careful training data curation, and error correction.
Our experiments on DialogGen and user study demonstrate the effectiveness of DialogGen compared with other State-of-the-Art models.
arXiv Detail & Related papers (2024-03-13T18:00:01Z)
- ChatterBox: Multi-round Multimodal Referring and Grounding [108.9673313949746]
We present a new benchmark and an efficient vision-language model for this purpose.
The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks.
Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively.
arXiv Detail & Related papers (2024-01-24T09:02:00Z)
- DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
arXiv Detail & Related papers (2024-01-02T07:40:12Z)
- Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models [25.058162782167503]
Chain-of-thought (CoT) reasoning has exhibited impressive performance in language models for solving complex tasks and answering questions.
We introduce a novel approach for multi-modal CoT reasoning that utilizes latent space learning via diffusion processes to generate effective image features that align with language thoughts.
Our method fuses image features and text representations at a deep level and improves the complex reasoning ability of multi-modal CoT.
arXiv Detail & Related papers (2023-12-14T09:13:09Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering larger language models with the multi-turn interleaved instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- Can images help recognize entities? A study of the role of images for Multimodal NER [20.574849371747685]
Multimodal named entity recognition (MNER) requires bridging the gap between language understanding and visual context.
While many multimodal neural techniques have been proposed to incorporate images into the MNER task, the model's ability to leverage multimodal interactions remains poorly understood.
arXiv Detail & Related papers (2020-10-23T23:41:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.