Related papers: M$^{2}$Chat: Empowering VLM for Multimodal LLM Interleaved Text-Image Generation

M$^{2}$Chat: Empowering VLM for Multimodal LLM Interleaved Text-Image Generation

URL: http://arxiv.org/abs/2311.17963v2
Date: Sat, 13 Apr 2024 04:16:18 GMT
Title: M$^{2}$Chat: Empowering VLM for Multimodal LLM Interleaved Text-Image Generation
Authors: Xiaowei Chi, Rongyu Zhang, Zhengkai Jiang, Yijiang Liu, Yatian Wang, Xingqun Qi, Wenhan Luo, Peng Gao, Shanghang Zhang, Qifeng Liu, Yike Guo,
Abstract summary: We propose textbf$M2Chat$, a novel unified multimodal LLM framework for generating interleaved text-image conversation. $M3Adapter$ integrates granular low-level visual information and high-level semantic features from multi-modality prompts. $M3FT$ fine-tuning strategy optimize disjoint groups of parameters for image-text alignment and visual-instruction.
Score: 45.79215260916687
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: While current LLM chatbots like GPT-4V bridge the gap between human instructions and visual representations to enable text-image generations, they still lack efficient alignment methods for high-fidelity performance on multiple downstream tasks. In this paper, we propose \textbf{$M^{2}Chat$}, a novel unified multimodal LLM framework for generating interleaved text-image conversation across various scenarios. Specifically, we propose an $M^{3}Adapter$ that efficiently integrates granular low-level visual information and high-level semantic features from multi-modality prompts. Upon the well-aligned fused feature, $M^{3}Adapter$ tailors a learnable gating strategy to balance the model creativity and consistency across various tasks adaptively. Moreover, to further enhance the effectiveness of $M^{3}Adapter$ while preserving the coherence of semantic context comprehension, we introduce a two-stage $M^{3}FT$ fine-tuning strategy. This strategy optimizes disjoint groups of parameters for image-text alignment and visual-instruction respectively. Extensive experiments demonstrate our $M^{2}Chat$ surpasses state-of-the-art counterparts across diverse benchmarks, showcasing its prowess in interleaving generation, storytelling, and multimodal dialogue systems. The demo and code are available at \red{https://mattie-e.github.io/M2Chat.github.io}.

Related papers

$β$-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment [53.42377319350806]
$$-CLIP is a multi-granular text-conditioned contrastive learning framework.<n>$$-CAL addresses the semantic overlap inherent in this hierarchy.<n>$$-CLIP establishes a robust, adaptive baseline for dense vision-language correspondence.
arXiv Detail & Related papers (2025-12-14T13:03:20Z)
Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations [33.11867433769496]
This paper presents a framework that attempts to unify visual understanding and generation within a shared semantic representation.<n>At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary.<n> Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency.
arXiv Detail & Related papers (2025-06-23T17:59:14Z)
Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding [42.15416804253783]
Multi-modal Large Language Models (MLLMs) endow large language models with visual comprehension capabilities. How to design a suitable image-text pre-training task for bridging the visual and language modality in document-level MLLMs remains underexplored. We introduce a novel visual-language alignment method that casts the key issue as a Visual Question Answering with Mask generation (VQAMask) task.
arXiv Detail & Related papers (2025-03-18T11:07:14Z)
Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
Boosting Text-To-Image Generation via Multilingual Prompting in Large Multimodal Models [43.16111789538798]
We build parallel multilingual prompts aimed at harnessing the multilingual capabilities of large multimodal models (LMMs) Experiments on two LMMs across 3 benchmarks show that our method, PMT2I achieves, superior performance in general, compositional, and fine-grained assessments.
arXiv Detail & Related papers (2025-01-13T06:41:23Z)
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.<n>It aims to localize instances of interest across multiple images based on open-ended text prompts.<n>We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z)
Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
We propose Leopard, an MLLM tailored for handling vision-language tasks involving multiple text-rich images.<n>First, we curated about one million high-quality multimodal instruction-tuning data, tailored to text-rich, multi-image scenarios.<n>Second, we proposed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
Multi-modal Generation via Cross-Modal In-Context Learning [50.45304937804883]
We propose a Multi-modal Generation via Cross-Modal In-Context Learning (MGCC) method that generates novel images from complex multimodal prompt sequences. Our MGCC demonstrates a diverse range of multimodal capabilities, like novel image generation, the facilitation of multimodal dialogue, and generation of texts.
arXiv Detail & Related papers (2024-05-28T15:58:31Z)
Unsupervised Text Style Transfer via LLMs and Attention Masking with Multi-way Interactions [18.64326057581588]
Unsupervised Text Style Transfer (UTST) has emerged as a critical task within the domain of Natural Language Processing (NLP) We propose four ways of interactions, that are pipeline framework with tuned orders; knowledge distillation from Large Language Models (LLMs) to attention masking model; in-context learning with constructed parallel examples. We empirically show these multi-way interactions can improve the baselines in certain perspective of style strength, content preservation and text fluency.
arXiv Detail & Related papers (2024-02-21T09:28:02Z)
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer [106.79844459065828]
This paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data. It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context. Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions.
arXiv Detail & Related papers (2024-01-18T18:50:16Z)
Lightweight In-Context Tuning for Multimodal Unified Models [57.10831399642176]
MultiModal In-conteXt Tuning (M$2$IXT) is a lightweight module to enhance the ICL capabilities of multimodal unified models. When tuned on as little as 50K multimodal data, M$2$IXT can boost the few-shot ICL performance significantly.
arXiv Detail & Related papers (2023-10-08T10:47:24Z)
Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models [60.81438804824749]
Multimodal instruction-following models extend capabilities by integrating both text and images. Existing models such as MiniGPT-4 and LLaVA face challenges in maintaining dialogue coherence in scenarios involving multiple images. We introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions. We then present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images.
arXiv Detail & Related papers (2023-08-31T05:15:27Z)
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs [101.50522135049198]
BuboGPT is a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio and language. Our contributions are two-fold: 1) An off-the-shelf visual grounding module based on SAM that extracts entities in a sentence and find corresponding masks in the image. Our experiments show that BuboGPT achieves impressive multi-modality understanding and visual grounding abilities during the interaction with human.
arXiv Detail & Related papers (2023-07-17T15:51:47Z)
Visual Grounding Strategies for Text-Only Natural Language Processing [1.2183405753834562]
multimodal extensions of BERT allow a joint modeling of texts and images that lead to state-of-the-art results on multimodal tasks such as Visual Question Answering. Here, we leverage multimodal modeling for purely textual tasks with the expectation that the multimodal pretraining provides a grounding that can improve text processing accuracy. A first type of strategy, referred to as it transferred grounding consists in applying multimodal models to text-only tasks using a placeholder to replace image input. The second one, which we call it associative grounding, harnesses image retrieval to match texts with related images during both
arXiv Detail & Related papers (2021-03-25T16:03:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.