Lateralization LoRA: Interleaved Instruction Tuning with Modality-Specialized Adaptations
- URL: http://arxiv.org/abs/2407.03604v1
- Date: Thu, 4 Jul 2024 03:28:22 GMT
- Title: Lateralization LoRA: Interleaved Instruction Tuning with Modality-Specialized Adaptations
- Authors: Zhiyang Xu, Minqian Liu, Ying Shen, Joy Rimchala, Jiaxin Zhang, Qifan Wang, Yu Cheng, Lifu Huang
- Abstract summary: We introduce LeafInstruct, the first open-sourced interleaved instruction tuning dataset, with over 30,000 high-quality instances across more than 10 domains.
We propose Lateralization LoRA, a novel modality-specialized adaptation method inspired by the concept of brain lateralization.
We perform instruction tuning of the VLG (i.e., EMU2) using Lateralization LoRA on the LeafInstruct dataset.
- Score: 45.800383191637785
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Vision-Language Models (VLMs) have led to the development of Vision-Language Generalists (VLGs) capable of understanding and generating interleaved images and text. Despite these advances, VLGs still struggle to follow user instructions for interleaved text and image generation. To address this issue, we introduce LeafInstruct, the first open-sourced interleaved instruction tuning dataset, with over 30,000 high-quality instances across more than 10 domains. Due to the extensive size of existing VLGs, we opt for parameter-efficient tuning. However, we observe that VLGs tuned with a standard LoRA typically exhibit inferior performance in interleaved text-image generation. We attribute this problem to modality interference and the lack of modality-specialized adaptation design. Hence, we propose Lateralization LoRA, a novel modality-specialized adaptation method inspired by the concept of brain lateralization. Lateralization LoRA employs a hybrid approach, combining a traditional linear LoRA for text generation with a convolutional LoRA for image generation, leveraging modality-specific structures and parameter sets to produce high-quality text and images. We perform instruction tuning of the VLG (i.e., EMU2) using Lateralization LoRA on the LeafInstruct dataset. Extensive experiments demonstrate that EMU2 tuned with Lateralization LoRA achieves state-of-the-art performance, significantly surpassing baseline models in complex interleaved tasks.
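As a rough illustration of the modality-specialized design described in the abstract, the sketch below pairs a frozen base projection with two low-rank branches: a linear LoRA applied to text tokens and a 1D convolutional LoRA applied to image tokens, selected by a modality mask. This is a minimal sketch, not the authors' released implementation; the class name, the convolution-over-token-sequence choice, the rank/alpha defaults, and the masking interface are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class LateralizationLoRASketch(nn.Module):
    """Illustrative sketch (hypothetical, not the paper's code): a frozen linear
    layer augmented with a linear LoRA branch for text tokens and a convolutional
    LoRA branch for image tokens, routed by a per-token modality mask."""

    def __init__(self, base_linear: nn.Linear, rank: int = 8, alpha: float = 16.0,
                 kernel_size: int = 3):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False  # base VLG weights stay frozen
        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.scale = alpha / rank
        # Text branch: standard linear LoRA (delta W = B A), zero-initialized up-projection.
        self.text_A = nn.Linear(d_in, rank, bias=False)
        self.text_B = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.text_B.weight)
        # Image branch: 1D convolutional LoRA over the token sequence so neighboring
        # visual tokens share local structure (an assumed design choice for this sketch).
        self.img_down = nn.Conv1d(d_in, rank, kernel_size, padding=kernel_size // 2)
        self.img_up = nn.Conv1d(rank, d_out, kernel_size, padding=kernel_size // 2)
        nn.init.zeros_(self.img_up.weight)
        nn.init.zeros_(self.img_up.bias)

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in); is_image: (batch, seq) boolean modality mask.
        out = self.base(x)
        text_delta = self.text_B(self.text_A(x))
        # Conv1d expects (batch, channels, seq), so transpose in and out.
        img_delta = self.img_up(self.img_down(x.transpose(1, 2))).transpose(1, 2)
        mask = is_image.unsqueeze(-1).to(x.dtype)
        return out + self.scale * ((1.0 - mask) * text_delta + mask * img_delta)
```

In this sketch only the two low-rank branches receive gradients, which mirrors the parameter-efficient tuning motivation stated in the abstract; how the actual method places and routes its adapters inside EMU2 should be taken from the paper itself.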
Related papers
- Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models [0.7366405857677226]
Vision-Language Aligned Diffusion (VLAD) model is a generative framework that addresses challenges through a dual-stream strategy.
VLAD decomposes textual prompts into global and local representations, ensuring precise alignment with visual features.
It incorporates a multi-stage diffusion process with hierarchical guidance to generate high-fidelity images.
arXiv Detail & Related papers (2025-01-01T18:27:13Z)
- Multi-Modality Driven LoRA for Adverse Condition Depth Estimation [61.525312117638116]
We propose Multi-Modality Driven LoRA (MMD-LoRA) for Adverse Condition Depth Estimation.
It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL).
It achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets.
arXiv Detail & Related papers (2024-12-28T14:23:58Z)
- VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval [8.908777234657046]
Large language and vision-language models (LLMs/LVLMs) have gained prominence across various domains.
Here we propose VideoLights, a novel HD/MR framework addressing these limitations through Convolutional Projection and Feature Refinement modules.
Comprehensive experiments on QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2024-12-02T14:45:53Z)
- LoRA of Change: Learning to Generate LoRA for the Editing Instruction from A Single Before-After Image Pair [116.48684498656871]
We propose the LoRA of Change (LoC) framework for image editing with visual instructions, i.e., before-after image pairs.
We learn an instruction-specific LoRA to encode the "change" in a before-after image pair, enhancing the interpretability and reusability of our model.
Our model produces high-quality images that align with user intent and support a broad spectrum of real-world visual instructions.
arXiv Detail & Related papers (2024-11-28T13:55:06Z)
- Self-correcting LLM-controlled Diffusion Models [83.26605445217334]
We introduce Self-correcting LLM-controlled Diffusion (SLD).
SLD is a framework that generates an image from the input prompt, assesses its alignment with the prompt, and performs self-corrections on the inaccuracies in the generated image.
Our approach can rectify a majority of incorrect generations, particularly in generative numeracy, attribute binding, and spatial relationships.
arXiv Detail & Related papers (2023-11-27T18:56:37Z)
- LLMGA: Multimodal Large Language Model based Generation Assistant [53.150283805515926]
We introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA) to assist users in image generation and editing.
We train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts.
Extensive results show that LLMGA has promising generation and editing capabilities and can enable more flexible and expansive applications.
arXiv Detail & Related papers (2023-11-27T13:37:26Z)
- Planting a SEED of Vision in Large Language Model [73.17530130368053]
We present SEED, an elaborate image tokenizer that empowers Large Language Models (LLMs) with the ability to SEE and Draw at the same time.
This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5M publicly available image-text pairs.
arXiv Detail & Related papers (2023-07-16T13:41:39Z)
- MAGVLT: Masked Generative Vision-and-Language Transformer [15.796199345773879]
We explore a unified generative vision-and-language model that can produce both images and text sequences.
We propose a generative VL transformer based on non-autoregressive mask prediction, named MAGVLT, and compare it with an autoregressive generative VL transformer (ARGVLT).
For rigorous training of our MAGVLT with image-text pairs from scratch, we combine the image-to-text, text-to-image, and joint image-and-text mask prediction tasks.
arXiv Detail & Related papers (2023-03-21T21:49:39Z)