LLMRA: Multi-modal Large Language Model based Restoration Assistant
- URL: http://arxiv.org/abs/2401.11401v1
- Date: Sun, 21 Jan 2024 04:50:19 GMT
- Title: LLMRA: Multi-modal Large Language Model based Restoration Assistant
- Authors: Xiaoyu Jin, Yuan Shi, Bin Xia, Wenming Yang
- Abstract summary: We present LLMRA, a simple MLLM-based image restoration framework that applies MLLMs to low-level vision tasks.
We exploit the impressive capabilities of MLLMs to obtain the degradation information for universal image restoration.
Our method leverages image degradation priors from MLLMs, providing low-level attribute descriptions of the input low-quality images and the restored high-quality images simultaneously.
- Score: 25.534022968675337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal Large Language Models (MLLMs) have a significant impact on
various tasks, due to their extensive knowledge and powerful perception and
generation capabilities. However, applying MLLMs to low-level vision tasks
remains an open research problem. In this paper, we present a simple
MLLM-based Image Restoration framework to address this gap, namely Multi-modal
Large Language Model based Restoration Assistant (LLMRA). We exploit the
impressive capabilities of MLLMs to obtain the degradation information for
universal image restoration. By employing a pretrained multi-modal large
language model and a vision language model, we generate text descriptions and
encode them as context embeddings with degradation information for the degraded
image. Through the proposed Context Enhance Module (CEM) and Degradation
Context based Transformer Network (DC-former), we integrate these context
embeddings into the restoration network, contributing to more accurate and
adjustable image restoration. Based on the dialogue with the users, our method
leverages image degradation priors from MLLMs, providing low-level attribute
descriptions of the input low-quality images and the restored high-quality
images simultaneously. Extensive experiments demonstrate the superior
performance of our LLMRA in universal image restoration tasks.
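The abstract says that text-derived context embeddings are integrated into the restoration network, but it does not spell out the mechanism. As a rough illustration only, the sketch below fuses image feature tokens with context tokens using plain scaled dot-product cross-attention and a residual connection. This is a generic technique standing in for the idea, not the paper's CEM or DC-former; the function names `softmax` and `cross_attend` are hypothetical, and learned projection weights are omitted for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(features, context):
    """Fuse image feature tokens with text context tokens.

    features: list of feature vectors (queries), one per image token.
    context:  list of context vectors (keys and values), one per text token.
    Assumption: both share the same dimension and no learned projections
    are applied, so this is only a structural sketch.
    """
    dim = len(context[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in features:
        # Similarity of this image token to every context token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        weights = softmax(scores)
        # Weighted sum of context tokens.
        attended = [
            sum(w * k[d] for w, k in zip(weights, context)) for d in range(dim)
        ]
        # Residual connection keeps the original image feature.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused
```

With a single context token the attention weight is 1, so each image token is simply shifted by that token; with several context tokens, each image token receives a similarity-weighted mixture of them.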
Related papers
- LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation [60.02145113467427]
This work introduces a fine-tuning approach that integrates large language models with the pretrained CLIP visual encoder.
To address the challenge of LLMs' autoregressive nature, we propose a caption-to-caption contrastive learning framework.
Our method achieves substantial performance gains on various downstream tasks.
arXiv Detail & Related papers (2024-11-07T18:59:16Z)
- MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z)
- MRIR: Integrating Multimodal Insights for Diffusion-based Realistic Image Restoration [17.47612023350466]
We propose MRIR, a diffusion-based restoration method with multimodal insights.
At the textual level, we harness the power of the pre-trained multimodal large language model to infer meaningful semantic information from low-quality images.
At the visual level, we mainly focus on pixel-level control, using a Pixel-level Processor and ControlNet to control spatial structures.
arXiv Detail & Related papers (2024-07-04T04:55:14Z)
- From Image to Video, what do we need in multimodal LLMs? [19.85928004619801]
Multimodal Large Language Models (MLLMs) have demonstrated profound capabilities in understanding multimodal information.
We propose RED-VILLM, a Resource-Efficient Development pipeline for Video LLMs from Image LLMs.
Our approach highlights the potential for a more cost-effective and scalable advancement in multimodal models.
arXiv Detail & Related papers (2024-04-18T02:43:37Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks.
MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z)
- MLLMs-Augmented Visual-Language Representation Learning [70.5293060238008]
We demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning.
Our approach is simple, utilizing MLLMs to extend multiple diverse captions for each image.
We propose "text shearing" to maintain the quality and availability of extended captions.
arXiv Detail & Related papers (2023-11-30T18:05:52Z)
- LLMGA: Multimodal Large Language Model based Generation Assistant [53.150283805515926]
We introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA) to assist users in image generation and editing.
We train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts.
Extensive results show that LLMGA has promising generation and editing capabilities and can enable more flexible and expansive applications.
arXiv Detail & Related papers (2023-11-27T13:37:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.