LLMRA: Multi-modal Large Language Model based Restoration Assistant
- URL: http://arxiv.org/abs/2401.11401v1
- Date: Sun, 21 Jan 2024 04:50:19 GMT
- Title: LLMRA: Multi-modal Large Language Model based Restoration Assistant
- Authors: Xiaoyu Jin, Yuan Shi, Bin Xia, Wenming Yang
- Abstract summary: We present a simple MLLM-based Image Restoration framework to address this gap.
We exploit the impressive capabilities of MLLMs to obtain degradation information for universal image restoration.
Our method leverages image degradation priors from MLLMs, providing low-level attribute descriptions of the input low-quality images and the restored high-quality images simultaneously.
- Score: 25.534022968675337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal Large Language Models (MLLMs) have a significant impact on
various tasks, owing to their extensive knowledge and powerful perception and
generation capabilities. However, applying MLLMs to low-level vision tasks
remains an open research problem. In this paper, we present a simple MLLM-based
image restoration framework to address this gap, namely the Multi-modal Large
Language Model based Restoration Assistant (LLMRA). We exploit the impressive
capabilities of MLLMs to obtain degradation information for universal image
restoration. By employing a pretrained multi-modal large language model and a
vision-language model, we generate text descriptions of the degraded image and
encode them as a context embedding that carries its degradation information.
Through the proposed Context Enhance Module (CEM) and Degradation Context based
Transformer Network (DC-former), we integrate this context embedding into the
restoration network, contributing to more accurate and adjustable image
restoration. Based on dialogue with users, our method leverages image
degradation priors from MLLMs and simultaneously provides low-level attribute
descriptions of the input low-quality images and the restored high-quality
images. Extensive experiments demonstrate the superior performance of LLMRA on
universal image restoration tasks.
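The abstract describes a pipeline in which a pretrained MLLM produces a degradation description of the input image, the description is encoded as a context embedding, and that embedding is injected into the restoration network through the CEM and DC-former. The sketch below illustrates one plausible way such context injection could be wired up; it is not the authors' implementation, and the cross-attention design, feature dimensions, and all module internals beyond the CEM/DC-former names are assumptions.

```python
# A minimal sketch of an LLMRA-style context-injection path, assuming
# cross-attention-based modules. All shapes and internals are illustrative.
import torch
import torch.nn as nn


class ContextEnhanceModule(nn.Module):
    """Assumed CEM: refines the text-derived degradation context with image features."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, context: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # context: (B, T, C) embedding of the MLLM degradation description
        # image_feats: (B, N, C) flattened features of the degraded image
        refined, _ = self.cross_attn(context, image_feats, image_feats)
        return self.norm(context + refined)


class DCFormerBlock(nn.Module):
    """Assumed DC-former block: image tokens attend to the degradation context."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(self.norm1(tokens), context, context)
        tokens = tokens + attended
        return tokens + self.ffn(self.norm2(tokens))


def restore(degraded_tokens: torch.Tensor, context_embedding: torch.Tensor) -> torch.Tensor:
    """Run the assumed pipeline: the CEM refines the context, the DC-former consumes it."""
    cem = ContextEnhanceModule()
    block = DCFormerBlock()
    context = cem(context_embedding, degraded_tokens)
    return block(degraded_tokens, context)


if __name__ == "__main__":
    # Toy shapes only: 64 image tokens and 8 context tokens with 256-dim features.
    img = torch.randn(1, 64, 256)
    ctx = torch.randn(1, 8, 256)
    print(restore(img, ctx).shape)  # torch.Size([1, 64, 256])
```

In the full system, the context embedding would come from encoding the MLLM's text description of the degradation; random tensors stand in for both streams here purely to exercise the shapes.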
Related papers
- MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
(arXiv, 2024-10-16)
- MRIR: Integrating Multimodal Insights for Diffusion-based Realistic Image Restoration [17.47612023350466]
We propose MRIR, a diffusion-based restoration method with multimodal insights.
At the textual level, we harness a pre-trained multimodal large language model to infer meaningful semantic information from low-quality images.
At the visual level, we focus on pixel-level control, utilizing a Pixel-level Processor and ControlNet to control spatial structures.
(arXiv, 2024-07-04)
- Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
(arXiv, 2024-05-22)
- From Image to Video, what do we need in multimodal LLMs? [19.85928004619801]
Multimodal Large Language Models (MLLMs) have demonstrated profound capabilities in understanding multimodal information.
We propose RED-VILLM, a Resource-Efficient Development pipeline for Video LLMs from Image LLMs.
Our approach highlights the potential for a more cost-effective and scalable advancement in multimodal models.
(arXiv, 2024-04-18)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
(arXiv, 2024-03-05)
- Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks.
MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
(arXiv, 2023-12-03)
- MLLMs-Augmented Visual-Language Representation Learning [70.5293060238008]
We demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning.
Our approach is simple, utilizing MLLMs to extend multiple diverse captions for each image.
We propose "text shearing" to maintain the quality and availability of extended captions.
(arXiv, 2023-11-30)
- LLMGA: Multimodal Large Language Model based Generation Assistant [53.150283805515926]
We introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA) to assist users in image generation and editing.
We train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts.
Extensive results show that LLMGA has promising generation and editing capabilities and can enable more flexible and expansive applications.
(arXiv, 2023-11-27)
This list is automatically generated from the titles and abstracts of the papers on this site.