LLV-FSR: Exploiting Large Language-Vision Prior for Face Super-resolution
- URL: http://arxiv.org/abs/2411.09293v1
- Date: Thu, 14 Nov 2024 09:12:18 GMT
- Title: LLV-FSR: Exploiting Large Language-Vision Prior for Face Super-resolution
- Authors: Chenyang Wang, Wenjie An, Kui Jiang, Xianming Liu, Junjun Jiang
- Abstract summary: We propose a new framework called LLV-FSR, which marries the power of large vision-language models and higher-order visual priors with the challenging task of face super-resolution.
Experimental results demonstrate that our proposed framework significantly improves both reconstruction quality and perceptual quality, surpassing the SOTA by 0.43 dB in PSNR on the MMCelebA-HQ dataset.
- Score: 67.23699927053191
- Abstract: Existing face super-resolution (FSR) methods have made significant advances, but they primarily super-resolve faces from limited visual information, in particular the original pixel-wise space, commonly overlooking pluralistic clues such as higher-order depth and semantics, as well as non-visual inputs (text captions and descriptions). Consequently, these methods struggle to produce a unified and meaningful representation from the input face. We posit that introducing a language-vision pluralistic representation into an as-yet unexplored embedding space could enhance FSR by encoding and exploiting the complementarity across language-vision priors. This motivates us to propose a new framework called LLV-FSR, which marries the power of large vision-language models and higher-order visual priors with the challenging task of FSR. Specifically, besides directly absorbing knowledge from the original input, we introduce a pre-trained vision-language model to generate pluralistic priors, comprising the image caption and description, the face semantic mask, and depth. These priors are then employed to guide learning of more critical feature representations, facilitating realistic and high-quality face super-resolution. Experimental results demonstrate that our proposed framework significantly improves both reconstruction quality and perceptual quality, surpassing the SOTA by 0.43 dB in PSNR on the MMCelebA-HQ dataset.
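The abstract's core mechanism is fusing VLM-generated priors (caption, description, semantic mask, depth) with pixel features to guide reconstruction. Below is a minimal, hypothetical PyTorch sketch of such prior-guided fusion; the module names, dimensions, and the cross-attention design are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of prior-guided fusion for FSR (not the paper's model).
import torch
import torch.nn as nn

class PriorGuidedFSR(nn.Module):
    """Fuse pixel features with VLM-derived priors via cross-attention."""
    def __init__(self, dim=64, text_dim=512):
        super().__init__()
        self.pix_enc = nn.Conv2d(3, dim, 3, padding=1)           # LR face
        self.vis_enc = nn.Conv2d(2, dim, 3, padding=1)           # mask + depth
        self.txt_proj = nn.Linear(text_dim, dim)                 # caption/description
        self.attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.up = nn.Sequential(
            nn.Conv2d(dim, 3 * 16, 3, padding=1), nn.PixelShuffle(4))  # x4 SR

    def forward(self, lr, mask, depth, text_emb):
        # Visual priors enter additively; text priors enter as key/value tokens.
        f = self.pix_enc(lr) + self.vis_enc(torch.cat([mask, depth], 1))
        b, c, h, w = f.shape
        q = f.flatten(2).transpose(1, 2)                         # (B, HW, C)
        kv = self.txt_proj(text_emb)                             # (B, T, C)
        f = f + self.attn(q, kv, kv)[0].transpose(1, 2).reshape(b, c, h, w)
        return self.up(f)

lr = torch.randn(1, 3, 32, 32)
mask, depth = torch.randn(1, 1, 32, 32), torch.randn(1, 1, 32, 32)
text_emb = torch.randn(1, 8, 512)   # e.g., token features from a VLM text encoder
print(PriorGuidedFSR()(lr, mask, depth, text_emb).shape)  # torch.Size([1, 3, 128, 128])
```

Cross-attention lets every spatial location consult the language priors, while mask and depth are fused at the input; this is one plausible reading of "guiding the more critical feature representation."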
Related papers
- Enhancing Zero-Shot Facial Expression Recognition by LLM Knowledge Transfer [40.47880613758304]
We propose a novel method, Exp-CLIP, to enhance zero-shot FER by transferring task knowledge from large language models (LLMs).
Specifically, based on the pre-trained vision-language encoders, we incorporate a projection head designed to map the initial joint vision-language space into a space that captures representations of facial actions.
Given unlabelled facial data, Exp-CLIP achieves zero-shot results superior to CLIP and several other large vision-language models (LVLMs) on seven in-the-wild FER datasets (the projection-head idea is sketched after this entry).
arXiv Detail & Related papers (2024-05-29T14:06:09Z)
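A hedged sketch of the projection-head idea above: a small head maps frozen CLIP embeddings into a facial-action space, and zero-shot scores come from cosine similarity against projected text prompts. All dimensions and names are assumed.

```python
# Assumed-dimension sketch of an Exp-CLIP-style projection head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpProjectionHead(nn.Module):
    def __init__(self, clip_dim=512, action_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, clip_dim), nn.GELU(),
            nn.Linear(clip_dim, action_dim))

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

head = ExpProjectionHead()
img_emb = torch.randn(4, 512)      # frozen CLIP image features (unlabelled faces)
txt_emb = torch.randn(7, 512)      # CLIP text features for 7 expression prompts
logits = head(img_emb) @ head(txt_emb).t()   # zero-shot expression scores
print(logits.argmax(dim=-1))       # predicted expression index per face
```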
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) is a novel method that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content (a crop-and-refocus sketch follows this entry).
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
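A rough sketch of the crop-and-refocus loop described above: ask the LVLM for the key region, crop it at native resolution, and run a second, focused pass. `lvlm_locate` and `lvlm_answer` are hypothetical stand-ins for real model calls.

```python
# Chain-of-Spot-style two-pass reasoning (stand-in model calls).
from PIL import Image

def lvlm_locate(image, question):
    """Hypothetical stand-in: the LVLM proposes a normalized box for the key region."""
    return (0.25, 0.25, 0.75, 0.75)

def lvlm_answer(image, question):
    """Hypothetical stand-in: the LVLM answers a question about the given image."""
    return "a plausible answer"

def chain_of_spot(image, question):
    # First pass: locate the region of interest for this question.
    x0, y0, x1, y1 = lvlm_locate(image, question)
    w, h = image.size
    # Crop at native resolution, gaining detail without resizing the original.
    crop = image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))
    # Second pass: reason over the focused view.
    return lvlm_answer(crop, question)

img = Image.new("RGB", (448, 448))
print(chain_of_spot(img, "What is written on the sign?"))
```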
- Image Translation as Diffusion Visual Programmers [52.09889190442439]
Diffusion Visual Programmer (DVP) is a neuro-symbolic image translation framework.
Our framework seamlessly embeds a condition-flexible diffusion model within the GPT architecture.
Extensive experiments demonstrate DVP's remarkable performance, surpassing concurrent arts.
arXiv Detail & Related papers (2024-01-18T05:50:09Z)
- Text-Guided Face Recognition using Multi-Granularity Cross-Modal Contrastive Learning [0.0]
We introduce text-guided face recognition (TGFR) to analyze the impact of integrating facial attributes in the form of natural language descriptions.
TGFR demonstrates remarkable improvements over existing face recognition models, particularly on low-quality images (a contrastive-alignment sketch follows this entry).
arXiv Detail & Related papers (2023-12-14T22:04:22Z)
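A minimal sketch of cross-modal contrastive alignment in the spirit of TGFR: a symmetric InfoNCE loss between face and caption embeddings. The multi-granularity machinery is omitted and all names are assumed.

```python
# Symmetric InfoNCE between face and text embeddings (assumed setup).
import torch
import torch.nn.functional as F

def cross_modal_nce(face_emb, text_emb, temperature=0.07):
    f = F.normalize(face_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = f @ t.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(f.size(0))         # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = cross_modal_nce(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```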
- CoSeR: Bridging Image and Language for Cognitive Super-Resolution [74.24752388179992]
We introduce the Cognitive Super-Resolution (CoSeR) framework, empowering SR models with the capacity to comprehend low-resolution images.
We achieve this by marrying image appearance and language understanding to generate a cognitive embedding.
To further improve image fidelity, we propose a novel condition injection scheme called "All-in-Attention" (a speculative sketch follows this entry).
arXiv Detail & Related papers (2023-11-27T16:33:29Z)
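The "All-in-Attention" scheme is only named above, not specified. One speculative reading: concatenate every condition (cognitive embedding, LR tokens, reference tokens) into a single key/value set for one cross-attention. The sketch below is that guess, with assumed shapes.

```python
# Speculative "all conditions in one attention" injection (assumed design).
import torch
import torch.nn as nn

class AllInAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, cognitive, lr_tokens, ref_tokens):
        cond = torch.cat([cognitive, lr_tokens, ref_tokens], dim=1)  # all conditions
        out, _ = self.attn(x, cond, cond)
        return x + out  # residual injection into the SR backbone features

x = torch.randn(1, 64, 256)          # backbone feature tokens
out = AllInAttention()(x, torch.randn(1, 4, 256),
                       torch.randn(1, 64, 256), torch.randn(1, 16, 256))
print(out.shape)  # torch.Size([1, 64, 256])
```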
- Recognition-Guided Diffusion Model for Scene Text Image Super-Resolution [15.391125077873745]
Scene Text Image Super-Resolution (STISR) aims to enhance the resolution and legibility of text within low-resolution (LR) images.
Previous methods predominantly employ discriminative Convolutional Neural Networks (CNNs) augmented with diverse forms of text guidance.
We introduce RGDiffSR, a Recognition-Guided Diffusion model for scene text image Super-Resolution, which exhibits great generative diversity and fidelity even in challenging scenarios (a guidance-style sketch follows this entry).
arXiv Detail & Related papers (2023-11-22T11:10:45Z)
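The entry above names recognition guidance but not its mechanism. One common way to realize it is classifier-style guidance, sketched below with stand-in `denoiser` and `recognizer` functions; RGDiffSR's actual scheme may differ.

```python
# Classifier-guidance-style sketch of recognition-guided sampling (stand-ins).
import torch

def guided_step(x_t, t, denoiser, recognizer, target_text, scale=1.0):
    x_t = x_t.detach().requires_grad_(True)
    x0_pred = denoiser(x_t, t)                   # predicted clean image
    score = recognizer(x0_pred, target_text)     # legibility of the target text
    grad = torch.autograd.grad(score.sum(), x_t)[0]
    return (x0_pred + scale * grad).detach()     # nudge the sample toward readable text

denoiser = lambda x, t: x * 0.9                  # stand-in denoiser
recognizer = lambda x, s: x.mean(dim=(1, 2, 3))  # stand-in recognition score
x = torch.randn(2, 3, 32, 128)
print(guided_step(x, 10, denoiser, recognizer, "hello").shape)
```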
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate non-linguistic images into sequences of discrete tokens, like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface that understands and generates multi-modal content simultaneously (a quantization sketch follows this entry).
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
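A discrete visual tokenizer is typically realized with vector quantization; the sketch below shows that baseline with assumed sizes, omitting LaVIT's dynamic selection of token count.

```python
# Vector-quantization sketch of a discrete visual tokenizer (assumed sizes).
import torch
import torch.nn as nn

class VisualTokenizer(nn.Module):
    def __init__(self, dim=256, vocab=16384):
        super().__init__()
        self.codebook = nn.Embedding(vocab, dim)

    def forward(self, patch_feats):                  # (B, N, dim) patch features
        # Distance of every patch feature to every codebook entry.
        d = torch.cdist(patch_feats, self.codebook.weight.unsqueeze(0))
        ids = d.argmin(dim=-1)                       # (B, N) discrete token ids
        return ids, self.codebook(ids)               # ids usable like "foreign words"

tok = VisualTokenizer()
ids, emb = tok(torch.randn(1, 196, 256))
print(ids.shape, emb.shape)  # torch.Size([1, 196]) torch.Size([1, 196, 256])
```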
- VLMAE: Vision-Language Masked Autoencoder [21.97700040013084]
We propose a vision-language masked autoencoder framework (VLMAE) for vision-language pre-training.
VLMAE employs visual generative learning, enabling the model to acquire fine-grained and unbiased features (a masked-reconstruction sketch follows this entry).
arXiv Detail & Related papers (2022-08-19T14:39:18Z)
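A hedged sketch of masked visual generative learning in a vision-language setting: reconstruct masked image patches with text features in context. The tiny transformer and all shapes are assumptions, not VLMAE's actual model.

```python
# MAE-style masked patch reconstruction with text context (assumed model).
import torch
import torch.nn as nn

class TinyVLMAE(nn.Module):
    def __init__(self, patch_dim=768, dim=256):
        super().__init__()
        self.enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 4, batch_first=True), num_layers=2)
        self.embed = nn.Linear(patch_dim, dim)
        self.decode = nn.Linear(dim, patch_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, patches, text_feats, mask):    # mask: (B, N) bool
        x = self.embed(patches)
        # Replace masked patches with a learned mask token.
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        # Encode text and visual tokens jointly; keep only the visual outputs.
        x = self.enc(torch.cat([text_feats, x], dim=1))[:, text_feats.size(1):]
        return self.decode(x)                        # reconstruct all patches

model = TinyVLMAE()
patches = torch.randn(2, 49, 768)                    # flattened image patches
mask = torch.rand(2, 49) < 0.75                      # 75% masking ratio
recon = model(patches, torch.randn(2, 8, 256), mask)
loss = ((recon - patches)[mask] ** 2).mean()         # loss on masked patches only
print(loss.item())
```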
- Unsupervised Image-to-Image Translation with Generative Prior [103.54337984566877]
Unsupervised image-to-image translation aims to learn the translation between two visual domains without paired data.
We present a novel framework, Generative Prior-guided UNsupervised Image-to-image Translation (GP-UNIT), to improve the overall quality and applicability of the translation algorithm.
arXiv Detail & Related papers (2022-04-07T17:59:23Z)