LaPIG: Cross-Modal Generation of Paired Thermal and Visible Facial Images
- URL: http://arxiv.org/abs/2503.16376v1
- Date: Thu, 20 Mar 2025 17:39:06 GMT
- Title: LaPIG: Cross-Modal Generation of Paired Thermal and Visible Facial Images
- Authors: Leyang Wang, Joice Lin,
- Abstract summary: We propose a framework called LaPIG, which enables the construction of high-quality paired visible and thermal images using captions generated by Large Language Models (LLMs). Our approach not only generates multi-view paired visible and thermal images to increase data diversity but also produces high-quality paired data while maintaining their identity information. We evaluate our method on public datasets by comparing it with existing methods, demonstrating the superiority of LaPIG.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of modern machine learning, particularly in facial translation networks, is highly dependent on the availability of high-quality, paired, large-scale datasets. However, acquiring sufficient data is often challenging and costly. Inspired by the recent success of diffusion models in high-quality image synthesis and advancements in Large Language Models (LLMs), we propose a novel framework called LLM-assisted Paired Image Generation (LaPIG). This framework enables the construction of comprehensive, high-quality paired visible and thermal images using captions generated by LLMs. Our method encompasses three parts: visible image synthesis with ArcFace embedding, thermal image translation using Latent Diffusion Models (LDMs), and caption generation with LLMs. Our approach not only generates multi-view paired visible and thermal images to increase data diversity but also produces high-quality paired data while maintaining their identity information. We evaluate our method on public datasets by comparing it with existing methods, demonstrating the superiority of LaPIG.
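As a rough illustration of the three-part pipeline named in the abstract (caption generation with an LLM, identity-conditioned visible image synthesis with ArcFace embeddings, and LDM-based visible-to-thermal translation), here is a minimal, hypothetical sketch. All function names and interfaces are assumptions for illustration only; each model is replaced by a stub returning random tensors, so this shows the data flow, not the paper's implementation.

```python
# Hypothetical sketch of the LaPIG data flow; all names below are
# illustrative assumptions, and each model is replaced by a stub that
# returns random tensors of plausible shape.
import torch

def generate_caption(identity_id: int, view: str) -> str:
    # Stand-in for the LLM caption generator.
    return f"A {view}-view photo of subject {identity_id}, neutral expression."

def arcface_embedding(ref_image: torch.Tensor) -> torch.Tensor:
    # Stand-in for a pretrained ArcFace encoder (512-d identity embedding).
    return torch.randn(512)

def synthesize_visible(caption: str, id_embed: torch.Tensor) -> torch.Tensor:
    # Stand-in for the caption- and identity-conditioned visible-face generator.
    return torch.rand(3, 256, 256)

def translate_to_thermal(visible: torch.Tensor) -> torch.Tensor:
    # Stand-in for the LDM-based visible-to-thermal translator.
    return torch.rand(1, 256, 256)

def lapig_pair(ref_image: torch.Tensor, identity_id: int, view: str):
    """Produce one (visible, thermal) pair for a given identity and view."""
    caption = generate_caption(identity_id, view)
    id_embed = arcface_embedding(ref_image)
    visible = synthesize_visible(caption, id_embed)
    thermal = translate_to_thermal(visible)
    return visible, thermal

# Multi-view pairs for one identity, mirroring the data-diversity claim.
ref = torch.rand(3, 112, 112)
pairs = [lapig_pair(ref, identity_id=0, view=v) for v in ("frontal", "left", "right")]
print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)
```

The stubs are only meant to show how a single ArcFace identity embedding ties the visible and thermal outputs to the same subject across views.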
Related papers
- Multi-focal Conditioned Latent Diffusion for Person Image Synthesis [59.113899155476005]
The Latent Diffusion Model (LDM) has demonstrated strong capabilities in high-resolution image generation.
We propose a Multi-focal Conditioned Latent Diffusion (MCLD) method to address these limitations.
Our approach utilizes a multi-focal condition aggregation module, which effectively integrates facial identity and texture-specific information.
arXiv Detail & Related papers (2025-03-19T20:50:10Z) - Image Augmentation Agent for Weakly Supervised Semantic Segmentation [19.654959889052638]
Weakly-supervised semantic segmentation (WSSS) has achieved remarkable progress using only image-level labels.
We introduce a novel approach called Image Augmentation Agent (IAA), which shows that it is possible to enhance WSSS from a data generation perspective.
IAA mainly designs an augmentation agent that leverages large language models (LLMs) and diffusion models to automatically generate additional images for WSSS.
arXiv Detail & Related papers (2024-12-29T11:32:55Z) - Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [49.439311430360284]
We introduce a novel data synthesis method inspired by contrastive learning and image difference captioning. Our key idea involves challenging the model to discern both matching and distinct elements. We leverage this generated dataset to fine-tune state-of-the-art (SOTA) MLLMs.
arXiv Detail & Related papers (2024-08-08T17:10:16Z) - SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models [39.21242589835842]
We introduce SynthVLM, a novel data synthesis and curation method for generating image-caption pairs. To demonstrate the power of SynthVLM, we introduce SynthVLM-100K, a high-quality dataset consisting of 100,000 curated and synthesized image-caption pairs. In both model and human evaluations, SynthVLM-100K outperforms traditional real-world datasets.
arXiv Detail & Related papers (2024-07-30T11:57:40Z) - TIE: Revolutionizing Text-based Image Editing for Complex-Prompt Following and High-Fidelity Editing [23.51498634405422]
We present an innovative image editing framework that employs the robust Chain-of-Thought reasoning and localizing capabilities of multimodal large language models.
Our model exhibits an enhanced ability to understand complex prompts and generate corresponding images, while maintaining high fidelity and consistency between the images before and after editing.
arXiv Detail & Related papers (2024-05-27T03:50:37Z) - Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings [16.28853186016663]
We create synthetic image-text pairs for efficient and effective Visual-Language Models (VLMs) training.
Our method employs a pretrained text-to-image model to synthesize image embeddings from captions generated by an LLM.
Our VLM, finetuned on synthetic data, achieves performance comparable to models trained solely on human-annotated data. (A minimal sketch of this caption-to-embedding pipeline appears after the list below.)
arXiv Detail & Related papers (2024-03-12T15:36:42Z) - Learned representation-guided diffusion models for large-image generation [58.192263311786824]
We introduce a novel approach that trains diffusion models conditioned on embeddings from self-supervised learning (SSL).
Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images.
Augmenting real data by generating variations of real images improves downstream accuracy for patch-level and larger, image-scale classification tasks.
arXiv Detail & Related papers (2023-12-12T14:45:45Z) - LLMGA: Multimodal Large Language Model based Generation Assistant [53.150283805515926]
We introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA) to assist users in image generation and editing.
We train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts.
Extensive results show that LLMGA has promising generation and editing capabilities and can enable more flexible and expansive applications.
arXiv Detail & Related papers (2023-11-27T13:37:26Z) - Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z) - Interactive Data Synthesis for Systematic Vision Adaptation via LLMs-AIGCs Collaboration [48.54002313329872]
We propose a new paradigm of annotated data expansion named ChatGenImage.
The core idea behind it is to leverage the complementary strengths of diverse models to establish a highly effective and user-friendly pipeline for interactive data augmentation.
We present fascinating results obtained from our ChatGenImage framework and demonstrate the powerful potential of our synthetic data for systematic vision adaptation.
arXiv Detail & Related papers (2023-05-22T07:53:36Z)
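For the Synth$^2$ entry above, the caption-to-image-embedding pipeline can be sketched in a similarly hypothetical way: an LLM writes captions, a text-to-image model is queried only up to its embedding stage, and a small VLM head is trained on the resulting pairs. Every name and shape here is an assumption, with stubs standing in for the pretrained models.

```python
# Hypothetical sketch of the Synth^2-style caption-to-embedding idea;
# names, shapes, and the toy head are assumptions, not the paper's code.
import torch
from torch import nn

EMBED_DIM = 768  # assumed embedding width

def llm_generate_captions(n: int) -> list:
    # Stand-in for an LLM that writes diverse synthetic captions.
    return [f"synthetic caption #{i}" for i in range(n)]

def caption_to_image_embedding(caption: str) -> torch.Tensor:
    # Stand-in for a pretrained text-to-image model used only up to its
    # latent/embedding stage (no pixel decoding), per the abstract's
    # efficiency argument.
    return torch.randn(EMBED_DIM)

class TinyVLMHead(nn.Module):
    """Toy projection standing in for the VLM trained on synthetic pairs."""
    def __init__(self) -> None:
        super().__init__()
        self.proj = nn.Linear(EMBED_DIM, EMBED_DIM)

    def forward(self, image_embeddings: torch.Tensor) -> torch.Tensor:
        return self.proj(image_embeddings)

captions = llm_generate_captions(4)
image_embeddings = torch.stack([caption_to_image_embedding(c) for c in captions])
print(TinyVLMHead()(image_embeddings).shape)  # torch.Size([4, 768])
```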
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.