SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation
- URL: http://arxiv.org/abs/2312.16272v2
- Date: Thu, 14 Mar 2024 10:44:49 GMT
- Title: SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation
- Authors: Yuxuan Zhang, Yiren Song, Jiaming Liu, Rui Wang, Jinpeng Yu, Hao Tang, Huaxia Li, Xu Tang, Yao Hu, Han Pan, Zhongliang Jing
- Abstract summary: SSR-Encoder is a novel architecture designed for selectively capturing any subject from single or multiple reference images.
It responds to various query modalities including text and masks, without necessitating test-time fine-tuning.
Characterized by its model generalizability and efficiency, the SSR-Encoder adapts to a range of custom models and control modules.
- Score: 39.84456803546365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in subject-driven image generation have led to zero-shot generation, yet precise selection and focus on crucial subject representations remain challenging. Addressing this, we introduce the SSR-Encoder, a novel architecture designed for selectively capturing any subject from single or multiple reference images. It responds to various query modalities including text and masks, without necessitating test-time fine-tuning. The SSR-Encoder combines a Token-to-Patch Aligner that aligns query inputs with image patches and a Detail-Preserving Subject Encoder for extracting and preserving fine features of the subjects, thereby generating subject embeddings. These embeddings, used in conjunction with original text embeddings, condition the generation process. Characterized by its model generalizability and efficiency, the SSR-Encoder adapts to a range of custom models and control modules. Enhanced by the Embedding Consistency Regularization Loss for improved training, our extensive experiments demonstrate its effectiveness in versatile and high-quality image generation, indicating its broad applicability. Project page: https://ssr-encoder.github.io
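The pipeline described in the abstract (align a text or mask query to reference-image patches, pool the matched patch features into subject embeddings, then condition generation on the original text embeddings plus the subject embeddings) can be illustrated with a minimal PyTorch sketch. Everything below — module names, dimensions, single-head attention — is an illustrative assumption, not the released SSR-Encoder implementation.

```python
# Minimal sketch of the selective-subject conditioning idea from the abstract.
# All module/dimension choices are assumptions, not the authors' code
# (see https://ssr-encoder.github.io for the actual project).
import torch
import torch.nn as nn


class TokenToPatchAligner(nn.Module):
    """Aligns query tokens (e.g. embedded text for "the dog") with image patch
    features, yielding per-patch relevance weights for the queried subject."""

    def __init__(self, query_dim=768, patch_dim=1024, inner_dim=512):
        super().__init__()
        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(patch_dim, inner_dim, bias=False)

    def forward(self, query_tokens, patch_feats):
        # query_tokens: (B, Nq, query_dim); patch_feats: (B, Np, patch_dim)
        q = self.to_q(query_tokens)
        k = self.to_k(patch_feats)
        attn = torch.einsum("bqd,bpd->bqp", q, k) / q.shape[-1] ** 0.5
        return attn.softmax(dim=-1)          # (B, Nq, Np) token-to-patch map


class SubjectEncoder(nn.Module):
    """Pools patch features under the alignment map into subject embeddings
    that can be appended to the ordinary text conditioning."""

    def __init__(self, patch_dim=1024, cond_dim=768):
        super().__init__()
        self.proj = nn.Linear(patch_dim, cond_dim)

    def forward(self, attn_map, patch_feats):
        subject = torch.einsum("bqp,bpd->bqd", attn_map, patch_feats)
        return self.proj(subject)            # (B, Nq, cond_dim)


# Toy usage: concatenate text embeddings and selected subject embeddings into
# one conditioning sequence for a cross-attention-based generator.
B, Nq, Np = 2, 4, 256
query_tokens = torch.randn(B, Nq, 768)        # embedded query ("the dog")
patch_feats = torch.randn(B, Np, 1024)        # reference-image patch features
text_embeds = torch.randn(B, 77, 768)         # ordinary prompt embeddings

aligner, encoder = TokenToPatchAligner(), SubjectEncoder()
subject_embeds = encoder(aligner(query_tokens, patch_feats), patch_feats)
conditioning = torch.cat([text_embeds, subject_embeds], dim=1)  # (B, 81, 768)
print(conditioning.shape)
```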
Related papers
- Exploring Text-Guided Single Image Editing for Remote Sensing Images [30.23541304590692]
This paper proposes a text-guided RSI editing method that is controllable yet stable, and can be trained using only a single image.
It adopts a multi-scale training approach to preserve consistency without the need for training on extensive benchmark datasets.
arXiv Detail & Related papers (2024-05-09T13:45:04Z)
- DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multiple fonts, mixed scenes, and complex layouts seriously degrade the recognition accuracy of traditional OCR models.
We propose a parameter-efficient mixed text recognition method based on pre-trained OCR Transformer, namely DLoRA-TrOCR.
arXiv Detail & Related papers (2024-04-19T09:28:16Z)
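The parameter-efficient idea summarized above — adapting a frozen, pre-trained OCR Transformer with small low-rank updates rather than full fine-tuning — can be sketched with Hugging Face `transformers` and `peft`. The checkpoint, rank, and target modules below are illustrative guesses, and a plain LoRA adapter stands in for the paper's own DLoRA design.

```python
# Hypothetical illustration only: a plain LoRA adapter on a pre-trained TrOCR
# model via the peft library; the actual DLoRA-TrOCR configuration may differ.
from peft import LoraConfig, get_peft_model
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

lora_cfg = LoraConfig(
    r=8,                                  # low-rank dimension (assumed)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # decoder attention projections (assumed)
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()        # only the adapter weights stay trainable
```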
- Text2QR: Harmonizing Aesthetic Customization and Scanning Robustness for Text-Guided QR Code Generation [38.281805719692194]
In the digital era, QR codes serve as a linchpin connecting virtual and physical realms.
Prevailing methods grapple with the intrinsic challenge of balancing customization and scannability.
This paper introduces Text2QR, a pioneering approach leveraging stable-diffusion models.
arXiv Detail & Related papers (2024-03-11T06:03:31Z)
- CoSeR: Bridging Image and Language for Cognitive Super-Resolution [74.24752388179992]
We introduce the Cognitive Super-Resolution (CoSeR) framework, empowering SR models with the capacity to comprehend low-resolution images.
We achieve this by marrying image appearance and language understanding to generate a cognitive embedding.
To further improve image fidelity, we propose a novel condition injection scheme called "All-in-Attention".
arXiv Detail & Related papers (2023-11-27T16:33:29Z)
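One way to read the "cognitive embedding" above is as a small set of language-aligned tokens derived from the low-resolution image and supplied to the generator as extra conditioning. The sketch below is an assumption-laden reading; the fusion layers, token count, and feature sources are placeholders rather than CoSeR's actual architecture or its "All-in-Attention" injection scheme.

```python
# Illustrative sketch (not CoSeR's code): fuse appearance features of the
# low-resolution image with a language/caption feature into a few "cognitive"
# tokens that can be appended to the SR model's conditioning sequence.
import torch
import torch.nn as nn


class CognitiveEncoder(nn.Module):
    def __init__(self, img_dim=1024, txt_dim=768, cond_dim=768, n_tokens=4):
        super().__init__()
        self.n_tokens, self.cond_dim = n_tokens, cond_dim
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + txt_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, n_tokens * cond_dim),
        )

    def forward(self, img_feat, txt_feat):
        # img_feat: (B, img_dim) pooled LR-image features; txt_feat: (B, txt_dim)
        fused = self.fuse(torch.cat([img_feat, txt_feat], dim=-1))
        return fused.view(-1, self.n_tokens, self.cond_dim)


img_feat = torch.randn(2, 1024)                      # e.g. CLIP-style image features
txt_feat = torch.randn(2, 768)                       # e.g. pooled caption features
cognitive = CognitiveEncoder()(img_feat, txt_feat)   # (2, 4, 768) cognitive tokens
prompt_embeds = torch.randn(2, 77, 768)
conditioning = torch.cat([prompt_embeds, cognitive], dim=1)  # extra attention tokens
```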
- Recognition-Guided Diffusion Model for Scene Text Image Super-Resolution [15.391125077873745]
Scene Text Image Super-Resolution (STISR) aims to enhance the resolution and legibility of text within low-resolution (LR) images.
Previous methods predominantly employ discriminative Convolutional Neural Networks (CNNs) augmented with diverse forms of text guidance.
We introduce RGDiffSR, a Recognition-Guided Diffusion model for scene text image Super-Resolution, which exhibits great generative diversity and fidelity even in challenging scenarios.
arXiv Detail & Related papers (2023-11-22T11:10:45Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
- RRSR: Reciprocal Reference-based Image Super-Resolution with Progressive Feature Alignment and Selection [66.08293086254851]
We propose a reciprocal learning framework to reinforce the learning of a RefSR network.
The newly proposed module aligns reference-input images at multi-scale feature spaces and performs reference-aware feature selection.
We empirically show that multiple recent state-of-the-art RefSR models can be consistently improved with our reciprocal learning paradigm.
arXiv Detail & Related papers (2022-11-08T12:39:35Z)
- Reference-based Image Super-Resolution with Deformable Attention Transformer [62.71769634254654]
RefSR aims to exploit auxiliary reference (Ref) images to super-resolve low-resolution (LR) images.
This paper proposes a deformable attention Transformer, namely DATSR, with multiple scales.
Experiments demonstrate that our DATSR achieves state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2022-07-25T07:07:00Z)
- Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer [40.04085054791994]
We propose an effective image generation framework of Draft-and-Revise with Contextual RQ-transformer to consider global contexts during the generation process.
In experiments, our method achieves state-of-the-art results on conditional image generation.
arXiv Detail & Related papers (2022-06-09T12:25:24Z)
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
arXiv Detail & Related papers (2021-08-03T08:38:16Z)
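The text-guided optimization step in this summary follows a common pattern: invert the image to a latent code, then optimize that code so the regenerated image matches the text while staying close to the inversion. Below is a generic sketch, with a CLIP-like cosine similarity standing in for the paper's actual losses and with `G` and `encode_image` as placeholder callables rather than the paper's models.

```python
# Generic latent-optimization sketch (not the paper's code): given a pretrained
# generator G and an inverted latent code, nudge the latent toward a text prompt.
import torch


def text_guided_edit(G, z_init, text_feat, encode_image, steps=200, lr=0.05, reg=0.1):
    """Optimize an inverted latent code so the generated image matches a text
    feature (e.g. from a CLIP-style text encoder) while staying near z_init."""
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        img = G(z)                                       # decode latent to an image
        img_feat = encode_image(img)                     # image feature for matching
        sim = torch.cosine_similarity(img_feat, text_feat, dim=-1).mean()
        loss = -sim + reg * (z - z_init).pow(2).mean()   # preserve original content
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```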