QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain
- URL: http://arxiv.org/abs/2411.19534v1
- Date: Fri, 29 Nov 2024 08:20:12 GMT
- Title: QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain
- Authors: Wenfang Sun, Yingjun Du, Gaowen Liu, Cees G. M. Snoek
- Abstract summary: We tackle the problem of quantifying the number of objects generated by a text-to-image model.
Rather than retraining such a model for each new image domain of interest, we are the first to consider this problem from a domain-agnostic perspective.
We propose QUOTA, an optimization framework for text-to-image models that enables effective object quantification across unseen domains without retraining.
- Abstract: We tackle the problem of quantifying the number of objects generated by a text-to-image model. Rather than retraining such a model for each new image domain of interest, which leads to high computational costs and limited scalability, we are the first to consider this problem from a domain-agnostic perspective. We propose QUOTA, an optimization framework for text-to-image models that enables effective object quantification across unseen domains without retraining. It leverages a dual-loop meta-learning strategy to optimize a domain-invariant prompt. Further, by integrating prompt learning with learnable counting and domain tokens, our method captures stylistic variations and maintains accuracy, even for object classes not encountered during training. For evaluation, we adopt a new benchmark specifically designed for object quantification in domain generalization, enabling rigorous assessment of object quantification accuracy and adaptability across unseen domains in text-to-image generation. Extensive experiments demonstrate that QUOTA outperforms conventional models in both object quantification accuracy and semantic consistency, setting a new standard for efficient and scalable text-to-image generation for any domain.
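The dual-loop meta-learning strategy can be pictured with a toy, first-order numerical sketch. Everything below is an illustrative assumption rather than the paper's implementation: random linear "domains" stand in for styled image distributions, a dot product stands in for the counting model, and only the inner/outer (meta-train/meta-test) loop structure mirrors the method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: each "domain" is a random linear map from a learnable prompt
# embedding to a predicted object count. QUOTA itself optimizes soft prompt
# tokens inside a frozen text-to-image model; only the dual-loop structure
# is mirrored here.
DIM = 8
TARGET_COUNT = 3.0
domains = [rng.normal(size=DIM) for _ in range(4)]

def count_loss(prompt, domain_w):
    pred = prompt @ domain_w              # stand-in for a counting model
    return (pred - TARGET_COUNT) ** 2

def count_grad(prompt, domain_w):
    return 2.0 * (prompt @ domain_w - TARGET_COUNT) * domain_w

prompt = np.zeros(DIM)                    # shared, domain-invariant prompt
inner_lr = outer_lr = 0.02

for _ in range(500):
    train_idx, test_idx = rng.choice(len(domains), size=2, replace=False)
    # Inner loop: adapt the prompt on the meta-train domain.
    adapted = prompt - inner_lr * count_grad(prompt, domains[train_idx])
    # Outer loop: update the shared prompt with the adapted prompt's
    # meta-test loss (first-order approximation, FOMAML-style).
    prompt -= outer_lr * count_grad(adapted, domains[test_idx])

worst = max(count_loss(prompt, w) for w in domains)
```

After training, the single shared prompt attains a low counting loss on every domain simultaneously, which is the domain-invariance property the outer loop is meant to enforce.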
Related papers
- Iterative Object Count Optimization for Text-to-image Diffusion Models [59.03672816121209]
Current models, which learn from image-text pairs, inherently struggle with counting.
We propose optimizing the generated image based on a counting loss derived from a counting model that aggregates an object's potential.
We evaluate the generation of various objects and show significant improvements in accuracy.
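A minimal sketch of the counting-loss idea described above, under stated assumptions: a differentiable "soft count" over pixel activations stands in for the paper's counting model, and a raw array is optimized directly by gradient descent rather than through a diffusion model.

```python
import numpy as np

rng = np.random.default_rng(1)

def soft_count(x):
    # Differentiable stand-in for "aggregating an object's potential":
    # sigmoid activations summed over all pixels.
    return (1.0 / (1.0 + np.exp(-x))).sum()

image = rng.normal(size=(16, 16))   # stand-in for an image or latent
target = 100.0                      # desired object count
lr = 0.01

for _ in range(300):
    s = 1.0 / (1.0 + np.exp(-image))
    # Gradient of the counting loss (soft_count - target)^2 w.r.t. the image.
    image -= lr * 2.0 * (soft_count(image) - target) * s * (1.0 - s)
```

The loop drives the soft count toward the target; in the actual method the same loss signal would instead steer the generation process.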
arXiv Detail & Related papers (2024-08-21T15:51:46Z)
- Unified Language-driven Zero-shot Domain Adaptation [55.64088594551629]
Unified Language-driven Zero-shot Domain Adaptation (ULDA) is a novel task setting.
It enables a single model to adapt to diverse target domains without explicit domain-ID knowledge.
arXiv Detail & Related papers (2024-04-10T16:44:11Z)
- Few-Shot Object Detection with Sparse Context Transformers [37.106378859592965]
Few-shot detection is a major task in pattern recognition that seeks to localize objects using models trained with only a few labeled examples.
We propose a novel sparse context transformer (SCT) that effectively leverages object knowledge in the source domain, and automatically learns a sparse context from only few training images in the target domain.
We evaluate the proposed method on two challenging few-shot object detection benchmarks, and empirical results show that the proposed method obtains competitive performance compared to the related state-of-the-art.
arXiv Detail & Related papers (2024-02-14T17:10:01Z)
- Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations [61.132408427908175]
Zero-shot GAN adaptation aims to reuse well-trained generators to synthesize images of an unseen target domain.
With only a single representative text feature instead of real images, the synthesized images gradually lose diversity.
We propose a novel method to find semantic variations of the target text in the CLIP space.
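One way to picture "semantic variations in the CLIP space" is a rejection-sampling sketch: perturb the target text embedding and keep only perturbations that remain semantically close. This is a hypothetical illustration; random vectors replace real CLIP embeddings, and the dimensionality, noise scale, and similarity threshold are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random unit vector standing in for the CLIP embedding of the target text.
target = rng.normal(size=512)
target /= np.linalg.norm(target)

# Keep small perturbations of the target embedding that stay semantically
# close (high cosine similarity), discarding those that drift too far.
variations = []
while len(variations) < 8:
    candidate = target + 0.01 * rng.normal(size=512)
    candidate /= np.linalg.norm(candidate)
    if cosine(candidate, target) > 0.9:
        variations.append(candidate)
```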
arXiv Detail & Related papers (2023-08-21T08:12:28Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
- SemAug: Semantically Meaningful Image Augmentations for Object Detection Through Language Grounding [5.715548995729382]
We propose an effective technique for image augmentation by injecting contextually meaningful knowledge into the scenes.
Our method of semantically meaningful image augmentation for object detection via language grounding, SemAug, starts by identifying semantically appropriate new objects to place into a scene.
arXiv Detail & Related papers (2022-08-15T19:00:56Z)
- Context-Conditional Adaptation for Recognizing Unseen Classes in Unseen Domains [48.17225008334873]
We propose a feature generative framework integrated with a COntext COnditional Adaptive (COCOA) Batch-Normalization.
The generated visual features better capture the underlying data distribution enabling us to generalize to unseen classes and domains at test-time.
We thoroughly evaluate and analyse our approach on the established large-scale DomainNet benchmark.
arXiv Detail & Related papers (2021-07-15T17:51:16Z)
- PixMatch: Unsupervised Domain Adaptation via Pixelwise Consistency Training [4.336877104987131]
Unsupervised domain adaptation is a promising technique for semantic segmentation.
We present a novel framework for unsupervised domain adaptation based on the notion of target-domain consistency training.
Our approach is simpler, easier to implement, and more memory-efficient during training.
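The target-domain consistency idea can be sketched in a few lines, under stated assumptions: a one-parameter logistic map stands in for a segmentation network, and additive noise stands in for the perturbations used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

def segment(image, w):
    # Stand-in "segmentation model": per-pixel logistic score with a single
    # scalar weight.
    return 1.0 / (1.0 + np.exp(-w * image))

image = rng.normal(size=(8, 8))                    # unlabeled target image
perturbed = image + 0.1 * rng.normal(size=(8, 8))  # e.g. noise augmentation

w = 1.0
# The model's own prediction on the clean image becomes a hard pseudo-label;
# the pixelwise consistency loss penalizes disagreement on the perturbed copy.
pseudo_label = (segment(image, w) > 0.5).astype(float)
consistency_loss = float(((segment(perturbed, w) - pseudo_label) ** 2).mean())
```

During training this loss would be minimized alongside the supervised source-domain loss, pushing the model toward predictions that are stable under perturbation of target images.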
arXiv Detail & Related papers (2021-05-17T19:36:28Z)
- Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric, which is better suited for multi-object images.
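The object-centric idea can be sketched as a Fréchet distance computed over per-object features rather than whole-image features. This is a hypothetical sketch: random vectors replace the Inception features of object crops that the metric actually uses, and only the standard Fréchet formula between fitted Gaussians is shown.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    # Fréchet distance between Gaussians fitted to two feature sets.
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr((cov_a @ cov_b)^(1/2)) via the eigenvalues of the product, which
    # are real and non-negative for positive semi-definite covariances.
    eigs = np.linalg.eigvals(cov_a @ cov_b).real
    tr_sqrt = np.sqrt(np.clip(eigs, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 16))           # features of real object crops
fake_feats = rng.normal(loc=0.5, size=(500, 16))  # shifted "generated" crops
```

Identical feature sets give a distance near zero, while a distribution shift in the per-object features produces a clearly positive score.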
arXiv Detail & Related papers (2020-03-16T21:40:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information and is not responsible for any consequences of its use.