MGD-GAN: Text-to-Pedestrian generation through Multi-Grained
Discrimination
- URL: http://arxiv.org/abs/2010.00947v1
- Date: Fri, 2 Oct 2020 12:24:48 GMT
- Title: MGD-GAN: Text-to-Pedestrian generation through Multi-Grained
Discrimination
- Authors: Shengyu Zhang, Donghui Wang, Zhou Zhao, Siliang Tang, Di Xie, Fei Wu
- Abstract summary: We propose the Multi-Grained Discrimination enhanced Generative Adversarial Network (MGD-GAN), which capitalizes on a human-part-based Discriminator and a self-cross-attended global Discriminator.
A fine-grained word-level attention mechanism is employed in the HPD module to enforce diversified appearance and vivid details.
The substantial improvement over the various metrics demonstrates the efficacy of MGD-GAN on the text-to-pedestrian synthesis scenario.
- Score: 96.91091607251526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we investigate the problem of text-to-pedestrian synthesis,
which has many potential applications in art, design, and video surveillance.
Existing methods for text-to-bird/flower synthesis are still far from solving
this fine-grained image generation problem, due to the complex structure and
heterogeneous appearance that pedestrians naturally take on. To this end,
we propose the Multi-Grained Discrimination enhanced Generative Adversarial
Network (MGD-GAN), which capitalizes on a human-part-based Discriminator (HPD)
and a self-cross-attended (SCA) global Discriminator to capture the coherence
of the complex body structure. A fine-grained word-level attention mechanism
is employed in the HPD module to enforce diversified appearance and
vivid details. In addition, two pedestrian generation metrics, named Pose Score
and Pose Variance, are devised to evaluate the generation quality and
diversity, respectively. We conduct extensive experiments and ablation studies
on the caption-annotated pedestrian dataset, the CUHK Person Description Dataset (CUHK-PEDES).
The substantial improvement over the various metrics demonstrates the efficacy
of MGD-GAN on the text-to-pedestrian synthesis scenario.
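As a rough illustration of the multi-grained discrimination described above, the PyTorch-style sketch below pairs a word-attended part-level discriminator with a self-attended global discriminator under a shared hinge loss. All module names, feature dimensions, and the weighting `lam` are assumptions for exposition, not the authors' released implementation.

```python
# Illustrative two-branch discrimination: a part-level discriminator whose
# regions attend to caption words, plus a self-attended global discriminator,
# trained with a shared hinge loss. Names and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartDiscriminator(nn.Module):
    """Scores body-part regions conditioned on word embeddings."""
    def __init__(self, feat_dim=256, word_dim=256):
        super().__init__()
        self.word_proj = nn.Linear(word_dim, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, part_feat, word_emb):
        # part_feat: (B, P, D) region features; word_emb: (B, T, word_dim).
        words = self.word_proj(word_emb)
        # Each part attends to the words that describe it (word-level attention).
        attended, _ = self.attn(part_feat, words, words)
        return self.score(attended).mean(dim=(1, 2))  # (B,) realness score

class GlobalDiscriminator(nn.Module):
    """Scores the whole image, using self-attention over spatial features
    to capture long-range coherence of the body structure."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, img_feat):
        # img_feat: (B, HW, D) flattened spatial features.
        attended, _ = self.attn(img_feat, img_feat, img_feat)
        return self.score(attended).mean(dim=(1, 2))

def discriminator_loss(d_part, d_global, real, fake, words, lam=1.0):
    """Hinge loss applied at both granularities; `fake` features are
    assumed detached from the generator."""
    l_part = (F.relu(1 - d_part(real["parts"], words)).mean()
              + F.relu(1 + d_part(fake["parts"], words)).mean())
    l_glob = (F.relu(1 - d_global(real["img"])).mean()
              + F.relu(1 + d_global(fake["img"])).mean())
    return l_glob + lam * l_part
```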
Related papers
- Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks.
Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval.
This paper proposes a unified approach that integrates the strengths of both paradigms.
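A minimal sketch of how such a unified objective might combine the two signals, pairing a next-token generative loss with a CLIP-style contrastive loss; the weight `alpha`, temperature, and tensor shapes are illustrative assumptions, not the paper's recipe.

```python
# Illustrative joint objective: next-token generative loss plus a
# CLIP-style symmetric contrastive loss over pooled image/text embeddings.
import torch
import torch.nn.functional as F

def joint_loss(lm_logits, target_ids, img_emb, txt_emb, alpha=0.5, temp=0.07):
    # Generative term: token-level cross-entropy.
    # lm_logits: (B, T, V); target_ids: (B, T).
    gen = F.cross_entropy(lm_logits.flatten(0, 1), target_ids.flatten())
    # Discriminative term: InfoNCE over in-batch image-text pairs.
    img = F.normalize(img_emb, dim=-1)   # (B, D)
    txt = F.normalize(txt_emb, dim=-1)   # (B, D)
    logits = img @ txt.t() / temp        # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    disc = (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2
    return gen + alpha * disc
```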
arXiv Detail & Related papers (2024-11-01T01:51:31Z)
- Comprehensive Generative Replay for Task-Incremental Segmentation with Concurrent Appearance and Semantic Forgetting [49.87694319431288]
Generalist segmentation models are increasingly favored for diverse tasks involving various objects from different image sources.
We propose a Comprehensive Generative Replay (CGR) framework that restores appearance and semantic knowledge by synthesizing image-mask pairs.
Experiments on incremental tasks (cardiac, fundus and prostate segmentation) show its clear advantage for alleviating concurrent appearance and semantic forgetting.
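A schematic of the generative-replay training step this entry describes, where a frozen generator from earlier tasks supplies synthetic image-mask pairs that are mixed into current-task batches; `generator`, `segmenter`, `generator.sample`, and the replay ratio are hypothetical stand-ins.

```python
# Schematic generative replay step for task-incremental segmentation.
import torch
import torch.nn.functional as F

def train_step(segmenter, generator, real_imgs, real_masks, optimizer,
               replay_ratio=0.5):
    with torch.no_grad():
        n_replay = max(1, int(real_imgs.size(0) * replay_ratio))
        # Rehearse old appearance and semantics via synthetic pairs.
        fake_imgs, fake_masks = generator.sample(n_replay)  # masks: class indices
    imgs = torch.cat([real_imgs, fake_imgs], dim=0)
    masks = torch.cat([real_masks, fake_masks], dim=0)
    loss = F.cross_entropy(segmenter(imgs), masks)  # joint real+replay loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```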
arXiv Detail & Related papers (2024-06-28T10:05:58Z)
- Controllable Face Synthesis with Semantic Latent Diffusion Models [6.438244172631555]
We propose a semantic image synthesis (SIS) framework based on a novel Latent Diffusion Model architecture for human face generation and editing.
The proposed system utilizes both SPADE normalization and cross-attention layers to merge shape and style information and, by doing so, allows for a precise control over each of the semantic parts of the human face.
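For reference, a minimal SPADE-style normalization block of the kind the entry mentions: activations are normalized and then modulated by per-pixel scale and shift predicted from the semantic layout. Channel sizes here are illustrative, not the paper's configuration.

```python
# Minimal SPADE-style normalization: normalize activations, then modulate
# them with spatially varying scale/shift predicted from the semantic map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    def __init__(self, channels, label_channels, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x, segmap):
        # Resize the layout to the feature resolution, then predict
        # per-pixel scale and shift so shape information survives
        # the normalization step.
        segmap = F.interpolate(segmap, size=x.shape[2:], mode="nearest")
        h = self.shared(segmap)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)
```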
arXiv Detail & Related papers (2024-03-19T14:02:13Z)
- GUESS: GradUally Enriching SyntheSis for Text-Driven Human Motion Generation [23.435588151215594]
We propose a novel cascaded diffusion-based generative framework for text-driven human motion synthesis.
The framework exploits a strategy named GradUally Enriching SyntheSis (GUESS).
We show that GUESS outperforms existing state-of-the-art methods by large margins in terms of accuracy, realism, and diversity.
arXiv Detail & Related papers (2024-01-04T08:48:21Z)
- Syntax-Guided Transformers: Elevating Compositional Generalization and Grounding in Multimodal Environments [20.70294450587676]
We exploit the syntactic structure of language to boost compositional generalization.
We introduce and evaluate the merits of using syntactic information in the multimodal grounding problem.
The results push the state-of-the-art in multimodal grounding and parameter-efficient modeling.
arXiv Detail & Related papers (2023-11-07T21:59:16Z)
- Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations [61.132408427908175]
Zero-shot GAN adaptation aims to reuse well-trained generators to synthesize images of an unseen target domain.
With only a single representative text feature instead of real images, the synthesized images gradually lose diversity.
We propose a novel method to find semantic variations of the target text in the CLIP space.
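A toy sketch of sampling semantically nearby variations of a target text feature in CLIP space; the perturbation scheme and scale `eps` are assumptions standing in for the paper's optimization procedure.

```python
# Toy sketch: perturb a target CLIP text feature with small unit-norm
# directions to obtain diverse but semantically close variations.
import torch
import torch.nn.functional as F

def semantic_variations(text_emb, n=8, eps=0.3):
    # text_emb: (D,) CLIP text feature of the target-domain prompt.
    base = F.normalize(text_emb, dim=-1)
    out = []
    for _ in range(n):
        direction = F.normalize(torch.randn_like(base), dim=-1)
        # An eps-scaled unit perturbation keeps each variation close to the
        # target (cosine ~ 1/sqrt(1 + eps^2) in high dimensions).
        out.append(F.normalize(base + eps * direction, dim=-1))
    return torch.stack(out)  # (n, D) diverse target-text directions
```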
arXiv Detail & Related papers (2023-08-21T08:12:28Z)
- On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approaches human quality, the sample size required for reliable detection increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors including RoBERTa-Large/Base-Detector and GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z)
- Paraphrase Detection: Human vs. Machine Content [3.8768839735240737]
Human-authored paraphrases exceed machine-generated ones in terms of difficulty, diversity, and similarity.
Transformers emerged as the most effective method across datasets, with TF-IDF excelling on semantically diverse corpora.
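A minimal TF-IDF similarity baseline of the kind this comparison describes, using scikit-learn; the decision threshold is an illustrative assumption.

```python
# Minimal TF-IDF baseline for paraphrase detection: vectorize both texts
# and threshold their cosine similarity. The threshold is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def is_paraphrase(a: str, b: str, threshold: float = 0.7) -> bool:
    vecs = TfidfVectorizer().fit_transform([a, b])
    return cosine_similarity(vecs[0], vecs[1])[0, 0] >= threshold

print(is_paraphrase("The cat sat on the mat.", "A cat was sitting on the mat."))
```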
arXiv Detail & Related papers (2023-03-24T13:25:46Z)
- DSE-GAN: Dynamic Semantic Evolution Generative Adversarial Network for Text-to-Image Generation [71.87682778102236]
We propose a novel Dynamic Semantic Evolution GAN (DSE-GAN) to re-compose each stage's text features within a single adversarial multi-stage architecture.
DSE-GAN achieves 7.48% and 37.8% relative FID improvement on two widely used benchmarks.
arXiv Detail & Related papers (2022-09-03T06:13:26Z)
- Multi-Attributed and Structured Text-to-Face Synthesis [1.3381749415517017]
Generative Adversarial Networks (GANs) have revolutionized image synthesis through many applications like face generation, photograph editing, and image super-resolution.
This paper empirically proves that increasing the number of facial attributes in each textual description helps GANs generate more diverse and real-looking faces.
arXiv Detail & Related papers (2021-08-25T07:52:21Z)