Boosting Human-Object Interaction Detection with Text-to-Image Diffusion Model
- URL: http://arxiv.org/abs/2305.12252v1
- Date: Sat, 20 May 2023 17:59:23 GMT
- Title: Boosting Human-Object Interaction Detection with Text-to-Image Diffusion Model
- Authors: Jie Yang, Bingliang Li, Fengyu Yang, Ailing Zeng, Lei Zhang, Ruimao Zhang
- Abstract summary: We introduce DiffHOI, a novel HOI detection scheme grounded on a pre-trained text-image diffusion model.
To fill in the gaps of HOI datasets, we propose SynHOI, a class-balanced, large-scale, and high-diversity synthetic dataset.
Experiments demonstrate that DiffHOI significantly outperforms the state-of-the-art in regular detection (i.e., 41.50 mAP) and zero-shot detection.
- Score: 22.31860516617302
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates the problems of current HOI detection methods and
introduces DiffHOI, a novel HOI detection scheme grounded on a pre-trained
text-image diffusion model, which enhances the detector's performance via
improved data diversity and HOI representation. We demonstrate that the
internal representation space of a frozen text-to-image diffusion model is
highly relevant to verb concepts and their corresponding context. Accordingly,
we propose an adapter-style tuning method to extract various semantically
associated representations from a frozen diffusion model and a CLIP model to
enhance the human and object representations from the pre-trained detector,
further reducing the ambiguity in interaction prediction. Moreover, to fill in
the gaps of HOI datasets, we propose SynHOI, a class-balanced, large-scale, and
high-diversity synthetic dataset containing over 140K HOI images with complete
triplet annotations. It is built using an automatic and scalable pipeline
designed to scale up the generation of diverse and high-precision HOI-annotated
data. SynHOI could effectively relieve the long-tail issue in existing datasets
and facilitate learning interaction representations. Extensive experiments
demonstrate that DiffHOI significantly outperforms the state-of-the-art in
regular detection (i.e., 41.50 mAP) and zero-shot detection. Furthermore,
SynHOI can improve the performance of model-agnostic and backbone-agnostic HOI
detection, particularly exhibiting an outstanding 11.55% mAP improvement in
rare classes.
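The adapter-style tuning described in the abstract can be sketched schematically: the diffusion and CLIP backbones stay frozen, small trainable adapters transform their features, and the results are fused with the detector's human/object representations. The dimensions, the bottleneck-adapter form, and the additive fusion below are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def adapter(features, w_down, w_up):
    """Bottleneck adapter: down-project, ReLU, up-project with a residual.
    Only the adapter weights would be trained; the backbone that produced
    `features` stays frozen."""
    hidden = np.maximum(features @ w_down, 0.0)  # down-projection + ReLU
    return features + hidden @ w_up              # residual up-projection

# Illustrative dimensions (assumptions, not from the paper).
d_feat, d_bottleneck = 256, 64

# Stand-ins for frozen features of one human-object pair.
detector_tok = rng.normal(size=d_feat)    # pre-trained HOI detector token
diffusion_feat = rng.normal(size=d_feat)  # frozen text-to-image diffusion feature
clip_feat = rng.normal(size=d_feat)       # frozen CLIP feature

# The only trainable parameters in this sketch are the adapter weights.
w_down_a = rng.normal(size=(d_feat, d_bottleneck)) * 0.02
w_up_a = rng.normal(size=(d_bottleneck, d_feat)) * 0.02
w_down_b = rng.normal(size=(d_feat, d_bottleneck)) * 0.02
w_up_b = rng.normal(size=(d_bottleneck, d_feat)) * 0.02

# Fuse the adapted semantic features into the detector representation.
enhanced = detector_tok + adapter(diffusion_feat, w_down_a, w_up_a) \
                        + adapter(clip_feat, w_down_b, w_up_b)
print(enhanced.shape)  # (256,)
```

The point of the bottleneck design is that the number of trainable parameters stays small relative to the frozen backbones, which is what makes adapter-style tuning cheap compared with full fine-tuning.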
Related papers
- CycleHOI: Improving Human-Object Interaction Detection with Cycle Consistency of Detection and Generation [37.45945633515955]
We propose a new learning framework, coined as CycleHOI, to boost the performance of human-object interaction (HOI) detection.
Our key design is to introduce a novel cycle consistency loss for the training of HOI detector.
We perform extensive experiments to verify the effectiveness and generalization power of our CycleHOI.
arXiv Detail & Related papers (2024-07-16T06:55:43Z)
- DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception [78.26734070960886]
Current perceptive models heavily depend on resource-intensive datasets.
We introduce perception-aware loss (P.A. loss) through segmentation, improving both quality and controllability.
Our method customizes data augmentation by extracting and utilizing perception-aware attribute (P.A. Attr) during generation.
arXiv Detail & Related papers (2024-03-20T04:58:03Z)
- Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation [59.184980778643464]
Fine-tuning diffusion models remains an underexplored frontier in generative artificial intelligence (GenAI).
In this paper, we introduce an innovative technique called self-play fine-tuning for diffusion models (SPIN-Diffusion).
Our approach offers an alternative to conventional supervised fine-tuning and RL strategies, significantly improving both model performance and alignment.
arXiv Detail & Related papers (2024-02-15T18:59:18Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- SODA: Bottleneck Diffusion Models for Representation Learning [75.7331354734152]
We introduce SODA, a self-supervised diffusion model, designed for representation learning.
The model incorporates an image encoder, which distills a source view into a compact representation, that guides the generation of related novel views.
We show that by imposing a tight bottleneck between the encoder and a denoising decoder, we can turn diffusion models into strong representation learners.
arXiv Detail & Related papers (2023-11-29T18:53:34Z)
- SatDM: Synthesizing Realistic Satellite Image with Semantic Layout Conditioning using Diffusion Models [0.0]
Denoising Diffusion Probabilistic Models (DDPMs) have demonstrated significant promise in synthesizing realistic images from semantic layouts.
In this paper, a conditional DDPM model capable of taking a semantic map and generating high-quality, diverse, and correspondingly accurate satellite images is implemented.
The effectiveness of our proposed model is validated using a meticulously labeled dataset introduced within the context of this study.
arXiv Detail & Related papers (2023-09-28T19:39:13Z)
- DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection [41.436817746749384]
Diffusion Model is a scalable data engine for object detection.
DiffusionEngine (DE) provides high-quality detection-oriented training pairs in a single stage.
arXiv Detail & Related papers (2023-09-07T17:55:01Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- Denoising Diffusion Probabilistic Models for Generation of Realistic Fully-Annotated Microscopy Image Data Sets [1.07539359851877]
In this study, we demonstrate that diffusion models can effectively generate fully-annotated microscopy image data sets.
The proposed pipeline helps to reduce the reliance on manual annotations when training deep learning-based segmentation approaches.
arXiv Detail & Related papers (2023-01-02T14:17:08Z)
- DecAug: Augmenting HOI Detection via Decomposition [54.65572599920679]
Current algorithms suffer from insufficient training samples and category imbalance within datasets.
We propose an efficient and effective data augmentation method called DecAug for HOI detection.
Experiments show that our method brings up to 3.3 mAP and 1.6 mAP improvements on the V-COCO and HICO-DET datasets.
arXiv Detail & Related papers (2020-10-02T13:59:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.