Related papers: CycleHOI: Improving Human-Object Interaction Detection with Cycle Consistency of Detection and Generation

URL: http://arxiv.org/abs/2407.11433v1
Date: Tue, 16 Jul 2024 06:55:43 GMT
Title: CycleHOI: Improving Human-Object Interaction Detection with Cycle Consistency of Detection and Generation
Authors: Yisen Wang, Yao Teng, Limin Wang
Abstract summary: We propose a new learning framework, coined as CycleHOI, to boost the performance of human-object interaction (HOI) detection. Our key design is to introduce a novel cycle consistency loss for the training of HOI detector. We perform extensive experiments to verify the effectiveness and generalization power of our CycleHOI.
Score: 37.45945633515955
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recognition and generation are two fundamental tasks in computer vision, which are often investigated separately in the existing literature. However, these two tasks are highly correlated in essence, as they both require understanding the underlying semantics of visual concepts. In this paper, we propose a new learning framework, coined as CycleHOI, to boost the performance of human-object interaction (HOI) detection by bridging the DETR-based detection pipeline and the pre-trained text-to-image diffusion model. Our key design is to introduce a novel cycle consistency loss for the training of the HOI detector, which is able to explicitly leverage the knowledge captured in the powerful diffusion model to guide the HOI detector training. Specifically, we build an extra generation task on top of the decoded instance representations from the HOI detector to enforce a detection-generation cycle consistency. Moreover, we perform feature distillation from the diffusion model to the detector encoder to enhance its representation power. In addition, we further utilize the generation power of the diffusion model to augment the training set in both aspects of label correction and sample generation. We perform extensive experiments to verify the effectiveness and generalization power of our CycleHOI with three HOI detection frameworks on two public datasets: HICO-DET and V-COCO. The experimental results demonstrate that our CycleHOI can significantly improve the performance of the state-of-the-art HOI detectors.

Related papers

Detection Transformers Under the Knife: A Neuroscience-Inspired Approach to Ablations [5.5967570276373655]
We systematically analyze the impact of ablating key components in three state-of-the-art detection transformer models. We evaluate the effects of these ablations on the performance metrics gIoU and F1-score. This study advances XAI for DETRs by clarifying the contributions of internal components to model performance.
arXiv Detail & Related papers (2025-07-29T12:00:08Z)
Injecting Explainability and Lightweight Design into Weakly Supervised Video Anomaly Detection Systems [2.0179223501624786]
This paper presents TCVADS (Two-stage Cross-modal Video Anomaly Detection System), which leverages knowledge distillation and cross-modal contrastive learning. Experimental results demonstrate that TCVADS significantly outperforms existing methods in model performance, detection efficiency, and interpretability.
arXiv Detail & Related papers (2024-12-28T16:24:35Z)
CognitionCapturer: Decoding Visual Stimuli From Human EEG Signal With Multimodal Information [61.1904164368732]
We propose CognitionCapturer, a unified framework that fully leverages multimodal data to represent EEG signals. Specifically, CognitionCapturer trains Modality Experts for each modality to extract cross-modal information from the EEG modality. The framework does not require any fine-tuning of the generative models and can be extended to incorporate more modalities.
arXiv Detail & Related papers (2024-12-13T16:27:54Z)
Understanding and Improving Training-Free AI-Generated Image Detections with Vision Foundation Models [68.90917438865078]
Deepfake techniques for facial synthesis and editing pose serious risks for generative models. In this paper, we investigate how detection performance varies across model backbones, types, and datasets. We introduce Contrastive Blur, which enhances performance on facial images, and MINDER, which addresses noise type bias, balancing performance across domains.
arXiv Detail & Related papers (2024-11-28T13:04:45Z)
Binary Code Similarity Detection via Graph Contrastive Learning on Intermediate Representations [52.34030226129628]
Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification. In this paper, we propose IRBinDiff, which mitigates compilation differences by leveraging LLVM-IR with higher-level semantic abstraction. Our extensive experiments, conducted under varied compilation settings, demonstrate that IRBinDiff outperforms other leading BCSD methods in both One-to-one comparison and One-to-many search scenarios.
arXiv Detail & Related papers (2024-10-24T09:09:20Z)
Efficient Meta-Learning Enabled Lightweight Multiscale Few-Shot Object Detection in Remote Sensing Images [15.12889076965307]
The YOLOv7 one-stage detector is subjected to a novel meta-learning training framework. This transformation allows the detector to adeptly address FSOD tasks while capitalizing on its inherent lightweight design. To validate the effectiveness of our proposed detector, we conducted performance comparisons with current state-of-the-art detectors.
arXiv Detail & Related papers (2024-04-29T04:56:52Z)
D$^3$: Scaling Up Deepfake Detection by Learning from Discrepancy [11.239248133240126]
We seek a step toward a universal deepfake detection system with better generalization and robustness. We propose our Discrepancy Deepfake Detector framework, whose core idea is to learn the universal artifacts from multiple generators. Our framework achieves a 5.3% accuracy improvement in the OOD testing compared to the current SOTA methods while maintaining the ID performance.
arXiv Detail & Related papers (2024-04-06T10:45:02Z)
DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception [78.26734070960886]
Current perceptive models heavily depend on resource-intensive datasets. We introduce perception-aware loss (P.A. loss) through segmentation, improving both quality and controllability. Our method customizes data augmentation by extracting and utilizing perception-aware attribute (P.A. Attr) during generation.
arXiv Detail & Related papers (2024-03-20T04:58:03Z)
InstaGen: Enhancing Object Detection by Training on Synthetic Dataset [59.445498550159755]
We present a novel paradigm to enhance the ability of an object detector, e.g., expanding categories or improving detection performance. We integrate an instance-level grounding head into a pre-trained, generative diffusion model, to augment it with the ability of localising instances in the generated images. We conduct thorough experiments to show that this enhanced version of the diffusion model, termed InstaGen, can serve as a data synthesizer.
arXiv Detail & Related papers (2024-02-08T18:59:53Z)
Boosting Human-Object Interaction Detection with Text-to-Image Diffusion Model [22.31860516617302]
We introduce DiffHOI, a novel HOI detection scheme grounded on a pre-trained text-image diffusion model. To fill in the gaps of HOI datasets, we propose SynHOI, a class-balanced, large-scale, and high-diversity synthetic dataset. Experiments demonstrate that DiffHOI significantly outperforms the state-of-the-art in regular detection (i.e., 41.50 mAP) and zero-shot detection.
arXiv Detail & Related papers (2023-05-20T17:59:23Z)
Activation to Saliency: Forming High-Quality Labels for Unsupervised Salient Object Detection [54.92703325989853]
We propose a two-stage Activation-to-Saliency (A2S) framework that effectively generates high-quality saliency cues. No human annotations are involved in our framework during the whole training process. Our framework reports significant performance improvements compared with existing USOD methods.
arXiv Detail & Related papers (2021-12-07T11:54:06Z)
Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction. One of the main challenges in SER is data scarcity. We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.