SGC-Net: Stratified Granular Comparison Network for Open-Vocabulary HOI Detection
- URL: http://arxiv.org/abs/2503.00414v1
- Date: Sat, 01 Mar 2025 09:26:05 GMT
- Title: SGC-Net: Stratified Granular Comparison Network for Open-Vocabulary HOI Detection
- Authors: Xin Lin, Chong Shi, Zuopeng Yang, Haojin Tang, Zhili Zhou,
- Abstract summary: Recent open-vocabulary human-object interaction (OV-HOI) detection methods rely on large language model (LLM) for generating auxiliary descriptions and leverage knowledge distilled from CLIP to detect unseen interaction categories.<n>Despite their effectiveness, these methods face two challenges: (1) feature granularity deficiency, due to reliance on last layer visual features for text alignment, leading to the neglect of crucial object-level details from intermediate layers; (2) semantic similarity confusion, resulting from CLIP's inherent biases toward certain classes, while LLM-generated descriptions based solely on labels fail to adequately capture inter-class similarities.
- Score: 16.89965584177711
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent open-vocabulary human-object interaction (OV-HOI) detection methods primarily rely on large language model (LLM) for generating auxiliary descriptions and leverage knowledge distilled from CLIP to detect unseen interaction categories. Despite their effectiveness, these methods face two challenges: (1) feature granularity deficiency, due to reliance on last layer visual features for text alignment, leading to the neglect of crucial object-level details from intermediate layers; (2) semantic similarity confusion, resulting from CLIP's inherent biases toward certain classes, while LLM-generated descriptions based solely on labels fail to adequately capture inter-class similarities. To address these challenges, we propose a stratified granular comparison network. First, we introduce a granularity sensing alignment module that aggregates global semantic features with local details, refining interaction representations and ensuring robust alignment between intermediate visual features and text embeddings. Second, we develop a hierarchical group comparison module that recursively compares and groups classes using LLMs, generating fine-grained and discriminative descriptions for each interaction category. Experimental results on two widely-used benchmark datasets, SWIG-HOI and HICO-DET, demonstrate that our method achieves state-of-the-art results in OV-HOI detection. Codes will be released on https://github.com/Phil0212/SGC-Net.
Related papers
- Bidirectional Logits Tree: Pursuing Granularity Reconcilement in Fine-Grained Classification [89.20477310885731]
This paper addresses the challenge of Granularity Competition in fine-grained classification tasks.<n>Existing approaches typically develop independent hierarchy-aware models based on shared features extracted from a common base encoder.<n>We propose a novel framework called the Bidirectional Logits Tree (BiLT) for Granularity Reconcilement.
arXiv Detail & Related papers (2024-12-17T10:42:19Z) - Collaborative Feature-Logits Contrastive Learning for Open-Set Semi-Supervised Object Detection [75.02249869573994]
In open-set scenarios, the unlabeled dataset contains both in-distribution (ID) classes and out-of-distribution (OOD) classes.<n>Applying semi-supervised detectors in such settings can lead to misclassifying OOD class as ID classes.<n>We propose a simple yet effective method, termed Collaborative Feature-Logits Detector (CFL-Detector)
arXiv Detail & Related papers (2024-11-20T02:57:35Z) - DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation [8.422110274212503]
Weakly supervised semantic segmentation approaches typically rely on class activation maps (CAMs) for initial seed generation.
We introduce DALNet, which leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity.
Our approach, in particular, allows for more efficient end-to-end process as a single-stage method.
arXiv Detail & Related papers (2024-09-24T06:51:49Z) - RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z) - Auxiliary Tasks Enhanced Dual-affinity Learning for Weakly Supervised
Semantic Segmentation [79.05949524349005]
We propose AuxSegNet+, a weakly supervised auxiliary learning framework to explore the rich information from saliency maps.
We also propose a cross-task affinity learning mechanism to learn pixel-level affinities from the saliency and segmentation feature maps.
arXiv Detail & Related papers (2024-03-02T10:03:21Z) - TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary
Multi-Label Classification of CLIP Without Training [29.431698321195814]
Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification.
CLIP shows poor performance on multi-label datasets because the global feature tends to be dominated by the most prominent class.
We propose a local-to-global framework to obtain image tags.
arXiv Detail & Related papers (2023-12-20T08:15:40Z) - GEN-VLKT: Simplify Association and Enhance Interaction Understanding for
HOI Detection [17.92210977820113]
We propose Guided-Embedding Network(GEN) to attain a two-branch pipeline without post-matching.
For the association, previous two-branch methods suffer from complex and costly post-matching.
For the interaction understanding, previous methods suffer from long-tailed distribution and zero-shot discovery.
arXiv Detail & Related papers (2022-03-26T01:04:13Z) - The Overlooked Classifier in Human-Object Interaction Recognition [82.20671129356037]
We encode the semantic correlation among classes into the classification head by initializing the weights with language embeddings of HOIs.
We propose a new loss named LSE-Sign to enhance multi-label learning on a long-tailed dataset.
Our simple yet effective method enables detection-free HOI classification, outperforming the state-of-the-arts that require object detection and human pose by a clear margin.
arXiv Detail & Related papers (2022-03-10T23:35:00Z) - Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of SSL and DA.
arXiv Detail & Related papers (2021-12-12T06:11:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.