Related papers: Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection

URL: http://arxiv.org/abs/2507.06510v1
Date: Wed, 09 Jul 2025 03:16:39 GMT
Title: Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection
Authors: Yupeng Hu, Changxing Ding, Chang Sun, Shaoli Huang, Xiangmin Xu
Abstract summary: Open vocabulary Human-Object Interaction (HOI) detection is a challenging task that detects all <human, verb, object> triplets of interest in an image. Existing approaches typically rely on output features generated by large Vision-Language Models (VLMs). We propose a novel Bilateral Collaboration framework for open vocabulary HOI detection (BC-HOI).
Score: 29.24483392547041
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Open vocabulary Human-Object Interaction (HOI) detection is a challenging task that detects all <human, verb, object> triplets of interest in an image, even those that are not pre-defined in the training set. Existing approaches typically rely on output features generated by large Vision-Language Models (VLMs) to enhance the generalization ability of interaction representations. However, the visual features produced by VLMs are holistic and coarse-grained, which contradicts the nature of detection tasks. To address this issue, we propose a novel Bilateral Collaboration framework for open vocabulary HOI detection (BC-HOI). This framework includes an Attention Bias Guidance (ABG) component, which guides the VLM to produce fine-grained instance-level interaction features according to the attention bias provided by the HOI detector. It also includes a Large Language Model (LLM)-based Supervision Guidance (LSG) component, which provides fine-grained token-level supervision for the HOI detector by the LLM component of the VLM. LSG enhances the ability of ABG to generate high-quality attention bias. We conduct extensive experiments on two popular benchmarks: HICO-DET and V-COCO, consistently achieving superior performance in the open vocabulary and closed settings. The code will be released on GitHub.

Related papers

Open-Vocabulary HOI Detection with Interaction-aware Prompt and Concept Calibration [42.24582981160835]
Open Human-Object Interaction (HOI) detection aims to detect interactions between humans and objects. Current methods often rely on Vision and Language Models (VLMs) but face challenges due to suboptimal image encoders. We propose INteraction-aware Prompting with Concept (INP-CC), an end-to-end open-vocabulary HOI detector.
arXiv Detail & Related papers (2025-08-05T08:33:58Z)
HiLa: Hierarchical Vision-Language Collaboration for Cancer Survival Prediction [55.00788339683146]
We propose a novel Hierarchical vision-Language collaboration framework for improved survival prediction. Specifically, HiLa employs pretrained feature extractors to generate hierarchical visual features from WSIs at both patch and region levels. This approach enables the comprehensive learning of discriminative visual features corresponding to different survival-related attributes from prompts.
arXiv Detail & Related papers (2025-07-07T02:06:25Z)
Visual-Linguistic Agent: Towards Collaborative Contextual Object Reasoning [26.35257570870916]
Visual-Linguistic Agent (VLA) is a collaborative framework that combines the relational reasoning strengths of MLLMs with the precise localization capabilities of traditional object detectors. VLA significantly enhances both spatial reasoning and object localization, addressing key challenges in multimodal understanding.
arXiv Detail & Related papers (2024-11-15T15:02:06Z)
Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection [37.57355457749918]
We introduce a novel framework for zero-shot HOI detection using Conditional Multi-Modal Prompts, namely CMMP. Unlike traditional prompt-learning methods, we propose learning decoupled vision and language prompts for interactiveness-aware visual feature extraction. Experiments demonstrate the efficacy of our detector with conditional multi-modal prompts, outperforming previous state-of-the-art on unseen classes of various zero-shot settings.
arXiv Detail & Related papers (2024-08-05T14:05:25Z)
MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection [107.15164718585666]
We investigate the root cause of VLMs' biased prediction under the open vocabulary detection context. Our observations lead to a simple yet effective paradigm, coded MarvelOVD, that generates significantly better training targets. Our method outperforms other state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2024-07-31T09:23:57Z)
Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection [9.788417605537965]
We introduce a novel end-to-end open vocabulary HOI detection framework with conditional multi-level decoding and fine-grained semantic enhancement. Our proposed method achieves state-of-the-art results in open vocabulary HOI detection.
arXiv Detail & Related papers (2024-04-09T10:27:22Z)
LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors [58.75140338866403]
DVDet is a Descriptor-Enhanced Open Vocabulary Detector. It transforms regional embeddings into image-like representations that can be directly integrated into general open vocabulary detection training. Extensive experiments over multiple large-scale benchmarks show that DVDet outperforms the state-of-the-art consistently by large margins.
arXiv Detail & Related papers (2024-02-07T07:26:49Z)
Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection. We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z)
Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models [55.20626448358655]
This study explores universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs). Our design includes an HO Prompt-guided Decoder (HOPD), which facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image. For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence.
arXiv Detail & Related papers (2023-11-07T08:27:32Z)
Visual Compositional Learning for Human-Object Interaction Detection [111.05263071111807]
Human-Object Interaction (HOI) detection aims to localize and infer relationships between humans and objects in an image. It is challenging because the enormous number of possible combinations of object and verb types forms a long-tail distribution. We devise a deep Visual Compositional Learning framework, which is a simple yet efficient framework to effectively address this problem.
arXiv Detail & Related papers (2020-07-24T08:37:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.