Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection
- URL: http://arxiv.org/abs/2507.06510v1
- Date: Wed, 09 Jul 2025 03:16:39 GMT
- Title: Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection
- Authors: Yupeng Hu, Changxing Ding, Chang Sun, Shaoli Huang, Xiangmin Xu,
- Abstract summary: Open vocabulary Human-Object Interaction (HOI) detection is a challenging task that detects all human, verb, object> triplets of interest in an image.<n>Existing approaches typically rely on output features generated by large Vision-Language Models (VLMs)<n>We propose a novel Bilateral Collaboration framework for open vocabulary HOI detection (BC-HOI)
- Score: 29.24483392547041
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open vocabulary Human-Object Interaction (HOI) detection is a challenging task that detects all <human, verb, object> triplets of interest in an image, even those that are not pre-defined in the training set. Existing approaches typically rely on output features generated by large Vision-Language Models (VLMs) to enhance the generalization ability of interaction representations. However, the visual features produced by VLMs are holistic and coarse-grained, which contradicts the nature of detection tasks. To address this issue, we propose a novel Bilateral Collaboration framework for open vocabulary HOI detection (BC-HOI). This framework includes an Attention Bias Guidance (ABG) component, which guides the VLM to produce fine-grained instance-level interaction features according to the attention bias provided by the HOI detector. It also includes a Large Language Model (LLM)-based Supervision Guidance (LSG) component, which provides fine-grained token-level supervision for the HOI detector by the LLM component of the VLM. LSG enhances the ability of ABG to generate high-quality attention bias. We conduct extensive experiments on two popular benchmarks: HICO-DET and V-COCO, consistently achieving superior performance in the open vocabulary and closed settings. The code will be released in Github.
Related papers
- Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining [59.2578488860426]
Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors.<n>Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning.<n>We propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning.
arXiv Detail & Related papers (2026-03-02T11:38:12Z) - Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition [71.5328300638085]
Zero-shot Human-object interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions.<n>Existing methods, including two-stage methods, tightly couple interaction recognition with a specific detector.<n>We propose a decoupled framework that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR.
arXiv Detail & Related papers (2026-02-16T19:01:31Z) - Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs [85.69785384599827]
Human-object interaction (HOI) detection aims to localize human-object pairs and the interactions between them.<n>Existing methods operate under a closed-world assumption, treating the task as a classification problem over a small, predefined verb set.<n>We propose GRASP-HO, a novel Generative Reasoning And Steerable Perception framework that reformulates HOI detection from the closed-set classification task to the open-vocabulary generation problem.
arXiv Detail & Related papers (2025-12-19T14:41:50Z) - Open-Vocabulary HOI Detection with Interaction-aware Prompt and Concept Calibration [42.24582981160835]
Open Human-Object Interaction (HOI) detection aims to detect interactions between humans and objects.<n>Current methods often rely on Vision and Language Models (VLMs) but face challenges due to suboptimal image encoders.<n>We propose INteraction-aware Prompting with Concept (INP-CC), an end-to-end open-vocabulary HOI detector.
arXiv Detail & Related papers (2025-08-05T08:33:58Z) - HiLa: Hierarchical Vision-Language Collaboration for Cancer Survival Prediction [55.00788339683146]
We propose a novel Hierarchical vision-Language collaboration framework for improved survival prediction.<n> Specifically, HiLa employs pretrained feature extractors to generate hierarchical visual features from WSIs at both patch and region levels.<n>This ap-proach enables the comprehensive learning of discriminative visual features cor-responding to different survival-related attributes from prompts.
arXiv Detail & Related papers (2025-07-07T02:06:25Z) - Visual-Linguistic Agent: Towards Collaborative Contextual Object Reasoning [26.35257570870916]
Visual-Linguistic Agent (VLA) is a collaborative framework that combines the relational reasoning strengths of MLLMs with the precise localization capabilities of traditional object detectors.
VLA significantly enhances both spatial reasoning and object localization, addressing key challenges in multimodal understanding.
arXiv Detail & Related papers (2024-11-15T15:02:06Z) - Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection [37.57355457749918]
We introduce a novel framework for zero-shot HOI detection using Conditional Multi-Modal Prompts, namely CMMP.
Unlike traditional prompt-learning methods, we propose learning decoupled vision and language prompts for interactiveness-aware visual feature extraction.
Experiments demonstrate the efficacy of our detector with conditional multi-modal prompts, outperforming previous state-of-the-art on unseen classes of various zero-shot settings.
arXiv Detail & Related papers (2024-08-05T14:05:25Z) - MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection [107.15164718585666]
We investigate the root cause of VLMs' biased prediction under the open vocabulary detection context.
Our observations lead to a simple yet effective paradigm, coded MarvelOVD, that generates significantly better training targets.
Our method outperforms the other state-of-the-arts by significant margins.
arXiv Detail & Related papers (2024-07-31T09:23:57Z) - Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection [9.788417605537965]
We introduce a novel end-to-end open vocabulary HOI detection framework with conditional multi-level decoding and fine-grained semantic enhancement.
Our proposed method achieves state-of-the-art results in open vocabulary HOI detection.
arXiv Detail & Related papers (2024-04-09T10:27:22Z) - LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained
Descriptors [58.75140338866403]
DVDet is a Descriptor-Enhanced Open Vocabulary Detector.
It transforms regional embeddings into image-like representations that can be directly integrated into general open vocabulary detection training.
Extensive experiments over multiple large-scale benchmarks show that DVDet outperforms the state-of-the-art consistently by large margins.
arXiv Detail & Related papers (2024-02-07T07:26:49Z) - Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection.
We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z) - Detecting Any Human-Object Interaction Relationship: Universal HOI
Detector with Spatial Prompt Learning on Foundation Models [55.20626448358655]
This study explores the universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs)
Our design includes an HO Prompt-guided Decoder (HOPD), facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image.
For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence.
arXiv Detail & Related papers (2023-11-07T08:27:32Z) - Visual Compositional Learning for Human-Object Interaction Detection [111.05263071111807]
Human-Object interaction (HOI) detection aims to localize and infer relationships between human and objects in an image.
It is challenging because an enormous number of possible combinations of objects and verbs types forms a long-tail distribution.
We devise a deep Visual Compositional Learning framework, which is a simple yet efficient framework to effectively address this problem.
arXiv Detail & Related papers (2020-07-24T08:37:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.