HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection
- URL: http://arxiv.org/abs/2510.05609v1
- Date: Tue, 07 Oct 2025 06:16:02 GMT
- Title: HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection
- Authors: Junwen Chen, Peilin Xiong, Keiji Yanai
- Abstract summary: We introduce an HOI reasoning process and HOID reward functions to solve the HOID task by pure text. Results on the HICO-DET dataset show that HOI-R1 achieves 2x the accuracy of the baseline with great generalization ability.
- Score: 6.608035306614831
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent human-object interaction detection (HOID) methods rely heavily on prior knowledge from VLMs to enhance their interaction recognition capabilities. The training strategies and model architectures needed to connect the knowledge from VLMs to the HOI instance representations from the object detector are challenging to design, and the resulting framework is complex to develop further or to apply. On the other hand, the inherent reasoning abilities of MLLMs on human-object interaction detection are under-explored. Inspired by the recent success of training MLLMs with reinforcement learning (RL) methods, we propose HOI-R1 and first explore the potential of the language model on the HOID task without any additional detection modules. We introduce an HOI reasoning process and HOID reward functions to solve the HOID task by pure text. The results on the HICO-DET dataset show that HOI-R1 achieves 2x the accuracy of the baseline with great generalization ability. The source code is available at https://github.com/cjw2021/HOI-R1.
Related papers
- Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition [71.5328300638085]
Zero-shot human-object interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions. Existing methods, including two-stage methods, tightly couple interaction recognition with a specific detector. We propose a decoupled framework that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR.
arXiv Detail & Related papers (2026-02-16T19:01:31Z)
- Code-in-the-Loop Forensics: Agentic Tool Use for Image Forgery Detection [59.04089915447622]
ForenAgent is an interactive IFD framework that enables MLLMs to autonomously generate, execute, and refine Python-based low-level tools around the detection objective. Inspired by human reasoning, we design a dynamic reasoning loop comprising global perception, local focusing, iterative probing, and holistic adjudication. Experiments show that ForenAgent exhibits emergent tool-use competence and reflective reasoning on challenging IFD tasks.
arXiv Detail & Related papers (2025-12-18T08:38:44Z)
- Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models [75.45940282834327]
We introduce Viewpoint Learning, a task designed to evaluate and improve the spatial reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset, consisting of 100K object-centric image pairs with diverse viewpoints and corresponding question-answer pairs. Our approach employs a two-stage fine-tuning strategy, resulting in significant improvements across multiple tasks.
arXiv Detail & Related papers (2025-11-03T14:27:00Z)
- HOID-R1: Reinforcement Learning for Open-World Human-Object Interaction Detection Reasoning with Multimodal Large Language Model [13.82578761807402]
We introduce HOID-R1, the first HOI detection framework that integrates chain-of-thought (CoT) guided fine-tuning with group relative policy optimization. To mitigate hallucinations in the CoT reasoning, we introduce an "MLLM-as-a-judge" mechanism that supervises the CoT outputs. Experiments show that HOID-R1 achieves state-of-the-art performance on HOI detection benchmarks and outperforms existing methods in open-world generalization to novel scenarios.
arXiv Detail & Related papers (2025-08-15T09:28:57Z)
- Agentic Episodic Control [16.94652073521156]
Reinforcement learning (RL) has driven breakthroughs in AI, from game-play to scientific discovery and AI alignment. Recent advances suggest that large language models, with their rich world knowledge and reasoning capabilities, could complement RL by enabling semantic state modeling and task-agnostic planning. We propose the Agentic Episodic Control (AEC), a novel architecture that integrates RL with large language models to enhance decision-making.
arXiv Detail & Related papers (2025-06-02T08:57:37Z)
- DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding [61.26026947423187]
Human experts excel at fine-grained visual discrimination by leveraging domain knowledge to refine perceptual features. Current Multimodal Large Language Models (MLLMs) struggle to integrate reasoning into visual perception. We propose DeepPerception, an MLLM enhanced with cognitive visual perception capabilities.
arXiv Detail & Related papers (2025-03-17T04:06:34Z)
- O1 Embedder: Let Retrievers Think Before Action [28.583031173137428]
We propose O1 Embedder, which generates useful thoughts for the input query before making retrieval for the target documents. Our approach is evaluated by comprehensive experiments, where substantial improvements are achieved across 12 popular datasets. These results highlight O1 Embedder's remarkable accuracy and generalizability, paving the way for the development of next-generation IR foundation models.
arXiv Detail & Related papers (2025-02-11T13:48:10Z)
- Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective [77.94874338927492]
OpenAI has claimed that the main technique behind o1 is reinforcement learning. This paper analyzes the roadmap to achieving o1 from the perspective of reinforcement learning.
arXiv Detail & Related papers (2024-12-18T18:24:47Z)
- Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models [55.20626448358655]
This study explores universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs).
Our design includes an HO Prompt-guided Decoder (HOPD), which facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image.
For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence.
arXiv Detail & Related papers (2023-11-07T08:27:32Z)
- Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory [64.11870454160614]
We propose an efficient Adaptive HOI Detector with Concept-guided Memory (ADA-CM).
ADA-CM has two operating modes. The first mode makes it tunable without learning new parameters in a training-free paradigm.
Our proposed method achieves competitive results with state-of-the-art on the HICO-DET and V-COCO datasets with much less training time.
arXiv Detail & Related papers (2023-09-07T13:10:06Z)
- Accelerating exploration and representation learning with offline pre-training [52.6912479800592]
We show that exploration and representation learning can be improved by separately learning two different models from a single offline dataset.
We show that learning a state representation using noise-contrastive estimation and a model of auxiliary reward can significantly improve the sample efficiency on the challenging NetHack benchmark.
arXiv Detail & Related papers (2023-03-31T18:03:30Z)