OV-HHIR: Open Vocabulary Human Interaction Recognition Using Cross-modal Integration of Large Language Models
- URL: http://arxiv.org/abs/2501.00432v1
- Date: Tue, 31 Dec 2024 13:22:00 GMT
- Title: OV-HHIR: Open Vocabulary Human Interaction Recognition Using Cross-modal Integration of Large Language Models
- Authors: Lala Shakti Swarup Ray, Bo Zhou, Sungho Suh, Paul Lukowicz
- Abstract summary: We propose an open vocabulary human-to-human interaction recognition framework.
We generate open-ended textual descriptions of both seen and unseen human interactions in open-world settings.
Our method outperforms traditional fixed-vocabulary classification systems and existing cross-modal language models for video understanding.
- Score: 4.831029473163422
- License:
- Abstract: Understanding human-to-human interactions, especially in contexts like public security surveillance, is critical for monitoring and maintaining safety. Traditional activity recognition systems are limited by fixed vocabularies, predefined labels, and rigid interaction categories that often rely on choreographed videos and overlook concurrent interactive groups. These limitations make such systems less adaptable to real-world scenarios, where interactions are diverse and unpredictable. In this paper, we propose an open vocabulary human-to-human interaction recognition (OV-HHIR) framework that leverages large language models to generate open-ended textual descriptions of both seen and unseen human interactions in open-world settings without being confined to a fixed vocabulary. Additionally, we create a comprehensive, large-scale human-to-human interaction dataset by standardizing and combining existing public human interaction datasets into a unified benchmark. Extensive experiments demonstrate that our method outperforms traditional fixed-vocabulary classification systems and existing cross-modal language models for video understanding, setting the stage for more intelligent and adaptable visual understanding systems in surveillance and beyond.
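The abstract describes the approach only at a high level, so the following is a minimal sketch of the general open-vocabulary idea it builds on: scoring a video against free-form textual descriptions in a shared embedding space rather than against a fixed label set. The embeddings, dimensionality, and candidate descriptions below are placeholders, not the OV-HHIR architecture or its LLM component.

```python
import numpy as np

def rank_descriptions(video_emb: np.ndarray, text_embs: dict) -> list:
    """Rank free-form interaction descriptions by cosine similarity to a video
    embedding. New descriptions can be scored without retraining, which is the
    property that distinguishes open-vocabulary recognition from a fixed-label
    classifier."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return sorted(((cos(video_emb, e), d) for d, e in text_embs.items()), reverse=True)

# Dummy vectors stand in for the outputs of pretrained video and text encoders
# that share a joint embedding space (hypothetical, for illustration only).
rng = np.random.default_rng(0)
video_emb = rng.standard_normal(64)
open_vocab = {
    "two people shaking hands": rng.standard_normal(64),
    "one person handing an object to another": rng.standard_normal(64),
    "a group of people pushing each other": rng.standard_normal(64),
}
for score, description in rank_descriptions(video_emb, open_vocab):
    print(f"{score:+.3f}  {description}")
```

In such a setup, an LLM could in principle supply or expand the candidate descriptions at inference time, which is the kind of adaptability the abstract emphasizes.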
Related papers
- Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding [85.63710017456792]
FuSe is a novel approach that enables finetuning visuomotor generalist policies on heterogeneous sensor modalities.
We show that FuSe enables performing challenging tasks that require reasoning jointly over modalities such as vision, touch, and sound.
Experiments in the real world show that FuSe is able to increase success rates by over 20% compared to all considered baselines.
arXiv Detail & Related papers (2025-01-08T18:57:33Z) - Hierarchical Banzhaf Interaction for General Video-Language Representation Learning [60.44337740854767]
Multimodal representation learning plays an important role in the artificial intelligence domain.
We introduce a new approach that models video and text as game players using multivariate cooperative game theory; a toy illustration of the underlying Banzhaf value appears after this list.
We extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks.
arXiv Detail & Related papers (2024-12-30T14:09:15Z) - Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection [37.57355457749918]
We introduce a novel framework for zero-shot HOI detection using Conditional Multi-Modal Prompts, namely CMMP.
Unlike traditional prompt-learning methods, we propose learning decoupled vision and language prompts for interactiveness-aware visual feature extraction.
Experiments demonstrate the efficacy of our detector with conditional multi-modal prompts, outperforming the previous state of the art on unseen classes under various zero-shot settings.
arXiv Detail & Related papers (2024-08-05T14:05:25Z) - Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection.
We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z) - Conversation Understanding using Relational Temporal Graph Neural Networks with Auxiliary Cross-Modality Interaction [2.1261712640167856]
Emotion recognition is a crucial task for human conversation understanding.
We propose a Relational Temporal Graph Neural Network with Auxiliary Cross-Modality Interaction (CORECT).
CORECT effectively captures conversation-level cross-modality interactions and utterance-level temporal dependencies.
arXiv Detail & Related papers (2023-11-08T07:46:25Z) - Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models [55.20626448358655]
This study explores universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs).
Our design includes an HO Prompt-guided Decoder (HOPD), which facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image.
For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence.
arXiv Detail & Related papers (2023-11-07T08:27:32Z) - On the Linguistic and Computational Requirements for Creating Face-to-Face Multimodal Human-Machine Interaction [0.0]
We videorecorded thirty-four human-avatar interactions, performed complete linguistic microanalysis on video excerpts, and marked all the occurrences of multimodal actions and events.
The data show evidence that double-loop feedback is established during a face-to-face conversation.
We propose that knowledge from Conversation Analysis (CA), cognitive science, and Theory of Mind (ToM), among others, should be incorporated into the frameworks used for describing human-machine multimodal interactions.
arXiv Detail & Related papers (2022-11-24T21:17:36Z) - Co-Located Human-Human Interaction Analysis using Nonverbal Cues: A Survey [71.43956423427397]
We aim to identify the nonverbal cues and computational methodologies resulting in effective performance.
This survey differs from its counterparts by involving the widest spectrum of social phenomena and interaction settings.
Some major observations are: the most often used nonverbal cue, computational method, interaction environment, and sensing approach are, respectively, speaking activity, support vector machines, meetings of three to four persons, and microphones and cameras.
arXiv Detail & Related papers (2022-07-20T13:37:57Z) - Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review [40.49926141538684]
Visual Context Augmented Dialogue System (VAD) has the potential to communicate with humans by perceiving and understanding multimodal information.
VAD possesses the potential to generate engaging and context-aware responses.
arXiv Detail & Related papers (2022-07-02T09:31:37Z) - Learning Adaptive Language Interfaces through Decomposition [89.21937539950966]
We introduce a neural semantic parsing system that learns new high-level abstractions through decomposition.
Users interactively teach the system by breaking down high-level utterances describing novel behavior into low-level steps.
arXiv Detail & Related papers (2020-10-11T08:27:07Z)
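For the Hierarchical Banzhaf Interaction entry above, the underlying game-theoretic quantity is the Banzhaf value: a player's average marginal contribution over all coalitions of the remaining players. The sketch below computes it for a toy three-player game with made-up payoffs; it only illustrates the concept and is not that paper's video-text modeling.

```python
from itertools import combinations

def banzhaf_values(players, v):
    """Banzhaf value of each player: the average marginal contribution of the
    player over all 2^(n-1) coalitions of the remaining players."""
    n = len(players)
    values = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(len(others) + 1):
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += v(s | {i}) - v(s)
        values[i] = total / 2 ** (n - 1)
    return values

# Toy cooperative game: payoffs for coalitions of two video "players" and one
# text "player" (hypothetical numbers, chosen only to make the example run).
payoff = {
    frozenset(): 0.0,
    frozenset({"frame_a"}): 0.1,
    frozenset({"frame_b"}): 0.2,
    frozenset({"caption"}): 0.1,
    frozenset({"frame_a", "frame_b"}): 0.4,
    frozenset({"frame_a", "caption"}): 0.6,
    frozenset({"frame_b", "caption"}): 0.7,
    frozenset({"frame_a", "frame_b", "caption"}): 1.0,
}
print(banzhaf_values(["frame_a", "frame_b", "caption"], lambda s: payoff[s]))
```

The hierarchical variant described in that paper applies related interaction indices to video-text representation learning; the toy game above shows only the basic quantity.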