First Place Solution to the CVPR'2023 AQTC Challenge: A
Function-Interaction Centric Approach with Spatiotemporal Visual-Language
Alignment
- URL: http://arxiv.org/abs/2306.13380v1
- Date: Fri, 23 Jun 2023 09:02:25 GMT
- Title: First Place Solution to the CVPR'2023 AQTC Challenge: A
Function-Interaction Centric Approach with Spatiotemporal Visual-Language
Alignment
- Authors: Tom Tongjia Chen, Hongshan Yu, Zhengeng Yang, Ming Li, Zechuan Li,
Jingwen Wang, Wei Miao, Wei Sun, Chen Chen
- Abstract summary: Affordance-Centric Question-driven Task Completion (AQTC) has been proposed to acquire knowledge from videos and furnish users with comprehensive and systematic instructions.
Existing methods have neglected the necessity of aligning visual and linguistic signals, as well as the crucial interactional information between humans and objects.
We propose to combine large-scale pre-trained vision-language and video-language models, which serve to contribute stable and reliable multimodal data.
- Score: 15.99008977852437
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Affordance-Centric Question-driven Task Completion (AQTC) has been proposed
to acquire knowledge from videos to furnish users with comprehensive and
systematic instructions. However, existing methods have hitherto neglected the
necessity of aligning spatiotemporal visual and linguistic signals, as well as
the crucial interactional information between humans and objects. To tackle
these limitations, we propose to combine large-scale pre-trained
vision-language and video-language models, which serve to contribute stable and
reliable multimodal data and facilitate effective spatiotemporal visual-textual
alignment. Additionally, a novel hand-object-interaction (HOI) aggregation
module is proposed which aids in capturing human-object interaction
information, thereby further augmenting the capacity to understand the
presented scenario. Our method achieved first place in the CVPR'2023 AQTC
Challenge, with a Recall@1 score of 78.7%. The code is available at
https://github.com/tomchen-ctj/CVPR23-LOVEU-AQTC.
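As a rough illustration of the pipeline the abstract describes, the sketch below fuses per-frame features from a pre-trained video/vision-language encoder with hand-object-interaction (HOI) cues and pools them under a text query. It is a minimal, hypothetical sketch: the module name, dimensions, and the text-conditioned attention pooling are assumptions for illustration, not the authors' released implementation (see the linked repository for that).

```python
# Hypothetical HOI aggregation sketch; shapes and pooling are illustrative assumptions.
import torch
import torch.nn as nn


class HOIAggregator(nn.Module):
    """Fuses per-frame HOI cues with frame features and pools them with a text query."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.hoi_proj = nn.Linear(dim, dim)  # map HOI cues into the frame-feature space
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats, hoi_feats, text_feat):
        # frame_feats: (B, T, D) frame embeddings from a pre-trained video-language model
        # hoi_feats:   (B, T, D) per-frame hand-object interaction features
        # text_feat:   (B, D)    embedding of the question / instruction step
        fused = self.norm(frame_feats + self.hoi_proj(hoi_feats))  # (B, T, D)
        query = text_feat.unsqueeze(1)                             # (B, 1, D)
        pooled, _ = self.attn(query, fused, fused)                 # text-conditioned pooling
        return pooled.squeeze(1)                                   # (B, D) clip-level feature


if __name__ == "__main__":
    B, T, D = 2, 16, 512
    agg = HOIAggregator(D)
    out = agg(torch.randn(B, T, D), torch.randn(B, T, D), torch.randn(B, D))
    print(out.shape)  # torch.Size([2, 512])
```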
Related papers
- Ask, Pose, Unite: Scaling Data Acquisition for Close Interactions with Vision Language Models [5.541130887628606]
Social dynamics in close human interactions pose significant challenges for Human Mesh Estimation (HME).
We introduce a novel data generation method that utilizes Large Vision Language Models (LVLMs) to annotate contact maps which guide test-time optimization to produce paired image and pseudo-ground truth meshes.
This methodology not only alleviates the annotation burden but also enables the assembly of a comprehensive dataset specifically tailored for close interactions in HME.
arXiv Detail & Related papers (2024-10-01T01:14:24Z) - Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z) - A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap [50.079224604394]
We present a novel model-agnostic framework called Context-Enhanced Feature Alignment (CEFA).
CEFA consists of a feature alignment module and a context enhancement module.
Our method can serve as a plug-and-play module to improve the detection performance of HOI models on rare categories.
arXiv Detail & Related papers (2024-07-31T08:42:48Z) - Towards Zero-shot Human-Object Interaction Detection via Vision-Language
Integration [14.678931157058363]
We propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that effectively integrates the knowledge of a visual-language model to improve zero-shot HOI detection.
We develop an effective additive self-attention mechanism to generate more comprehensive visual representations.
Our model outperforms previous methods in various zero-shot and fully-supervised settings.
arXiv Detail & Related papers (2024-03-12T02:07:23Z) - Disentangled Interaction Representation for One-Stage Human-Object
Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z) - Detecting Any Human-Object Interaction Relationship: Universal HOI
Detector with Spatial Prompt Learning on Foundation Models [55.20626448358655]
This study explores universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs).
Our design includes an HO Prompt-guided Decoder (HOPD), which facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image.
For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence.
arXiv Detail & Related papers (2023-11-07T08:27:32Z) - Mining Conditional Part Semantics with Occluded Extrapolation for
Human-Object Interaction Detection [16.9278983497498]
Human-Object Interaction Detection is a crucial aspect of human-centric scene understanding.
Existing methods try to use human-related clues to alleviate the difficulty, but rely heavily on external annotations or knowledge.
We propose a novel Part Semantic Network (PSN) to solve this problem.
arXiv Detail & Related papers (2023-07-19T23:55:15Z) - Weakly-Supervised HOI Detection from Interaction Labels Only and
Language/Vision-Language Priors [36.75629570208193]
Human-object interaction (HOI) detection aims to extract interacting human-object pairs and their interaction categories from a given natural image.
In this paper, we tackle HOI detection with the weakest supervision setting in the literature, using only image-level interaction labels.
We first propose an approach to prune non-interacting human and object proposals to increase the quality of positive pairs within the bag, exploiting the grounding capability of the vision-language model.
Second, we use a large language model to query which interactions are possible between a human and a given object category, in order to force the model not to put emphasis on unlikely interactions.
arXiv Detail & Related papers (2023-03-09T19:08:02Z) - Weakly Supervised Human-Object Interaction Detection in Video via
Contrastive Spatiotemporal Regions [81.88294320397826]
A system does not know what human-object interactions are present in a video, nor the actual location of the human and the object.
We introduce a dataset comprising over 6.5k videos with human-object interactions, curated from sentence captions.
We demonstrate improved performance over weakly supervised baselines adapted to our annotations on our video dataset.
arXiv Detail & Related papers (2021-10-07T15:30:18Z) - Spatio-Temporal Interaction Graph Parsing Networks for Human-Object
Interaction Recognition [55.7731053128204]
In a given video-based Human-Object Interaction scene, modeling the spatio-temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video.
With effective spatio-temporal relationship modeling, it is possible not only to uncover the contextual information in each frame but also to directly capture inter-time dependencies.
Making full use of appearance features, spatial locations, and semantic information is also key to improving video-based Human-Object Interaction recognition performance.
arXiv Detail & Related papers (2021-08-19T11:57:27Z) - Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
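The cascaded, coarse-to-fine formulation in the last entry can be pictured as a loop that alternately refines proposal features and scores interactions. Below is a minimal, hypothetical PyTorch sketch of that loop; the module names, residual refinement, and feature shapes are illustrative assumptions, not the paper's actual architecture.

```python
# Schematic, hypothetical rendering of a multi-stage "refine then recognize" cascade.
import torch
import torch.nn as nn


class CascadedHOI(nn.Module):
    def __init__(self, dim: int = 256, num_stages: int = 3, num_classes: int = 100):
        super().__init__()
        # One proposal-refinement head (localization role) and one interaction
        # classifier (recognition role) per stage.
        self.refiners = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
             for _ in range(num_stages)]
        )
        self.recognizers = nn.ModuleList(
            [nn.Linear(dim, num_classes) for _ in range(num_stages)]
        )

    def forward(self, pair_feats):
        # pair_feats: (B, N, D) features of N candidate human-object pairs
        stage_logits = []
        for refine, recognize in zip(self.refiners, self.recognizers):
            pair_feats = pair_feats + refine(pair_feats)   # coarse-to-fine refinement
            stage_logits.append(recognize(pair_feats))     # (B, N, num_classes) per stage
        return stage_logits


if __name__ == "__main__":
    model = CascadedHOI()
    logits = model(torch.randn(2, 10, 256))
    print(len(logits), logits[-1].shape)  # 3 torch.Size([2, 10, 100])
```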
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.