HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics
- URL: http://arxiv.org/abs/2512.00885v1
- Date: Sun, 30 Nov 2025 13:15:02 GMT
- Title: HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics
- Authors: Masatoshi Tateno, Gido Kato, Hirokatsu Kataoka, Yoichi Sato, Takuma Yagi
- Abstract summary: We introduce HanDyVQA, a fine-grained video question-answering benchmark that comprehensively covers both the manipulation and effect aspects of HOI. HanDyVQA comprises six complementary question types (Action, Process, Objects, Location, State Change, and Object Parts), totalling 11.1K multiple-choice QA pairs. We evaluated recent video foundation models on our benchmark and found that even the best-performing model, Gemini-2.5-Pro, reached only 73% average accuracy.
- Score: 29.120898351831443
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hand-object interaction (HOI) inherently involves dynamics where human manipulations produce distinct spatio-temporal effects on objects. However, existing semantic HOI benchmarks focused either on manipulation or on the resulting effects at a coarse level, lacking fine-grained spatio-temporal reasoning to capture the underlying dynamics in HOI. We introduce HanDyVQA, a fine-grained video question-answering benchmark that comprehensively covers both the manipulation and effect aspects of HOI. HanDyVQA comprises six complementary question types (Action, Process, Objects, Location, State Change, and Object Parts), totalling 11.1K multiple-choice QA pairs. Collected QA pairs require recognizing manipulation styles, hand/object motions, and part-level state changes. HanDyVQA also includes 10.3K segmentation masks for Objects and Object Parts questions, enabling the evaluation of object/part-level reasoning in video object segmentation. We evaluated recent video foundation models on our benchmark and found that even the best-performing model, Gemini-2.5-Pro, reached only 73% average accuracy, which is far from human performance (97%). Further analysis shows the remaining challenges in spatial relationship, motion, and part-level geometric understanding. We also found that integrating explicit HOI-related cues into visual features improves performance, offering insights for developing future models with a deeper understanding of HOI dynamics.
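As a rough illustration of how a per-question-type and average accuracy could be computed for a multiple-choice benchmark like this one, here is a minimal Python sketch. The record format `(question_type, predicted_choice, correct_choice)` and the function names are hypothetical, and the abstract does not say whether the reported average is taken over question types or over all QA pairs; the sketch assumes an unweighted mean over question types.

```python
from collections import defaultdict

# The six question types named in the abstract.
QUESTION_TYPES = [
    "Action", "Process", "Objects", "Location", "State Change", "Object Parts",
]


def accuracy_by_type(records):
    """Return {question_type: accuracy} for the types present in `records`.

    `records` is an iterable of (question_type, predicted_choice, correct_choice)
    tuples; this layout is an assumption made for illustration only.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qtype, predicted, gold in records:
        total[qtype] += 1
        correct[qtype] += int(predicted == gold)
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


def macro_average(per_type):
    """Unweighted mean of the per-type accuracies (assumed averaging scheme)."""
    return sum(per_type.values()) / len(per_type)


# Toy predictions, not real benchmark data.
records = [
    ("Action", "A", "A"),
    ("Action", "B", "C"),
    ("Location", "D", "D"),
    ("Object Parts", "A", "A"),
    ("Object Parts", "B", "B"),
]
per_type = accuracy_by_type(records)
print(per_type)
print(macro_average(per_type))
```

Averaging over question types keeps a type with few questions from being swamped by a larger one; averaging over all QA pairs would instead weight each question equally.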
Related papers
- Articulation in Motion: Prior-free Part Mobility Analysis for Articulated Objects By Dynamic-Static Disentanglement [33.737685950541795]
Articulation in Motion (AiM) reconstructs an interactive 3D digital replica from a user-object interaction video and a start-state scan. We propose a dual-Gaussian scene representation that is learned from an initial 3DGS scan of the object. It uses motion cues to segment the object into parts and assign articulation joints.
arXiv Detail & Related papers (2026-03-03T12:07:06Z)
- Interaction-aware Representation Modeling with Co-occurrence Consistency for Egocentric Hand-Object Parsing [20.40288070674112]
We propose an end-to-end Interaction-aware Transformer (InterFormer). It integrates three key components, i.e., a Dynamic Query Generator (DQG), a Dual-context Feature Selector (DFS), and the Conditional Co-occurrence (CoCo) loss. Our model achieves state-of-the-art performance on both the EgoHOS and the challenging out-of-distribution mini-HOI4D datasets.
arXiv Detail & Related papers (2026-02-24T06:39:18Z)
- SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding [48.64661382961745]
We introduce Spatio-temporal Video Action Grounding (SVAG), a novel task that requires models to simultaneously detect, track, and temporally localize all referent objects in videos. To support this task, we construct SVAG-Bench, a large-scale benchmark comprising 688 videos, 19,590 annotated records, and 903 unique verbs. Empirical results show that existing models perform poorly on SVAG, particularly in dense or complex scenes.
arXiv Detail & Related papers (2025-10-14T22:10:49Z)
- DynamicEval: Rethinking Evaluation for Dynamic Text-to-Video Synthesis [17.750053029702222]
Existing text-to-video (T2V) evaluation benchmarks, such as VBench and EvalCrafter, suffer from two limitations. We introduce DynamicEval, a benchmark consisting of systematically curated prompts emphasizing dynamic camera motion. For background scene consistency, we obtain interpretable error maps based on the VBench motion smoothness metric. Our proposed metrics achieve stronger correlations with human preferences at both the video level and the model level.
arXiv Detail & Related papers (2025-10-08T18:41:04Z)
- Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model [72.90370736032115]
We present a novel video Reenactment framework focusing on Human-Object Interaction (HOI) via an adaptive layout-instructed Diffusion model (Re-HOLD). Our key insight is to employ specialized layout representation for hands and objects, respectively. To further improve the generation quality of HOI, we design an interactive textural enhancement module for both hands and objects.
arXiv Detail & Related papers (2025-03-21T08:40:35Z)
- Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning [71.02843679746563]
In egocentric video understanding, the motion of hands and objects as well as their interactions play a significant role by nature. In this work, we aim to integrate the modeling of fine-grained hand-object dynamics into the video representation learning process. We propose EgoVideo, a model with a new lightweight motion adapter to capture fine-grained hand-object motion information.
arXiv Detail & Related papers (2025-03-02T18:49:48Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- InterTracker: Discovering and Tracking General Objects Interacting with Hands in the Wild [40.489171608114574]
Existing methods rely on frame-based detectors to locate interacting objects.
We propose to leverage hand-object interaction to track interactive objects.
Our proposed method outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2023-08-06T09:09:17Z)
- Interacting Hand-Object Pose Estimation via Dense Mutual Attention [97.26400229871888]
3D hand-object pose estimation is the key to the success of many computer vision applications.
We propose a novel dense mutual attention mechanism that is able to model fine-grained dependencies between the hand and the object.
Our method is able to produce physically plausible poses with high quality and real-time inference speed.
arXiv Detail & Related papers (2022-11-16T10:01:33Z)
- ASOD60K: Audio-Induced Salient Object Detection in Panoramic Videos [79.05486554647918]
We propose PV-SOD, a new task that aims to segment salient objects from panoramic videos.
In contrast to existing fixation-level or object-level saliency detection tasks, we focus on multi-modal salient object detection (SOD).
We collect the first large-scale dataset, named ASOD60K, which contains 4K-resolution video frames annotated with a six-level hierarchy.
arXiv Detail & Related papers (2021-07-24T15:14:20Z)
- Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering [27.979053252431306]
Video Question Answering (Video QA) is a powerful testbed to develop new AI capabilities.
We propose an object-oriented reasoning approach in that video is abstracted as a dynamic stream of interacting objects.
This mechanism is materialized into a family of general-purpose neural units and their multi-level architecture.
arXiv Detail & Related papers (2021-06-25T05:12:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.