Beyond Sequences: A Benchmark for Atomic Hand-Object Interaction Using a Static RNN Encoder
- URL: http://arxiv.org/abs/2512.09626v1
- Date: Wed, 10 Dec 2025 13:11:43 GMT
- Title: Beyond Sequences: A Benchmark for Atomic Hand-Object Interaction Using a Static RNN Encoder
- Authors: Yousef Azizi Movahed, Fatemeh Ziaeetabar
- Abstract summary: We introduce a structured data engineering process that converts raw videos from the MANIAC dataset into 27,476 statistical-kinematic feature vectors. Our model successfully overcame the most challenging transitional class, 'grabbing', by achieving a balanced F1-score of 0.90. These findings provide a new benchmark for low-level hand-object interaction recognition using structured, interpretable features and lightweight architectures.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reliably predicting human intent in hand-object interactions is an open challenge for computer vision. Our research concentrates on a fundamental sub-problem: the fine-grained classification of atomic interaction states, namely 'approaching', 'grabbing', and 'holding'. To this end, we introduce a structured data engineering process that converts raw videos from the MANIAC dataset into 27,476 statistical-kinematic feature vectors. Each vector encapsulates relational and dynamic properties from a short temporal window of motion. Our initial hypothesis posited that sequential modeling would be critical, leading us to compare static classifiers (MLPs) against temporal models (RNNs). Counter-intuitively, the key discovery occurred when we set the sequence length of a Bidirectional RNN to one (seq_length=1). This modification converted the network's function, compelling it to act as a high-capacity static feature encoder. This architectural change directly led to a significant accuracy improvement, culminating in a final score of 97.60%. Of particular note, our optimized model successfully overcame the most challenging transitional class, 'grabbing', by achieving a balanced F1-score of 0.90. These findings provide a new benchmark for low-level hand-object interaction recognition using structured, interpretable features and lightweight architectures.
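The paper's central observation, that a bidirectional RNN with seq_length=1 degenerates into a static feature encoder, can be verified directly: with a single time step, both recurrent terms only ever see their zero initial states, so each direction collapses to a one-layer static mapping. The following NumPy sketch demonstrates this for a vanilla bidirectional RNN; the feature and hidden dimensions are hypothetical, not the paper's actual sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def birnn_encode(x_seq, Wf, Uf, bf, Wb, Ub, bb):
    """Run a bidirectional vanilla RNN over a sequence and return
    the concatenated final forward / final backward hidden states."""
    T = x_seq.shape[0]
    hf = np.zeros(Wf.shape[0])
    for t in range(T):                      # forward pass
        hf = np.tanh(Wf @ x_seq[t] + Uf @ hf + bf)
    hb = np.zeros(Wb.shape[0])
    for t in reversed(range(T)):            # backward pass
        hb = np.tanh(Wb @ x_seq[t] + Ub @ hb + bb)
    return np.concatenate([hf, hb])

feat_dim, hidden = 12, 8                    # hypothetical sizes
Wf = rng.normal(size=(hidden, feat_dim))
Wb = rng.normal(size=(hidden, feat_dim))
Uf = rng.normal(size=(hidden, hidden))
Ub = rng.normal(size=(hidden, hidden))
bf = rng.normal(size=hidden)
bb = rng.normal(size=hidden)

x = rng.normal(size=(1, feat_dim))          # seq_length = 1
enc = birnn_encode(x, Wf, Uf, bf, Wb, Ub, bb)

# With one step, the recurrent weights Uf/Ub only multiply zero
# states, so each direction reduces to a static layer tanh(W x + b):
static = np.concatenate([np.tanh(Wf @ x[0] + bf),
                         np.tanh(Wb @ x[0] + bb)])
assert np.allclose(enc, static)
```

In other words, the "BiRNN" at seq_length=1 is a doubled-width nonlinear projection of the feature vector, which is consistent with the paper's framing of it as a high-capacity static encoder rather than a temporal model.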
Related papers
- Bottleneck Transformer-Based Approach for Improved Automatic STOI Score Prediction [16.426476430697587]
We present a novel approach to predict the Short-Time Objective Intelligibility (STOI) metric using a bottleneck transformer architecture. Our model has shown higher correlation and lower mean squared error for both seen and unseen scenarios.
arXiv Detail & Related papers (2026-02-17T10:46:54Z) - URDF-Anything: Constructing Articulated Objects with 3D Multimodal Language Model [76.08429266631823]
We propose an end-to-end automatic reconstruction framework based on a 3D multimodal large language model (MLLM). URDF-Anything utilizes an autoregressive prediction framework based on point-cloud and text multimodal input to jointly optimize geometric segmentation and kinematic parameter prediction. Experiments on both simulated and real-world datasets demonstrate that our method significantly outperforms existing approaches.
arXiv Detail & Related papers (2025-11-02T13:45:51Z) - Knowledge-Informed Neural Network for Complex-Valued SAR Image Recognition [51.03674130115878]
We introduce the Knowledge-Informed Neural Network (KINN), a lightweight framework built upon a novel "compression-aggregation-compression" architecture. KINN establishes a state-of-the-art in parameter-efficient recognition, offering exceptional generalization in data-scarce and out-of-distribution scenarios.
arXiv Detail & Related papers (2025-10-23T07:12:26Z) - TRACE: Learning to Compute on Graphs [15.34239150750753]
We introduce TRACE, a new paradigm built on an architecturally sound backbone and a principled learning objective. First, TRACE employs a Hierarchical Transformer that mirrors the step-by-step flow of computation. Second, we introduce function shift learning, a novel objective that decouples the learning problem.
arXiv Detail & Related papers (2025-09-26T05:22:32Z) - Explicit Multimodal Graph Modeling for Human-Object Interaction Detection [11.15526365654911]
Graph Neural Networks (GNNs) are inherently better suited for this task, as they explicitly model the relationships between human-object pairs. We propose Multimodal Graph Network Modeling (MGNM), which leverages GNN-based relational structures to enhance HOI detection.
arXiv Detail & Related papers (2025-09-16T01:17:49Z) - PESTO: Real-Time Pitch Estimation with Self-supervised Transposition-equivariant Objective [28.829305407116962]
PESTO is a self-supervised learning approach for single-pitch estimation. We develop a streamable VQT implementation using cached convolutions.
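The cached-convolution idea behind streamable implementations like this can be sketched generically: by retaining the last kernel_size-1 input samples between chunks, chunk-by-chunk processing reproduces the offline convolution exactly. This is an illustrative NumPy sketch of the general technique, not PESTO's actual code.

```python
import numpy as np

class CachedConv1d:
    """Streaming causal 1-D convolution. A cache of the last
    kernel_size-1 input samples lets chunked processing match the
    offline (zero-padded) result sample for sample."""

    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=float)
        # Initial cache of zeros plays the role of causal zero-padding.
        self.cache = np.zeros(len(self.kernel) - 1)

    def process(self, chunk):
        buf = np.concatenate([self.cache, chunk])
        out = np.convolve(buf, self.kernel, mode="valid")
        self.cache = buf[len(buf) - (len(self.kernel) - 1):]
        return out

kernel = np.array([0.5, 0.3, 0.2])
signal = np.arange(10, dtype=float)

# Offline reference: zero-pad causally, then one full convolution.
offline = np.convolve(np.concatenate([np.zeros(2), signal]),
                      kernel, mode="valid")

# Streaming: feed the same signal in three uneven chunks.
conv = CachedConv1d(kernel)
streamed = np.concatenate([conv.process(c)
                           for c in np.split(signal, [3, 7])])
assert np.allclose(streamed, offline)
```

The same buffering principle extends to strided and dilated convolutions (with a correspondingly larger cache), which is what makes frame-by-frame spectral front ends like a streamable VQT feasible.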
arXiv Detail & Related papers (2025-08-02T21:00:55Z) - RoHOI: Robustness Benchmark for Human-Object Interaction Detection [84.78366452133514]
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. We introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric.
arXiv Detail & Related papers (2025-07-12T01:58:04Z) - Higher-Order Convolution Improves Neural Predictivity in the Retina [0.7916635054977068]
We present a novel approach to neural response prediction that incorporates higher-order operations directly within convolutional neural networks (CNNs). Our model extends traditional 3D CNNs by embedding higher-order operations within the convolutional operator itself. We evaluate our approach on two distinct datasets: salamander retinal ganglion cell (RGC) responses to natural scenes, and a new dataset of mouse RGC responses to controlled geometric transformations.
arXiv Detail & Related papers (2025-05-12T14:43:32Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
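The cluster-then-attend pattern described above can be sketched in a few lines: cluster the keys, average keys and values within each cluster, then run ordinary dense attention over the much shorter centroid sequence. This toy NumPy version uses plain k-means and made-up dimensions to illustrate the token-count reduction; it is not ClusTR's actual algorithm.

```python
import numpy as np

def clustered_attention(q, k, v, n_clusters=4, iters=10, seed=0):
    """Content-based sparse attention sketch: k-means-cluster the
    keys, aggregate keys/values per cluster, then attend densely
    over the n_clusters centroids instead of all len(k) tokens."""
    rng = np.random.default_rng(seed)
    centers = k[rng.choice(len(k), n_clusters, replace=False)]
    for _ in range(iters):                          # plain k-means on keys
        d = ((k[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for c in range(n_clusters):
            if (assign == c).any():
                centers[c] = k[assign == c].mean(0)
    # Aggregate values the same way (fall back to the global mean
    # if a cluster ends up empty).
    v_agg = np.stack([v[assign == c].mean(0) if (assign == c).any()
                      else v.mean(0) for c in range(n_clusters)])
    scores = q @ centers.T / np.sqrt(q.shape[-1])   # attention over centroids
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v_agg

q = np.random.default_rng(1).normal(size=(6, 16))   # 6 queries
k = np.random.default_rng(2).normal(size=(64, 16))  # 64 key tokens
v = np.random.default_rng(3).normal(size=(64, 16))
out = clustered_attention(q, k, v)                  # attends over 4 centroids
```

Here each query scores 4 centroids instead of 64 keys, which is the source of the computational saving; the quality of the result then hinges on how well the clusters preserve the semantic diversity of the original tokens.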
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
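The set-based loss rests on a one-to-one bipartite matching between predictions and ground-truth objects: each prediction is assigned to at most one target so duplicates are penalized. A minimal sketch of that matching step, using brute-force search over permutations in place of the Hungarian algorithm DETR actually uses (the cost matrix below is invented for illustration):

```python
from itertools import permutations

import numpy as np

def optimal_match(cost):
    """Optimal one-to-one assignment minimizing total cost.
    Brute force is fine for tiny N; DETR solves the same objective
    with the Hungarian algorithm."""
    n = cost.shape[0]
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i, p[i]] for i in range(n)))
    return list(best)

# Toy cost matrix: rows = predictions, cols = ground-truth objects,
# entries = matching cost (in DETR, a class term plus box terms).
cost = np.array([[0.9, 0.1, 0.8],
                 [0.2, 0.7, 0.9],
                 [0.6, 0.8, 0.1]])
match = optimal_match(cost)  # prediction i is paired with target match[i]
# → [1, 0, 2]: total cost 0.1 + 0.2 + 0.1 = 0.4
```

The per-pair losses are then computed only on these matched pairs (unmatched predictions are trained toward a "no object" class), which is what removes the need for NMS and other hand-designed deduplication components.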
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.