Beyond Sequences: A Benchmark for Atomic Hand-Object Interaction Using a Static RNN Encoder
- URL: http://arxiv.org/abs/2512.09626v1
- Date: Wed, 10 Dec 2025 13:11:43 GMT
- Title: Beyond Sequences: A Benchmark for Atomic Hand-Object Interaction Using a Static RNN Encoder
- Authors: Yousef Azizi Movahed, Fatemeh Ziaeetabar
- Abstract summary: We introduce a structured data engineering process that converts raw videos from the MANIAC dataset into 27,476 statistical-kinematic feature vectors. Our model successfully overcame the most challenging transitional class, 'grabbing', by achieving a balanced F1-score of 0.90. These findings provide a new benchmark for low-level hand-object interaction recognition using structured, interpretable features and lightweight architectures.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reliably predicting human intent in hand-object interactions is an open challenge for computer vision. Our research concentrates on a fundamental sub-problem: the fine-grained classification of atomic interaction states, namely 'approaching', 'grabbing', and 'holding'. To this end, we introduce a structured data engineering process that converts raw videos from the MANIAC dataset into 27,476 statistical-kinematic feature vectors. Each vector encapsulates relational and dynamic properties from a short temporal window of motion. Our initial hypothesis posited that sequential modeling would be critical, leading us to compare static classifiers (MLPs) against temporal models (RNNs). Counter-intuitively, the key discovery occurred when we set the sequence length of a Bidirectional RNN to one (seq_length=1). This modification converted the network's function, compelling it to act as a high-capacity static feature encoder. This architectural change directly led to a significant accuracy improvement, culminating in a final score of 97.60%. Of particular note, our optimized model successfully overcame the most challenging transitional class, 'grabbing', by achieving a balanced F1-score of 0.90. These findings provide a new benchmark for low-level hand-object interaction recognition using structured, interpretable features and lightweight architectures.
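The paper's central observation, that a bidirectional RNN with seq_length=1 degenerates into a static feature encoder, can be verified directly: with a single time step, both recurrent terms only ever see their zero initial states, so each direction collapses to a one-layer static mapping. The following NumPy sketch demonstrates this for a vanilla bidirectional RNN; the feature and hidden dimensions are hypothetical, not the paper's actual sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def birnn_encode(x_seq, Wf, Uf, bf, Wb, Ub, bb):
    """Run a bidirectional vanilla RNN over a sequence and return
    the concatenated final forward / final backward hidden states."""
    T = x_seq.shape[0]
    hf = np.zeros(Wf.shape[0])
    for t in range(T):                      # forward pass
        hf = np.tanh(Wf @ x_seq[t] + Uf @ hf + bf)
    hb = np.zeros(Wb.shape[0])
    for t in reversed(range(T)):            # backward pass
        hb = np.tanh(Wb @ x_seq[t] + Ub @ hb + bb)
    return np.concatenate([hf, hb])

feat_dim, hidden = 12, 8                    # hypothetical sizes
Wf = rng.normal(size=(hidden, feat_dim))
Wb = rng.normal(size=(hidden, feat_dim))
Uf = rng.normal(size=(hidden, hidden))
Ub = rng.normal(size=(hidden, hidden))
bf = rng.normal(size=hidden)
bb = rng.normal(size=hidden)

x = rng.normal(size=(1, feat_dim))          # seq_length = 1
enc = birnn_encode(x, Wf, Uf, bf, Wb, Ub, bb)

# With one step, the recurrent weights Uf/Ub only multiply zero
# states, so each direction reduces to a static layer tanh(W x + b):
static = np.concatenate([np.tanh(Wf @ x[0] + bf),
                         np.tanh(Wb @ x[0] + bb)])
assert np.allclose(enc, static)
```

In other words, the "BiRNN" at seq_length=1 is a doubled-width nonlinear projection of the feature vector, which is consistent with the paper's framing of it as a high-capacity static encoder rather than a temporal model.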
Related papers
- Bottleneck Transformer-Based Approach for Improved Automatic STOI Score Prediction [16.426476430697587]
We present a novel approach to predict the Short-Time Objective Intelligibility (STOI) metric using a bottleneck transformer architecture. Our model has shown higher correlation and lower mean squared error for both seen and unseen scenarios.
arXiv Detail & Related papers (2026-02-17T10:46:54Z) - URDF-Anything: Constructing Articulated Objects with 3D Multimodal Language Model [76.08429266631823]
We propose an end-to-end automatic reconstruction framework based on a 3D multimodal large language model (MLLM). URDF-Anything utilizes an autoregressive prediction framework based on point-cloud and text multimodal input to jointly optimize geometric segmentation and kinematic parameter prediction. Experiments on both simulated and real-world datasets demonstrate that our method significantly outperforms existing approaches.
arXiv Detail & Related papers (2025-11-02T13:45:51Z) - Knowledge-Informed Neural Network for Complex-Valued SAR Image Recognition [51.03674130115878]
We introduce the Knowledge-Informed Neural Network (KINN), a lightweight framework built upon a novel "compression-aggregation-compression" architecture. KINN establishes a state-of-the-art in parameter-efficient recognition, offering exceptional generalization in data-scarce and out-of-distribution scenarios.
arXiv Detail & Related papers (2025-10-23T07:12:26Z) - TRACE: Learning to Compute on Graphs [15.34239150750753]
We introduce TRACE, a new paradigm built on an architecturally sound backbone and a principled learning objective. First, TRACE employs a Hierarchical Transformer that mirrors the step-by-step flow of computation. Second, we introduce function shift learning, a novel objective that decouples the learning problem.
arXiv Detail & Related papers (2025-09-26T05:22:32Z) - Explicit Multimodal Graph Modeling for Human-Object Interaction Detection [11.15526365654911]
Graph Neural Networks (GNNs) are inherently better suited for this task, as they explicitly model the relationships between human-object pairs. We propose Multimodal Graph Network Modeling (MGNM), which leverages GNN-based relational structures to enhance HOI detection.
arXiv Detail & Related papers (2025-09-16T01:17:49Z) - PESTO: Real-Time Pitch Estimation with Self-supervised Transposition-equivariant Objective [28.829305407116962]
PESTO is a self-supervised learning approach for single-pitch estimation. We develop a streamable VQT implementation using cached convolutions.
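The cached-convolution idea behind streamable implementations like this can be sketched generically: by retaining the last kernel_size-1 input samples between chunks, chunk-by-chunk processing reproduces the offline convolution exactly. This is an illustrative NumPy sketch of the general technique, not PESTO's actual code.

```python
import numpy as np

class CachedConv1d:
    """Streaming causal 1-D convolution. A cache of the last
    kernel_size-1 input samples lets chunked processing match the
    offline (zero-padded) result sample for sample."""

    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=float)
        # Initial cache of zeros plays the role of causal zero-padding.
        self.cache = np.zeros(len(self.kernel) - 1)

    def process(self, chunk):
        buf = np.concatenate([self.cache, chunk])
        out = np.convolve(buf, self.kernel, mode="valid")
        self.cache = buf[len(buf) - (len(self.kernel) - 1):]
        return out

kernel = np.array([0.5, 0.3, 0.2])
signal = np.arange(10, dtype=float)

# Offline reference: zero-pad causally, then one full convolution.
offline = np.convolve(np.concatenate([np.zeros(2), signal]),
                      kernel, mode="valid")

# Streaming: feed the same signal in three uneven chunks.
conv = CachedConv1d(kernel)
streamed = np.concatenate([conv.process(c)
                           for c in np.split(signal, [3, 7])])
assert np.allclose(streamed, offline)
```

The same buffering principle extends to strided and dilated convolutions (with a correspondingly larger cache), which is what makes frame-by-frame spectral front ends like a streamable VQT feasible.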
arXiv Detail & Related papers (2025-08-02T21:00:55Z) - RoHOI: Robustness Benchmark for Human-Object Interaction Detection [84.78366452133514]
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. We introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric.
arXiv Detail & Related papers (2025-07-12T01:58:04Z) - Higher-Order Convolution Improves Neural Predictivity in the Retina [0.7916635054977068]
We present a novel approach to neural response prediction that incorporates higher-order operations directly within convolutional neural networks (CNNs). Our model extends traditional 3D CNNs by embedding higher-order operations within the convolutional operator itself. We evaluate our approach on two distinct datasets: salamander retinal ganglion cell (RGC) responses to natural scenes, and a new dataset of mouse RGC responses to controlled geometric transformations.
arXiv Detail & Related papers (2025-05-12T14:43:32Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
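The cluster-then-attend pattern described above can be sketched in a few lines: cluster the keys, average keys and values within each cluster, then run ordinary dense attention over the much shorter centroid sequence. This toy NumPy version uses plain k-means and made-up dimensions to illustrate the token-count reduction; it is not ClusTR's actual algorithm.

```python
import numpy as np

def clustered_attention(q, k, v, n_clusters=4, iters=10, seed=0):
    """Content-based sparse attention sketch: k-means-cluster the
    keys, aggregate keys/values per cluster, then attend densely
    over the n_clusters centroids instead of all len(k) tokens."""
    rng = np.random.default_rng(seed)
    centers = k[rng.choice(len(k), n_clusters, replace=False)]
    for _ in range(iters):                          # plain k-means on keys
        d = ((k[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for c in range(n_clusters):
            if (assign == c).any():
                centers[c] = k[assign == c].mean(0)
    # Aggregate values the same way (fall back to the global mean
    # if a cluster ends up empty).
    v_agg = np.stack([v[assign == c].mean(0) if (assign == c).any()
                      else v.mean(0) for c in range(n_clusters)])
    scores = q @ centers.T / np.sqrt(q.shape[-1])   # attention over centroids
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v_agg

q = np.random.default_rng(1).normal(size=(6, 16))   # 6 queries
k = np.random.default_rng(2).normal(size=(64, 16))  # 64 key tokens
v = np.random.default_rng(3).normal(size=(64, 16))
out = clustered_attention(q, k, v)                  # attends over 4 centroids
```

Here each query scores 4 centroids instead of 64 keys, which is the source of the computational saving; the quality of the result then hinges on how well the clusters preserve the semantic diversity of the original tokens.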
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
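The set-based loss rests on a one-to-one bipartite matching between predictions and ground-truth objects: each prediction is assigned to at most one target so duplicates are penalized. A minimal sketch of that matching step, using brute-force search over permutations in place of the Hungarian algorithm DETR actually uses (the cost matrix below is invented for illustration):

```python
from itertools import permutations

import numpy as np

def optimal_match(cost):
    """Optimal one-to-one assignment minimizing total cost.
    Brute force is fine for tiny N; DETR solves the same objective
    with the Hungarian algorithm."""
    n = cost.shape[0]
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i, p[i]] for i in range(n)))
    return list(best)

# Toy cost matrix: rows = predictions, cols = ground-truth objects,
# entries = matching cost (in DETR, a class term plus box terms).
cost = np.array([[0.9, 0.1, 0.8],
                 [0.2, 0.7, 0.9],
                 [0.6, 0.8, 0.1]])
match = optimal_match(cost)  # prediction i is paired with target match[i]
# → [1, 0, 2]: total cost 0.1 + 0.2 + 0.1 = 0.4
```

The per-pair losses are then computed only on these matched pairs (unmatched predictions are trained toward a "no object" class), which is what removes the need for NMS and other hand-designed deduplication components.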
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.