FPI-Det: A Face-Phone Interaction Dataset for Phone-Use Detection and Understanding
- URL: http://arxiv.org/abs/2509.09111v1
- Date: Thu, 11 Sep 2025 02:50:03 GMT
- Title: FPI-Det: A Face-Phone Interaction Dataset for Phone-Use Detection and Understanding
- Authors: Jianqin Gao, Tianqi Wang, Yu Zhang, Yishu Zhang, Chenyuan Wang, Allan Dong, Zihao Wang,
- Abstract summary: Mobile devices have created new challenges for vision systems in safety monitoring, workplace productivity assessment, and attention management. We introduce FPI-Det, containing 22,879 images with synchronized annotations for faces and phones across workplace, education, transportation, and public scenarios.
- Score: 20.181223336698675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The widespread use of mobile devices has created new challenges for vision systems in safety monitoring, workplace productivity assessment, and attention management. Detecting whether a person is using a phone requires not only object recognition but also an understanding of behavioral context, which involves reasoning about the relationship between faces, hands, and devices under diverse conditions. Existing generic benchmarks do not fully capture such fine-grained human-device interactions. To address this gap, we introduce FPI-Det, containing 22,879 images with synchronized annotations for faces and phones across workplace, education, transportation, and public scenarios. The dataset features extreme scale variation, frequent occlusions, and varied capture conditions. We evaluate representative YOLO and DETR detectors, providing baseline results and an analysis of performance across object sizes, occlusion levels, and environments. The source code and dataset are available at https://github.com/KvCgRv/FPI-Det.
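As a concrete illustration of the detection-plus-context task the abstract describes, the sketch below runs a YOLO detector over an image and pairs each detected face with its nearest detected phone. This is not the authors' released evaluation code: the checkpoint name, the class names "face" and "phone", and the distance heuristic are all assumptions.

```python
# A minimal baseline sketch, not the authors' released code. It assumes a YOLO
# checkpoint fine-tuned on FPI-Det's two classes and the (hypothetical) class
# names "face" and "phone".
from ultralytics import YOLO

model = YOLO("fpi_det_yolo.pt")  # hypothetical fine-tuned weights

def face_phone_pairs(image_path, max_rel_dist=2.0):
    """Detect faces and phones, then pair each face with its nearest phone.

    A face is flagged as likely phone use when the nearest phone's center lies
    within max_rel_dist face-widths of the face center -- a crude stand-in for
    the behavioral-context reasoning the abstract describes.
    """
    result = model(image_path)[0]
    names = result.names  # class-id -> class-name mapping
    faces, phones = [], []
    for box, cls in zip(result.boxes.xyxy.tolist(), result.boxes.cls.tolist()):
        x1, y1, x2, y2 = box
        center = ((x1 + x2) / 2, (y1 + y2) / 2)
        if names[int(cls)] == "face":
            faces.append((center, x2 - x1))  # keep face width for normalization
        elif names[int(cls)] == "phone":
            phones.append(center)
    pairs = []
    for face_center, face_width in faces:
        if not phones:
            pairs.append((face_center, None, False))
            continue
        nearest = min(phones, key=lambda p: (p[0] - face_center[0]) ** 2
                                            + (p[1] - face_center[1]) ** 2)
        dist = ((nearest[0] - face_center[0]) ** 2
                + (nearest[1] - face_center[1]) ** 2) ** 0.5
        pairs.append((face_center, nearest, dist < max_rel_dist * face_width))
    return pairs
```

A real evaluation would instead compare detections against the dataset's synchronized face/phone ground truth, stratified by object size and occlusion level as in the paper's analysis.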
Related papers
- From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics [0.0]
This paper investigates the capabilities of state-of-the-art Visual Language Models (VLMs) for the task of Scene and Action Recognition.
The proposed pipeline is evaluated on a diverse dataset consisting of various real-world cityscapes, on-campus, and indoor scenarios.
The experimental evaluation discusses the potential of these small models on edge devices, with particular emphasis on challenges, weaknesses, inherent model biases, and the application of the gained information.
arXiv Detail & Related papers (2025-11-04T09:58:29Z)
- Quantifying the Impact of Motion on 2D Gaze Estimation in Real-World Mobile Interactions [18.294511216241805]
This paper provides empirical evidence on how user mobility and behaviour affect mobile gaze tracking accuracy.
Head distance, head pose, and device orientation are key factors affecting accuracy.
Findings highlight the need for more robust, adaptive eye-tracking systems.
arXiv Detail & Related papers (2025-02-14T21:44:52Z)
- On-device modeling of user's social context and familiar places from smartphone-embedded sensor data [7.310043452300736]
This paper proposes an unsupervised and lightweight approach to model the user's social context and locations directly on the mobile device.
For the social context, the approach utilizes data on physical and cyber social interactions among users and their devices.
The effectiveness of the proposed approach is demonstrated through three sets of experiments, employing five real-world datasets.
arXiv Detail & Related papers (2023-06-27T12:53:14Z)
- Video-based Pose-Estimation Data as Source for Transfer Learning in Human Activity Recognition [71.91734471596433]
Human Activity Recognition (HAR) using on-body devices identifies specific human actions in unconstrained environments.
Previous works demonstrated that transfer learning is a good strategy for addressing scenarios with scarce data.
This paper proposes using datasets intended for human-pose estimation as a source for transfer learning.
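The transfer recipe behind this approach is easy to state in code: pretrain a sequence encoder on a large pose-derived source task, then reuse the encoder weights and attach a fresh classification head for the scarce-data HAR target. A minimal PyTorch sketch; the GRU encoder, dimensions, and class counts are illustrative assumptions, not the paper's architecture.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Shared sequence encoder over (batch, time, channels) inputs."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)

    def forward(self, x):
        _, h = self.gru(x)
        return h[-1]  # last hidden state as the sequence embedding

# 1) Pretrain encoder + source head on pose-estimation-derived labels.
encoder = Encoder(in_dim=51)     # e.g. 17 joints x (x, y, conf) -- assumed
source_head = nn.Linear(64, 20)  # assumed number of source classes
# ... train encoder and source_head on the pose-derived source data ...

# 2) Transfer: keep the encoder weights, attach a fresh head for HAR.
target_head = nn.Linear(64, 8)   # assumed number of on-body HAR classes
# Optionally freeze the encoder for the first fine-tuning epochs:
# for p in encoder.parameters():
#     p.requires_grad = False
# ... fine-tune encoder and target_head on the scarce target HAR data ...
```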
arXiv Detail & Related papers (2022-12-02T18:19:36Z)
- On-device modeling of user's social context and familiar places from smartphone-embedded sensor data [7.310043452300736]
We propose a novel, unsupervised and lightweight approach to model the user's social context and her locations.
We exploit data related to both physical and cyber social interactions among users and their devices.
We show the performance of three machine learning algorithms in recognizing daily-life situations.
arXiv Detail & Related papers (2022-05-18T08:32:26Z)
- Egocentric Human-Object Interaction Detection Exploiting Synthetic Data [19.220651860718892]
We consider the problem of detecting Egocentric Human-Object Interactions (EHOIs) in industrial contexts.
We propose a pipeline and a tool to generate photo-realistic synthetic First Person Vision (FPV) images automatically labeled for EHOI detection.
arXiv Detail & Related papers (2022-04-14T15:59:15Z)
- TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
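The learned visibility indicator naturally acts as a mask on the pose objective, so invisible joints contribute no error. A small sketch of that idea (shapes assumed; this is not TRiPOD's exact loss):

```python
import torch

def visibility_masked_loss(pred, target, visible):
    """L2 pose error counted only at joints marked visible.

    pred, target: (T, J, 2) joint coordinates over T frames and J joints.
    visible:      (T, J) indicator in {0, 1}, learned or annotated per frame.
    """
    err = ((pred - target) ** 2).sum(dim=-1)  # per-joint squared error
    return (err * visible).sum() / visible.sum().clamp(min=1)
```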
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
- A Computer Vision System to Help Prevent the Transmission of COVID-19 [79.62140902232628]
The COVID-19 pandemic affects every area of daily life globally.
Health organizations advise social distancing, wearing face masks, and avoiding touching one's face.
We developed a deep learning-based computer vision system to help prevent the transmission of COVID-19.
arXiv Detail & Related papers (2021-03-16T00:00:04Z)
- DRG: Dual Relation Graph for Human-Object Interaction Detection [65.50707710054141]
We tackle the challenging problem of human-object interaction (HOI) detection.
Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features.
In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph.
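Schematically, each human-object pair becomes a graph node carrying a spatial-semantic feature, and context is aggregated twice: across pairs sharing the same human, and across pairs sharing the same object. The sketch below shows one such attention-weighted aggregation step; the scoring function and shapes are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def relation_aggregate(pair_feats, group_ids):
    """Attention-weighted context aggregation within each subgraph.

    pair_feats: (N, D) spatial-semantic features, one per human-object pair.
    group_ids:  (N,) pairs sharing an id (same human, or same object) form one
                subgraph and exchange contextual information.
    """
    out = pair_feats.clone()
    for g in group_ids.unique():
        idx = (group_ids == g).nonzero(as_tuple=True)[0]
        feats = pair_feats[idx]  # nodes of this subgraph
        attn = F.softmax(feats @ feats.T / feats.shape[1] ** 0.5, dim=-1)
        out[idx] = attn @ feats  # contextualized pair features
    return out

# Dual relation graph: human-centric pass, then object-centric pass.
# pair_feats = relation_aggregate(pair_feats, human_ids)
# pair_feats = relation_aggregate(pair_feats, object_ids)
```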
arXiv Detail & Related papers (2020-08-26T17:59:40Z)
- ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection [101.56529337489417]
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of <human, action, object> in images.
We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs.
Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into visual-semantic joint embedding space and obtains detection results by measuring their similarities.
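The zero-shot mechanism reduces to scoring in a shared space: project visual pair features and HOI-label word embeddings into one embedding space and rank labels by cosine similarity. A minimal sketch under assumed dimensions:

```python
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Project visual pair features and HOI-label word embeddings into one
    space, then score every (pair, label) combination by cosine similarity."""
    def __init__(self, vis_dim=1024, word_dim=300, joint_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, joint_dim)
        self.word_proj = nn.Linear(word_dim, joint_dim)

    def forward(self, pair_feats, label_embs):
        v = F.normalize(self.vis_proj(pair_feats), dim=-1)   # (N, joint_dim)
        w = F.normalize(self.word_proj(label_embs), dim=-1)  # (L, joint_dim)
        return v @ w.T  # (N, L) similarities; row-wise argmax = predicted HOI
```

Because only a label's word embedding is needed at test time, rare or unseen HOIs can be scored the same way as seen ones.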
arXiv Detail & Related papers (2020-08-14T09:11:18Z)
- GID-Net: Detecting Human-Object Interaction with Global and Instance Dependency [67.95192190179975]
We introduce a two-stage trainable reasoning mechanism, referred to as GID block.
GID-Net is a human-object interaction detection framework consisting of a human branch, an object branch and an interaction branch.
We compare the proposed GID-Net with existing state-of-the-art methods on two public benchmarks, V-COCO and HICO-DET.
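The three-branch layout can be pictured as independent action scorers whose logits are fused. The toy head below illustrates that structure; the fusion rule, dimensions, and the GID block's internals are assumptions here.

```python
import torch.nn as nn

class ThreeBranchHead(nn.Module):
    """Schematic three-branch HOI scorer: a human branch, an object branch,
    and an interaction branch each produce action logits, which are fused."""
    def __init__(self, feat_dim=512, num_actions=29):
        super().__init__()
        self.human_branch = nn.Linear(feat_dim, num_actions)
        self.object_branch = nn.Linear(feat_dim, num_actions)
        self.interaction_branch = nn.Linear(feat_dim, num_actions)

    def forward(self, human_feat, object_feat, pair_feat):
        # Late fusion by summing per-branch logits (the fusion rule here is
        # an assumption, not GID-Net's published formulation).
        return (self.human_branch(human_feat)
                + self.object_branch(object_feat)
                + self.interaction_branch(pair_feat))
```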
arXiv Detail & Related papers (2020-03-11T11:58:43Z)