MIDAS: Deep learning human action intention prediction from natural eye
movement patterns
- URL: http://arxiv.org/abs/2201.09135v1
- Date: Sat, 22 Jan 2022 21:52:42 GMT
- Title: MIDAS: Deep learning human action intention prediction from natural eye
movement patterns
- Authors: Paul Festor, Ali Shafti, Alex Harston, Michey Li, Pavel Orlov, A. Aldo
Faisal
- Abstract summary: We present an entirely data-driven approach to decode human intention for object manipulation tasks based solely on natural gaze cues.
Our results show that we can decode human intention of motion purely from natural gaze cues and object relative position, with $91.9\%$ accuracy.
- Score: 6.557082555839739
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Eye movements have long been studied as a window into the attentional
mechanisms of the human brain and have been made accessible as a novel style of
human-machine interface. However, not everything we gaze upon is something we
want to interact with; this is known as the Midas Touch problem for gaze
interfaces. To overcome the Midas Touch problem, present interfaces tend not to
rely on natural gaze cues, but rather use dwell time or gaze gestures. Here we
present an entirely data-driven approach to decode human
intention for object manipulation tasks based solely on natural gaze cues. We
run data collection experiments where 16 participants are given manipulation
and inspection tasks to be performed on various objects on a table in front of
them. The subjects' eye movements are recorded using wearable eye-trackers,
allowing the participants to freely move their heads and gaze upon the scene. We
use our Semantic Fovea, a convolutional neural network model, to identify the
objects in the scene and their relation to gaze traces at every frame. We then
evaluate the data and examine several ways to model the classification task for
intention prediction. Our evaluation shows that intention prediction is not a
naive result of the data, but rather relies on non-linear temporal processing
of gaze cues. We model the task as a time series classification problem and
design a bidirectional Long Short-Term Memory (LSTM) network architecture to
decode intentions. Our results show that we can decode human intention of
motion purely from natural gaze cues and object relative position, with
$91.9\%$ accuracy. Our work demonstrates the feasibility of natural gaze as a
Zero-UI interface for human-machine interaction, i.e., users will only need to
act naturally, and do not need to interact with the interface itself or deviate
from their natural eye movement patterns.
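
For a concrete picture of the final classification stage, below is a minimal sketch (in PyTorch) of a bidirectional LSTM time-series classifier of the kind the abstract describes. The feature layout (per-frame gaze position plus gaze-to-object relative position), the dimensions, and the two-class intention set are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a bidirectional LSTM intention classifier over per-frame
# gaze/object features. Dimensions and class set are assumptions for illustration.
import torch
import torch.nn as nn


class GazeIntentionBiLSTM(nn.Module):
    """Classify a window of per-frame gaze/object features into an intention
    (e.g. 'manipulate' vs. 'inspect')."""

    def __init__(self, feature_dim: int = 8, hidden_dim: int = 64,
                 num_classes: int = 2):
        super().__init__()
        # Bidirectional LSTM over the gaze time series.
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # Intention logits from the concatenated forward/backward final states.
        self.head = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feature_dim), e.g. per-frame gaze coordinates and
        # gaze-to-object relative position obtained from an object detector.
        _, (h_n, _) = self.lstm(x)               # h_n: (2, batch, hidden_dim)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2 * hidden_dim)
        return self.head(h)                      # (batch, num_classes) logits


if __name__ == "__main__":
    # Toy usage: 4 gaze windows, 120 frames each, 8 features per frame.
    model = GazeIntentionBiLSTM()
    logits = model(torch.randn(4, 120, 8))
    print(logits.shape)  # torch.Size([4, 2])
```

The bidirectional recurrence reflects the paper's observation that intention prediction relies on non-linear temporal processing of gaze cues rather than on any single frame; how the per-frame features are actually encoded in the original work is not specified here and is assumed.
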
Related papers
- A Transformer-Based Model for the Prediction of Human Gaze Behavior on Videos [10.149523817328921]
We introduce a novel method for simulating human gaze behavior.
Our approach uses a transformer-based reinforcement learning algorithm to train an agent that acts as a human observer.
arXiv Detail & Related papers (2024-04-10T21:14:33Z)
- Social-Transmotion: Promptable Human Trajectory Prediction [65.80068316170613]
Social-Transmotion is a generic Transformer-based model that exploits diverse and numerous visual cues to predict human behavior.
Our approach is validated on multiple datasets, including JTA, JRDB, Pedestrians and Cyclists in Road Traffic, and ETH-UCY.
arXiv Detail & Related papers (2023-12-26T18:56:49Z)
- Neural feels with neural fields: Visuo-tactile perception for in-hand manipulation [57.60490773016364]
We combine vision and touch sensing on a multi-fingered hand to estimate an object's pose and shape during in-hand manipulation.
Our method, NeuralFeels, encodes object geometry by learning a neural field online and jointly tracks it by optimizing a pose graph problem.
Our results demonstrate that touch, at the very least, refines and, at the very best, disambiguates visual estimates during in-hand manipulation.
arXiv Detail & Related papers (2023-12-20T22:36:37Z)
- Pose2Gaze: Eye-body Coordination during Daily Activities for Gaze Prediction from Full-body Poses [11.545286742778977]
We first report a comprehensive analysis of eye-body coordination in various human-object and human-human interaction activities.
We then present Pose2Gaze, an eye-body coordination model that uses a convolutional neural network to extract features from head direction and full-body poses.
arXiv Detail & Related papers (2023-12-19T10:55:46Z)
- Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations [61.659439423703155]
TOHO: Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations.
Our method generates continuous motions that are parameterized only by the temporal coordinate.
This work takes a step further toward general human-scene interaction simulation.
arXiv Detail & Related papers (2023-03-23T09:31:56Z)
- Modeling Human Eye Movements with Neural Networks in a Maze-Solving Task [2.092312847886424]
We build deep generative models of eye movements using a novel differentiable architecture for gaze fixations and gaze shifts.
We find that human eye movements are best predicted by a model that is optimized not to perform the task as efficiently as possible but instead to run an internal simulation of an object traversing the maze.
arXiv Detail & Related papers (2022-12-20T15:48:48Z)
- GIMO: Gaze-Informed Human Motion Prediction in Context [75.52839760700833]
We propose a large-scale human motion dataset that delivers high-quality body pose sequences, scene scans, and ego-centric views with eye gaze.
Our data collection is not tied to specific scenes, which further boosts the motion dynamics observed from our subjects.
To realize the full potential of gaze, we propose a novel network architecture that enables bidirectional communication between the gaze and motion branches.
arXiv Detail & Related papers (2022-04-20T13:17:39Z)
- TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
- What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations.
Our experiments show that our "muscly-supervised" representation outperforms a visual-only state-of-the-art method MoCo.
arXiv Detail & Related papers (2020-10-16T17:46:53Z)