Flow Snapshot Neurons in Action: Deep Neural Networks Generalize to Biological Motion Perception
- URL: http://arxiv.org/abs/2405.16493v2
- Date: Wed, 30 Oct 2024 16:58:25 GMT
- Title: Flow Snapshot Neurons in Action: Deep Neural Networks Generalize to Biological Motion Perception
- Authors: Shuangpeng Han, Ziyu Wang, Mengmi Zhang
- Abstract summary: Biological motion perception (BMP) refers to humans' ability to perceive and recognize the actions of living beings solely from their motion patterns.
We propose the Motion Perceiver (MP), which relies on patch-level optical flows from video clips as inputs.
MP outperforms all existing AI models with a maximum improvement of 29% in top-1 action recognition accuracy.
- Score: 6.359236783105098
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Biological motion perception (BMP) refers to humans' ability to perceive and recognize the actions of living beings solely from their motion patterns, sometimes as minimal as those depicted on point-light displays. While humans excel at these tasks without any prior training, current AI models struggle with poor generalization performance. To close this research gap, we propose the Motion Perceiver (MP). MP solely relies on patch-level optical flows from video clips as inputs. During training, it learns prototypical flow snapshots through a competitive binding mechanism and integrates invariant motion representations to predict action labels for the given video. During inference, we evaluate the generalization ability of all AI models and humans on 62,656 video stimuli spanning 24 BMP conditions using point-light displays in neuroscience. Remarkably, MP outperforms all existing AI models with a maximum improvement of 29% in top-1 action recognition accuracy on these conditions. Moreover, we benchmark all AI models in point-light displays of two standard video datasets in computer vision. MP also demonstrates superior performance in these cases. More interestingly, via psychophysics experiments, we found that MP recognizes biological movements in a way that aligns with human behaviors. Our data and code are available at https://github.com/ZhangLab-DeepNeuroCogLab/MotionPerceiver.
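A minimal sketch may help make the described pipeline concrete. The module below is a toy stand-in, not the authors' implementation: the name FlowSnapshotBinder, the prototype count, the flow-patch shape, and the softmax form of the competitive binding are all assumptions; the actual code lives in the linked repository.
```python
# Toy Motion-Perceiver-style model: patch-level optical flows are softly
# bound to learnable "flow snapshot" prototypes via a competitive (softmax
# over prototypes) assignment, then pooled over time for action
# classification. Shapes and names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowSnapshotBinder(nn.Module):
    def __init__(self, num_prototypes=16, patch_dim=2 * 8 * 8, hidden=128, num_actions=10):
        super().__init__()
        self.encode = nn.Linear(patch_dim, hidden)            # embed each flow patch
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, hidden))
        self.classify = nn.Linear(hidden, num_actions)

    def forward(self, flow_patches):
        # flow_patches: (batch, time, patches, patch_dim) of optical-flow values
        z = self.encode(flow_patches)                          # (B, T, P, H)
        # Competitive binding: patches compete to be explained by a
        # prototype; softmax over the prototype axis keeps it differentiable.
        logits = torch.einsum('btph,kh->btpk', z, self.prototypes)
        assign = F.softmax(logits, dim=-1)                     # (B, T, P, K)
        # Prototype-weighted summary, pooled over patches and time
        bound = torch.einsum('btpk,kh->btph', assign, self.prototypes)
        motion_repr = bound.mean(dim=(1, 2))
        return self.classify(motion_repr)

model = FlowSnapshotBinder()
flows = torch.randn(4, 16, 49, 2 * 8 * 8)  # 4 clips, 16 frames, 49 patches
print(model(flows).shape)                   # torch.Size([4, 10])
```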
Related papers
- Non-verbal Real-time Human-AI Interaction in Constrained Robotic Environments [6.623088068354071]
We study the debate regarding the statistical fidelity of AI-generated data compared to human-generated data in the context of non-verbal communication using full-body motion.
We introduce the first framework that generates natural non-verbal interaction between a human and an AI in real time from 2D body keypoints.
Our results demonstrate that statistically distinguishable differences persist between human and AI motion.
arXiv Detail & Related papers (2026-03-02T12:38:43Z) - EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents [85.77432303199176]
We propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones.
Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes.
Based on the collected data, we empower three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models that output metric-scale, world-space-aligned humans and scenes; physics-based character animation, where we show that our data can be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via …
arXiv Detail & Related papers (2026-02-26T16:53:41Z) - World Action Models are Zero-shot Policies [111.91938055103633]
We introduce DreamZero, a World Action Model (WAM) built upon a pretrained video diffusion backbone.
By jointly modeling video and action, DreamZero learns diverse skills effectively from heterogeneous robot data.
We demonstrate two forms of cross-embodiment transfer: video-only demonstrations from other robots or humans yield a relative improvement of over 42% on unseen task performance.
arXiv Detail & Related papers (2026-02-17T15:04:02Z) - Physical Autoregressive Model for Robotic Manipulation without Action Pretraining [65.8971623698511]
We build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR).
PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining.
Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task.
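A toy rollout can illustrate the autoregressive idea. Everything here (the GRU stand-in for the video model, the linear action head, the dimensions) is an assumption, not PAR's architecture:
```python
# Sketch of a physical-autoregressive rollout: a frame predictor rolls
# video latents forward, and a separate head reads an action out of each
# predicted latent, so no action pretraining is needed for the dynamics.
import torch
import torch.nn as nn

latent_dim, action_dim, horizon = 64, 7, 8

frame_model = nn.GRUCell(latent_dim, latent_dim)   # stand-in for a video AR model
action_head = nn.Linear(latent_dim, action_dim)    # decodes actions from latents

z = torch.randn(1, latent_dim)   # latent of the current observation
h = torch.zeros(1, latent_dim)
actions = []
for _ in range(horizon):
    h = frame_model(z, h)            # autoregressive step: predict next latent frame
    z = h                            # feed the prediction back in
    actions.append(action_head(z))   # action inferred from imagined dynamics
print(torch.stack(actions).shape)    # (horizon, 1, action_dim)
```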
arXiv Detail & Related papers (2025-08-13T13:54:51Z) - Recognizing Actions from Robotic View for Natural Human-Robot Interaction [52.00935005918032]
Natural Human-Robot Interaction (N-HRI) requires robots to recognize human actions at varying distances and states, regardless of whether the robot itself is in motion or stationary.
Existing benchmarks fail to address the unique complexities of N-HRI due to limited data, modalities, task categories, and diversity of subjects and environments.
We introduce a large-scale dataset (Action from Robotic View) for the perception-centric robotic views prevalent in mobile service robots.
arXiv Detail & Related papers (2025-07-30T09:48:34Z) - Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers [1.1031714356680165]
Human vision is a highly active process driven by gaze, which directs attention and fixation to task-relevant regions.
In this work, we explore how incorporating human-like active gaze into robotic policies can enhance both efficiency and performance.
arXiv Detail & Related papers (2025-07-21T17:44:10Z) - Ego4o: Egocentric Human Motion Capture and Understanding from Multi-Modal Input [62.51283548975632]
This work focuses on tracking and understanding human motion using consumer wearable devices, such as VR/AR headsets, smart glasses, cellphones, and smartwatches.
We present Ego4o (o for omni), a new framework for simultaneous human motion capture and understanding from multi-modal egocentric inputs.
arXiv Detail & Related papers (2025-04-11T11:18:57Z) - SIM: Surface-based fMRI Analysis for Inter-Subject Multimodal Decoding from Movie-Watching Experiments [9.786770726122436]
Current AI frameworks for brain decoding and encoding typically train and test models within the same datasets.
A key obstacle to model generalisation is the degree of variability of inter-subject cortical organisation.
In this paper, we address this through the use of surface vision transformers, which build a generalisable model of cortical functional dynamics.
arXiv Detail & Related papers (2025-01-27T20:05:17Z) - Machine Learning Modeling for Multi-order Human Visual Motion Processing [5.043066132820344]
This research aims to develop machines that learn to perceive visual motion as humans do.
Our model architecture mimics the cortical V1-MT motion processing pathway.
We trained our dual-pathway model on novel motion datasets with varying material properties of moving objects.
arXiv Detail & Related papers (2025-01-22T11:41:41Z) - Object segmentation from common fate: Motion energy processing enables human-like zero-shot generalization to random dot stimuli [10.978614683038758]
We evaluate a broad range of optical flow models and a neuroscience-inspired motion energy model for zero-shot figure-ground segmentation.
We find that a cross-section of 40 deep optical flow models trained on different datasets struggles to estimate motion patterns in random-dot videos.
The neuroscience-inspired motion energy model, in contrast, achieves the human-like zero-shot generalization to random-dot stimuli that current computer vision models lack.
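A small sketch can show why random-dot stimuli are hard: the figure is invisible in any single frame, and only "common fate" motion reveals it. The velocity-matched correlation below is a toy stand-in for a motion energy detector, not the paper's model; all sizes and velocities are assumptions:
```python
# Random-dot "common fate" stimulus: dots inside a hidden square drift
# right while the background drifts left. Each pixel is scored by how well
# frame t+1 matches frame t shifted by the figure's velocity.
import numpy as np

rng = np.random.default_rng(0)
H = W = 64
frames = 8
dots = rng.random((H, W)) < 0.15            # binary random-dot field
fig = np.zeros((H, W), bool)
fig[16:48, 16:48] = True                    # hidden square figure

video = []
bg, fgd = dots.copy(), dots.copy()
for t in range(frames):
    video.append(np.where(fig, fgd, bg).astype(float))
    bg = np.roll(bg, shift=-1, axis=1)      # background drifts left
    fgd = np.roll(fgd, shift=1, axis=1)     # figure drifts right
video = np.stack(video)

# Agreement with the figure's velocity (+1 px/frame rightward)
score = np.mean([video[t + 1] * np.roll(video[t], 1, axis=1)
                 for t in range(frames - 1)], axis=0)
print('mean score inside figure: ', score[fig].mean())   # ~ dot density
print('mean score outside figure:', score[~fig].mean())  # ~ density squared
```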
arXiv Detail & Related papers (2024-11-03T09:59:45Z) - Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction.
Experimental results demonstrate that MPI improves on the previous state-of-the-art by 10% to 64% on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z) - GiMeFive: Towards Interpretable Facial Emotion Classification [1.1468563069298348]
Deep convolutional neural networks have been shown to successfully recognize facial emotions.
We propose GiMeFive, a model with built-in interpretations via layer activations and gradient-weighted class activation mapping.
Empirical results show that our model outperforms previous methods in accuracy.
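Gradient-weighted class activation mapping is a standard interpretation technique; a minimal sketch on a placeholder CNN (not GiMeFive itself, and with an assumed five-class head) looks like this:
```python
# Grad-CAM sketch: channel-wise gradients of the top class score weight
# the last conv activations into a coarse saliency map.
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())
head = nn.Linear(8, 5)                         # 5 emotion classes, illustrative

x = torch.randn(1, 1, 48, 48)
feats = conv(x)                                # (1, 8, 48, 48)
feats.retain_grad()                            # keep gradients on activations
logits = head(feats.mean(dim=(2, 3)))          # global-average-pooled classifier
logits[0, logits.argmax()].backward()          # gradient of the top class score

weights = feats.grad.mean(dim=(2, 3), keepdim=True)   # per-channel importance
cam = F.relu((weights * feats).sum(dim=1))            # (1, 48, 48) saliency map
print(cam.shape)
```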
arXiv Detail & Related papers (2024-02-24T00:37:37Z) - Neural feels with neural fields: Visuo-tactile perception for in-hand manipulation [57.60490773016364]
We combine vision and touch sensing on a multi-fingered hand to estimate an object's pose and shape during in-hand manipulation.
Our method, NeuralFeels, encodes object geometry by learning a neural field online and jointly tracks it by optimizing a pose graph problem.
Our results demonstrate that touch, at the very least, refines and, at the very best, disambiguates visual estimates during in-hand manipulation.
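A minimal sketch of the neural-field half of the idea, with an analytic sphere standing in for fused vision/touch samples; pose-graph tracking is omitted, and nothing here is NeuralFeels' actual code:
```python
# Fit a small MLP online as a signed distance function (SDF) of the object,
# supervised with points whose ground-truth distance comes from a sphere.
import torch
import torch.nn as nn

sdf = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, 1))
opt = torch.optim.Adam(sdf.parameters(), lr=1e-3)

for step in range(500):                            # "online" updates as data arrives
    pts = torch.randn(256, 3)
    target = pts.norm(dim=1, keepdim=True) - 0.5   # exact SDF of a radius-0.5 sphere
    loss = ((sdf(pts) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

probe = torch.tensor([[0.5, 0.0, 0.0]])            # a point on the surface
print('sdf at surface point (should be near 0):', sdf(probe).item())
```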
arXiv Detail & Related papers (2023-12-20T22:36:37Z) - What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z) - Universal Humanoid Motion Representations for Physics-Based Control [71.46142106079292]
We present a universal motion representation that encompasses a comprehensive range of motor skills for physics-based humanoid control.
We first learn a motion imitator that can reproduce the full range of human motion in a large, unstructured motion dataset.
We then create our motion representation by distilling skills directly from the imitator.
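A toy sketch of the distillation step, assuming a frozen linear "teacher" as the imitator and arbitrary dimensions; the real system distills a physics-based controller, which this does not attempt:
```python
# Distilling skills from an imitator: the frozen teacher's actions are
# compressed through a low-dimensional latent that a decoder must
# reconstruct from, yielding a compact motion representation.
import torch
import torch.nn as nn

state_dim, action_dim, latent_dim = 32, 12, 8
teacher = nn.Linear(state_dim, action_dim)         # stand-in imitator
for p in teacher.parameters():
    p.requires_grad_(False)

encoder = nn.Linear(state_dim + action_dim, latent_dim)
decoder = nn.Linear(state_dim + latent_dim, action_dim)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

for step in range(300):
    s = torch.randn(128, state_dim)
    a_teacher = teacher(s)                           # skill to be distilled
    z = encoder(torch.cat([s, a_teacher], dim=1))    # compress into a latent skill
    a_student = decoder(torch.cat([s, z], dim=1))    # reconstruct from state + latent
    loss = ((a_student - a_teacher) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print('distillation loss:', loss.item())
```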
arXiv Detail & Related papers (2023-10-06T20:48:43Z) - Modelling Human Visual Motion Processing with Trainable Motion Energy
Sensing and a Self-attention Network [1.9458156037869137]
We propose an image-computable model of human motion perception by bridging the gap between biological and computer vision models.
This model architecture aims to capture the computations in V1-MT, the core structure for motion perception in the biological visual system.
In silico neurophysiology reveals that our model's unit responses are similar to mammalian neural recordings regarding motion pooling and speed tuning.
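A sketch of the kind of motion energy unit such V1-MT models build on, in the classic Adelson-Bergen style: quadrature spatiotemporal filters whose squared outputs sum to a direction-selective energy. Frequencies and grid sizes here are illustrative, not fitted to the paper:
```python
# A rightward-drifting grating should excite the rightward-tuned motion
# energy unit far more than the leftward-tuned one.
import numpy as np

T, X = 16, 32
t = np.arange(T)[:, None]
x = np.arange(X)[None, :]
ft, fx = 0.1, 0.1                                 # temporal/spatial frequency (cycles/sample)

def energy(signal, direction):
    # direction=+1 tunes the unit to rightward drift, -1 to leftward
    phase = 2 * np.pi * (fx * x - direction * ft * t)
    even = np.sum(signal * np.cos(phase)) ** 2    # quadrature pair of filters
    odd = np.sum(signal * np.sin(phase)) ** 2
    return even + odd

stim = np.cos(2 * np.pi * (fx * x - ft * t))      # rightward-drifting grating
print('rightward energy:', energy(stim, +1.0))
print('leftward  energy:', energy(stim, -1.0))
```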
arXiv Detail & Related papers (2023-05-16T04:16:07Z) - HumanMAC: Masked Motion Completion for Human Motion Prediction [62.279925754717674]
Human motion prediction is a classical problem in computer vision and computer graphics.
Previous efforts achieve strong empirical performance based on an encoding-decoding style.
In this paper, we propose a novel framework from a new perspective.
arXiv Detail & Related papers (2023-02-07T18:34:59Z) - High-Fidelity Neural Human Motion Transfer from Monocular Video [71.75576402562247]
Video-based human motion transfer creates video animations of humans following a source motion.
We present a new framework which performs high-fidelity and temporally-consistent human motion transfer with natural pose-dependent non-rigid deformations.
In experiments, our method significantly outperforms the state of the art in terms of video realism.
arXiv Detail & Related papers (2020-12-20T16:54:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.