Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos
- URL: http://arxiv.org/abs/2512.01803v2
- Date: Tue, 02 Dec 2025 23:22:22 GMT
- Title: Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos
- Authors: Xavier Thomas, Youngsun Lim, Ananya Srinivasan, Audrey Zheng, Deepti Ghadiyaram
- Abstract summary: We introduce a novel evaluation metric derived from a learned latent space of real-world human actions. Our method first captures the nuances, constraints, and temporal smoothness of real-world motion by fusing appearance-agnostic human skeletal geometry features with appearance-based features. Given a generated video, our metric quantifies its action quality by measuring the distance between its underlying representations and this learned real-world action distribution.
- Score: 4.872114804382539
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite rapid advances in video generative models, robust metrics for evaluating visual and temporal correctness of complex human actions remain elusive. Critically, existing pure-vision encoders and Multimodal Large Language Models (MLLMs) are strongly appearance-biased, lack temporal understanding, and thus struggle to discern intricate motion dynamics and anatomical implausibilities in generated videos. We tackle this gap by introducing a novel evaluation metric derived from a learned latent space of real-world human actions. Our method first captures the nuances, constraints, and temporal smoothness of real-world motion by fusing appearance-agnostic human skeletal geometry features with appearance-based features. We posit that this combined feature space provides a robust representation of action plausibility. Given a generated video, our metric quantifies its action quality by measuring the distance between its underlying representations and this learned real-world action distribution. For rigorous validation, we develop a new multi-faceted benchmark specifically designed to probe temporally challenging aspects of human action fidelity. Through extensive experiments, we show that our metric achieves substantial improvement of more than 68% compared to existing state-of-the-art methods on our benchmark, performs competitively on established external benchmarks, and has a stronger correlation with human perception. Our in-depth analysis reveals critical limitations in current video generative models and establishes a new standard for advanced research in video generation.
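The abstract only sketches the scoring mechanism, so here is a minimal, hypothetical illustration of the core idea: embed videos with some fused skeletal-plus-appearance encoder, summarize the real-action embeddings as a distribution, and score a generated video by its distance to that distribution. The single-Gaussian fit, the Mahalanobis distance, and every name below are assumptions for illustration, not the authors' implementation.

```python
# Sketch only: score a generated video by its distance to a learned
# distribution of real-action embeddings. The encoder that fuses skeletal
# geometry with appearance features is assumed to exist elsewhere and is
# stubbed here with random arrays.
import numpy as np

def fit_real_action_distribution(real_embeddings: np.ndarray):
    """Fit a Gaussian (mean, inverse covariance) over real-action embeddings of shape (N, D)."""
    mu = real_embeddings.mean(axis=0)
    cov = np.cov(real_embeddings, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])            # small ridge for numerical stability
    return mu, np.linalg.inv(cov)

def action_plausibility_distance(gen_embedding: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray) -> float:
    """Mahalanobis distance of one generated-video embedding from the real distribution."""
    diff = gen_embedding - mu
    return float(np.sqrt(diff @ cov_inv @ diff))

# Toy usage with random stand-ins for encoder outputs.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 64))              # embeddings of real action clips
gen = rng.normal(loc=0.5, size=(64,))          # embedding of one generated video
mu, cov_inv = fit_real_action_distribution(real)
print(action_plausibility_distance(gen, mu, cov_inv))   # larger => less plausible motion
```

A higher distance flags implausible or temporally inconsistent motion; the actual metric may use a richer density model or a different distance than this single-Gaussian sketch.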
Related papers
- MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation [48.84450712826316]
MSVBench is the first comprehensive benchmark featuring hierarchical scripts and reference images tailored for Multi-Shot Video generation. We propose a hybrid evaluation framework that synergizes the high-level semantic reasoning of Large Multimodal Models with the fine-grained perceptual rigor of domain-specific expert models.
arXiv Detail & Related papers (2026-02-27T12:26:34Z)
- StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars [32.75338796722652]
We propose a two-stage autoregressive adaptation and acceleration framework to adapt a high-fidelity human video diffusion model for real-time, interactive streaming. We develop a one-shot, interactive, human avatar model capable of generating both natural talking and listening behaviors with coherent gestures. Our method achieves state-of-the-art performance, surpassing existing approaches in generation quality, real-time efficiency, and interaction naturalness.
arXiv Detail & Related papers (2025-12-26T15:41:24Z)
- High-Fidelity and Long-Duration Human Image Animation with Diffusion Transformer [17.388852038062705]
We propose a diffusion transformer (DiT)-based framework which focuses on generating high-fidelity and long-duration human animation videos. First, we design a set of hybrid implicit guidance signals and a sharpness guidance factor, enabling our framework to additionally incorporate detailed facial and hand features as guidance. Next, we incorporate the time-aware position shift fusion module, modify the input format within the DiT backbone, and refer to this mechanism as the Position Shift Adaptive Module.
arXiv Detail & Related papers (2025-12-26T07:36:48Z)
- Dynamic Avatar-Scene Rendering from Human-centric Context [75.95641456716373]
We propose a Separate-then-Map (StM) strategy to bridge separately defined and optimized models. StM significantly outperforms existing state-of-the-art methods in both visual quality and rendering accuracy.
arXiv Detail & Related papers (2025-11-13T17:39:06Z)
- HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding [120.84817886550765]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z)
- GENMO: A GENeralist Model for Human MOtion [64.16188966024542]
We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework. Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals. Our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control.
arXiv Detail & Related papers (2025-05-02T17:59:55Z)
- RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism [73.38167494118746]
We propose a framework to improve the realism of motion in generated videos. We advocate for the incorporation of a retrieval mechanism during the generation phase. Our pipeline is designed to apply to any text-to-video diffusion model.
arXiv Detail & Related papers (2025-04-09T08:14:05Z)
- MoManifold: Learning to Measure 3D Human Motion via Decoupled Joint Acceleration Manifolds [20.83684434910106]
We present MoManifold, a novel human motion prior, which models plausible human motion in a continuous, high-dimensional motion space.
Specifically, we propose novel decoupled joint acceleration to model human dynamics from existing limited motion data.
Extensive experiments demonstrate that MoManifold outperforms existing SOTAs as a prior in several downstream tasks.
arXiv Detail & Related papers (2024-09-01T15:00:16Z)
- Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos [13.368981834953981]
We propose the Fréchet Video Motion Distance metric, which focuses on evaluating motion consistency in video generation.
Specifically, we design explicit motion features based on key point tracking, and then measure the similarity between these features via the Fréchet distance (a minimal sketch of this distance follows this entry).
We carry out a large-scale human study, demonstrating that our metric effectively detects temporal noise and aligns better with human perceptions of generated video quality than existing metrics.
arXiv Detail & Related papers (2024-07-23T02:10:50Z)
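The Fréchet distance referenced in this entry is the FID-style statistic widely used in generative evaluation; below is a minimal, hedged sketch over Gaussian fits to two motion-feature sets. The key-point-tracking feature extraction described in the abstract is out of scope and is stubbed with random arrays; names and shapes are illustrative.

```python
# Sketch only: Fréchet distance between Gaussians fit to two feature sets,
# e.g. motion features from real videos vs. generated videos.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Squared Fréchet distance between Gaussian fits of two (N, D) feature sets."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):              # sqrtm can return tiny imaginary noise
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1 + c2 - 2.0 * covmean))

# Toy usage with random stand-ins for tracked key-point features.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(200, 32))
gen_feats = rng.normal(loc=0.3, size=(200, 32))
print(frechet_distance(real_feats, gen_feats))   # larger => motion statistics diverge more
```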
- Aligning Human Motion Generation with Human Perceptions [51.831338643012444]
We propose a data-driven approach to bridge the gap by introducing a large-scale human perceptual evaluation dataset, MotionPercept, and a human motion critic model, MotionCritic. Our critic model offers a more accurate metric for assessing motion quality and could be readily integrated into the motion generation pipeline.
arXiv Detail & Related papers (2024-07-02T14:01:59Z)
- Enhanced Spatio-Temporal Context for Temporally Consistent Robust 3D Human Motion Recovery from Monocular Videos [5.258814754543826]
We propose a novel method for temporally consistent motion estimation from a monocular video.
Instead of using generic ResNet-like features, our method uses a body-aware feature representation and an independent per-frame pose.
Our method attains significantly lower acceleration error and outperforms the existing state-of-the-art methods (a sketch of the acceleration-error metric follows this entry).
arXiv Detail & Related papers (2023-11-20T10:53:59Z)
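"Acceleration error" in the entry above is a common smoothness metric for recovered 3D motion; the hedged sketch below uses the usual finite-difference definition (mean per-joint difference between predicted and ground-truth accelerations). The paper's exact formulation may differ, and all shapes and names are illustrative.

```python
# Sketch only: acceleration error between predicted and ground-truth 3D joint
# trajectories, approximated with second-order finite differences.
import numpy as np

def acceleration_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: arrays of shape (T, J, 3) = frames x joints x xyz."""
    accel_pred = pred[2:] - 2.0 * pred[1:-1] + pred[:-2]   # finite-difference acceleration
    accel_gt = gt[2:] - 2.0 * gt[1:-1] + gt[:-2]
    # Mean Euclidean distance between the two acceleration sequences.
    return float(np.linalg.norm(accel_pred - accel_gt, axis=-1).mean())

# Toy usage: a slightly jittery prediction yields a non-zero acceleration error.
rng = np.random.default_rng(0)
gt = rng.normal(size=(120, 24, 3))                    # 120 frames, 24 joints
pred = gt + rng.normal(scale=0.01, size=gt.shape)
print(acceleration_error(pred, gt))
```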
- Dynamic Future Net: Diversified Human Motion Generation [31.987602940970888]
Human motion modelling is crucial in many areas such as computer graphics, vision and virtual reality.
We present Dynamic Future Net, a new deep learning model that explicitly focuses on the intrinsic stochasticity of human motion dynamics.
Our model can generate a large number of high-quality motions with arbitrary duration, and visually convincing variations in both space and time.
arXiv Detail & Related papers (2020-08-25T02:31:41Z)