Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos
- URL: http://arxiv.org/abs/2407.16124v1
- Date: Tue, 23 Jul 2024 02:10:50 GMT
- Title: Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos
- Authors: Jiahe Liu, Youran Qu, Qi Yan, Xiaohui Zeng, Lele Wang, Renjie Liao
- Abstract summary: We propose the Fréchet Video Motion Distance metric, which focuses on evaluating motion consistency in video generation.
Specifically, we design explicit motion features based on key point tracking, and then measure the similarity between these features via the Fréchet distance.
We carry out a large-scale human study, demonstrating that our metric effectively detects temporal noise and aligns better with human perceptions of generated video quality than existing metrics.
- Score: 13.368981834953981
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Significant advancements have been made in video generative models recently. Unlike image generation, video generation presents greater challenges, requiring not only generating high-quality frames but also ensuring temporal consistency across these frames. Despite the impressive progress, research on metrics for evaluating the quality of generated videos, especially concerning temporal and motion consistency, remains underexplored. To bridge this research gap, we propose the Fréchet Video Motion Distance (FVMD) metric, which focuses on evaluating motion consistency in video generation. Specifically, we design explicit motion features based on key point tracking, and then measure the similarity between these features via the Fréchet distance. We conduct sensitivity analysis by injecting noise into real videos to verify the effectiveness of FVMD. Further, we carry out a large-scale human study, demonstrating that our metric effectively detects temporal noise and aligns better with human perceptions of generated video quality than existing metrics. Additionally, our motion features can consistently improve the performance of Video Quality Assessment (VQA) models, indicating that our approach is also applicable to unary video quality evaluation. Code is available at https://github.com/ljh0v0/FMD-frechet-motion-distance.
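To make the pipeline concrete, here is a minimal sketch under stated assumptions: motion features are reduced to toy velocity/acceleration magnitudes of tracked key points, and the two feature sets are compared with the standard closed-form Fréchet distance between Gaussians (the same form FID uses). All function names and the feature design below are illustrative assumptions, not the repository's actual API; the paper's feature construction and its point tracker are more elaborate.

```python
import numpy as np
from scipy import linalg


def frechet_distance(mu1, sigma1, mu2, sigma2):
    # Closed-form Frechet distance between N(mu1, sigma1) and N(mu2, sigma2):
    # d^2 = ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2))
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical noise
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))


def motion_features(tracks):
    # Toy motion features from key point tracks of shape (T, N, 2):
    # per-frame speed and acceleration magnitudes of each tracked point.
    velocity = np.diff(tracks, axis=0)               # (T-1, N, 2) displacements
    accel = np.diff(velocity, axis=0)                # (T-2, N, 2)
    speed = np.linalg.norm(velocity[1:], axis=-1)    # (T-2, N), aligned with accel
    accel_mag = np.linalg.norm(accel, axis=-1)       # (T-2, N)
    return np.concatenate([speed, accel_mag], axis=-1)  # (T-2, 2N)


def fvmd_style_score(real_tracks, gen_tracks):
    # Fit one Gaussian per feature set (rows = per-frame feature vectors
    # pooled over videos), then compare the two Gaussians.
    real = np.concatenate([motion_features(t) for t in real_tracks], axis=0)
    gen = np.concatenate([motion_features(t) for t in gen_tracks], axis=0)
    mu_r, sig_r = real.mean(axis=0), np.cov(real, rowvar=False)
    mu_g, sig_g = gen.mean(axis=0), np.cov(gen, rowvar=False)
    return frechet_distance(mu_r, sig_r, mu_g, sig_g)
```

The sketch assumes every video is tracked over the same fixed set of N query points so that per-frame feature vectors share a dimension; any point tracker producing (T, N, 2) trajectories could supply the `tracks` input.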
Related papers
- Perceptual Video Quality Assessment: A Survey [63.61214597655413]
Perceptual video quality assessment plays a vital role in the field of video processing.
Various subjective and objective video quality assessment studies have been conducted over the past two decades.
This survey provides an up-to-date and comprehensive review of these video quality assessment studies.
arXiv Detail & Related papers (2024-02-05T16:13:52Z)
- STREAM: Spatio-TempoRal Evaluation and Analysis Metric for Video Generative Models [6.855409699832414]
Video generative models struggle to generate even short video clips.
Current video evaluation metrics are simple adaptations of image metrics, swapping the image embedding network for a video embedding network.
We propose STREAM, a new video evaluation metric uniquely designed to independently evaluate spatial and temporal aspects.
arXiv Detail & Related papers (2024-01-30T08:18:20Z)
- VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference.
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
arXiv Detail & Related papers (2023-12-01T06:50:11Z)
- Tracking Everything Everywhere All at Once [111.00807055441028]
We present a new test-time optimization method for estimating dense and long-range motion from a video sequence.
We propose a complete and globally consistent motion representation, dubbed OmniMotion.
Our approach outperforms prior state-of-the-art methods by a large margin both quantitatively and qualitatively.
arXiv Detail & Related papers (2023-06-08T17:59:29Z)
- Saliency-Aware Spatio-Temporal Artifact Detection for Compressed Video Quality Assessment [16.49357671290058]
Compressed videos often exhibit visually annoying artifacts, known as Perceivable Encoding Artifacts (PEAs).
In this paper, we investigate the influence of four spatial PEAs (i.e. blurring, blocking, bleeding, and ringing) and two temporal PEAs (i.e. flickering and floating) on video quality.
Based on the six types of PEAs, a quality metric called Saliency-Aware Spatio-Temporal Artifacts Measurement (SSTAM) is proposed.
arXiv Detail & Related papers (2023-01-03T12:48:27Z)
- A Perceptual Quality Metric for Video Frame Interpolation [6.743340926667941]
As video frame interpolation results often exhibit unique artifacts, existing quality metrics are sometimes inconsistent with human perception when measuring these results.
Some recent deep learning-based quality metrics are shown more consistent with human judgments, but their performance on videos is compromised since they do not consider temporal information.
Our method learns perceptual features directly from videos instead of individual frames.
arXiv Detail & Related papers (2022-10-04T19:56:10Z)
- Render In-between: Motion Guided Video Synthesis for Action Interpolation [53.43607872972194]
We propose a motion-guided frame-upsampling framework that is capable of producing realistic human motion and appearance.
A novel motion model is trained to infer the non-linear skeletal motion between frames by leveraging a large-scale motion-capture dataset.
Our pipeline only requires low-frame-rate videos and unpaired human motion data but does not require high-frame-rate videos for training.
arXiv Detail & Related papers (2021-11-01T15:32:51Z)
- Coherent Loss: A Generic Framework for Stable Video Segmentation [103.78087255807482]
We investigate how a jittering artifact degrades the visual quality of video segmentation results.
We propose a Coherent Loss with a generic framework to enhance the performance of a neural network against jittering artifacts.
arXiv Detail & Related papers (2020-10-25T10:48:28Z)
- Hybrid Dynamic-static Context-aware Attention Network for Action Assessment in Long Videos [96.45804577283563]
We present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos.
We not only learn the video dynamic information but also focus on the static postures of the detected athletes in specific frames.
We combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts.
arXiv Detail & Related papers (2020-08-13T15:51:42Z)