Beyond FVD: Enhanced Evaluation Metrics for Video Generation Quality
- URL: http://arxiv.org/abs/2410.05203v2
- Date: Tue, 8 Oct 2024 17:46:12 GMT
- Title: Beyond FVD: Enhanced Evaluation Metrics for Video Generation Quality
- Authors: Ge Ya Luo, Gian Mario Favero, Zhi Hao Luo, Alexia Jolicoeur-Martineau, Christopher Pal
- Abstract summary: The Fréchet Video Distance (FVD) is a widely adopted metric for evaluating video generation distribution quality.
Our analysis reveals three significant limitations: (1) the non-Gaussianity of the Inflated 3D Convnet (I3D) feature space; (2) the insensitivity of I3D features to temporal distortions; and (3) the impractical sample sizes required for reliable estimation.
- Score: 8.068194154084967
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Fréchet Video Distance (FVD) is a widely adopted metric for evaluating video generation distribution quality. However, its effectiveness relies on critical assumptions. Our analysis reveals three significant limitations: (1) the non-Gaussianity of the Inflated 3D Convnet (I3D) feature space; (2) the insensitivity of I3D features to temporal distortions; (3) the impractical sample sizes required for reliable estimation. These findings undermine FVD's reliability and show that FVD falls short as a standalone metric for video generation evaluation. After extensive analysis of a wide range of metrics and backbone architectures, we propose JEDi, the JEPA Embedding Distance, based on features derived from a Joint Embedding Predictive Architecture, measured using Maximum Mean Discrepancy with a polynomial kernel. Our experiments on multiple open-source datasets show clear evidence that it is a superior alternative to the widely used FVD metric, requiring only 16% of the samples to reach its steady value, while increasing alignment with human evaluation by 34%, on average.
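The abstract names Maximum Mean Discrepancy (MMD) with a polynomial kernel as the distance underlying JEDi. The sketch below illustrates one common form of that computation, the unbiased squared-MMD estimator, on two arrays of feature vectors. It is an illustration under stated assumptions, not the authors' implementation: the function names, the kernel parameters (degree, gamma, coef0), and the random arrays standing in for backbone embeddings are all hypothetical, since the abstract does not specify them.

```python
import numpy as np


def polynomial_kernel(x, y, degree=2, gamma=None, coef0=1.0):
    """Polynomial kernel k(a, b) = (gamma * <a, b> + coef0) ** degree.

    The defaults are placeholder assumptions, not settings from the paper.
    """
    if gamma is None:
        gamma = 1.0 / x.shape[1]  # scale by feature dimension
    return (gamma * (x @ y.T) + coef0) ** degree


def mmd2_unbiased(x, y, **kernel_kwargs):
    """Unbiased estimate of squared MMD between samples x (m, d) and y (n, d)."""
    m, n = x.shape[0], y.shape[0]
    k_xx = polynomial_kernel(x, x, **kernel_kwargs)
    k_yy = polynomial_kernel(y, y, **kernel_kwargs)
    k_xy = polynomial_kernel(x, y, **kernel_kwargs)
    # Drop the diagonal terms k(x_i, x_i) so the within-set averages are unbiased.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * k_xy.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random vectors stand in for per-video embeddings from a feature backbone.
    real_features = rng.normal(size=(200, 8))
    generated_features = rng.normal(loc=1.0, size=(200, 8))
    print("MMD^2 estimate:", mmd2_unbiased(real_features, generated_features))
```

Unlike a Fréchet distance, this estimator does not fit a Gaussian to either feature set, which is relevant to the first limitation the abstract lists for FVD.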
Related papers
- PF3plat: Pose-Free Feed-Forward 3D Gaussian Splatting [54.7468067660037]
PF3plat sets a new state-of-the-art across all benchmarks, supported by comprehensive ablation studies validating our design choices.
Our framework capitalizes on fast speed, scalability, and high-quality 3D reconstruction and view synthesis capabilities of 3DGS.
arXiv Detail & Related papers (2024-10-29T15:28:15Z) - CatFree3D: Category-agnostic 3D Object Detection with Diffusion [63.75470913278591]
We introduce a novel pipeline that decouples 3D detection from 2D detection and depth prediction.
We also introduce the Normalised Hungarian Distance (NHD) metric for an accurate evaluation of 3D detection results.
arXiv Detail & Related papers (2024-08-22T22:05:57Z) - UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z) - On the Content Bias in Fréchet Video Distance [42.717821654744796]
Fréchet Video Distance (FVD) is a prominent metric for evaluating video generation models.
In this paper, we aim to explore the extent of FVD's bias toward per-frame quality over temporal realism.
We show that FVD with features extracted from the recent large-scale self-supervised video models is less biased toward image quality.
arXiv Detail & Related papers (2024-04-18T17:59:58Z) - Calib3D: Calibrating Model Preferences for Reliable 3D Scene Understanding [55.32861154245772]
Calib3D is a pioneering effort to benchmark and scrutinize the reliability of 3D scene understanding models.
We evaluate 28 state-of-the-art models across 10 diverse 3D datasets.
We introduce DeptS, a novel depth-aware scaling approach aimed at enhancing 3D model calibration.
arXiv Detail & Related papers (2024-03-25T17:59:59Z) - STREAM: Spatio-TempoRal Evaluation and Analysis Metric for Video Generative Models [6.855409699832414]
Video generative models struggle to generate even short video clips.
Current video evaluation metrics are simple adaptations of image metrics by switching the embeddings with video embedding networks.
We propose STREAM, a new video evaluation metric uniquely designed to independently evaluate spatial and temporal aspects.
arXiv Detail & Related papers (2024-01-30T08:18:20Z) - Activating Frequency and ViT for 3D Point Cloud Quality Assessment
without Reference [0.49157446832511503]
We propose no-reference quality metric of a given 3D-PC.
To map the input attributes to quality score, we use a light-weight hybrid deep model; combined of Deformable Convolutional Network (DCN) and Vision Transformers (ViT)
The results show that our approach outperforms state-of-the-art NR-PCQA measures and even some FR-PCQA on PointXR.
arXiv Detail & Related papers (2023-12-10T19:13:34Z) - Towards Domain Generalization for Multi-view 3D Object Detection in
Bird-Eye-View [11.958753088613637]
We first analyze the causes of the domain gap for the MV3D-Det task.
To acquire a robust depth prediction, we propose to decouple the depth estimation from intrinsic parameters of the camera.
We modify the focal length values to create multiple pseudo-domains and construct an adversarial training loss to encourage the feature representation to be more domain-agnostic.
arXiv Detail & Related papers (2023-03-03T02:59:13Z) - From 2D to 3D: Re-thinking Benchmarking of Monocular Depth Prediction [80.67873933010783]
We argue that MDP is currently witnessing benchmark over-fitting and relying on metrics that are only partially helpful to gauge the usefulness of the predictions for 3D applications.
This limits the design and development of novel methods that are truly aware of - and improving towards estimating - the 3D structure of the scene rather than optimizing 2D-based distances.
We propose a set of metrics well suited to evaluate the 3D geometry of MDP approaches and a novel indoor benchmark, RIO-D3D, crucial for the proposed evaluation methodology.
arXiv Detail & Related papers (2022-03-15T17:50:54Z) - Uncertainty-Aware Camera Pose Estimation from Points and Lines [101.03675842534415]
Perspective-n-Point-and-Line (PnPL) aims at fast, accurate, and robust camera localization with respect to a 3D model from 2D-3D feature coordinates.
arXiv Detail & Related papers (2021-07-08T15:19:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.