The DEVIL is in the Details: A Diagnostic Evaluation Benchmark for Video
Inpainting
- URL: http://arxiv.org/abs/2105.05332v1
- Date: Tue, 11 May 2021 20:13:53 GMT
- Title: The DEVIL is in the Details: A Diagnostic Evaluation Benchmark for Video
Inpainting
- Authors: Ryan Szeto, Jason J. Corso
- Abstract summary: We propose the Diagnostic Evaluation of Video Inpainting on Landscapes (DEVIL) benchmark, which consists of two contributions.
Our challenging benchmark enables more insightful analysis into video inpainting methods and serves as an invaluable diagnostic tool for the field.
- Score: 43.90848669491335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantitative evaluation has increased dramatically among recent video
inpainting work, but the video and mask content used to gauge performance has
received relatively little attention. Although attributes such as camera and
background scene motion inherently change the difficulty of the task and affect
methods differently, existing evaluation schemes fail to control for them,
thereby providing minimal insight into inpainting failure modes. To address
this gap, we propose the Diagnostic Evaluation of Video Inpainting on
Landscapes (DEVIL) benchmark, which consists of two contributions: (i) a novel
dataset of videos and masks labeled according to several key inpainting failure
modes, and (ii) an evaluation scheme that samples slices of the dataset
characterized by a fixed content attribute, and scores performance on each
slice according to reconstruction, realism, and temporal consistency quality.
By revealing systematic changes in performance induced by particular
characteristics of the input content, our challenging benchmark enables more
insightful analysis into video inpainting methods and serves as an invaluable
diagnostic tool for the field. Our code is available at
https://github.com/MichiganCOG/devil .
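In essence, the evaluation scheme partitions the benchmark by a content attribute (e.g., camera motion or background scene motion) and scores each slice separately on reconstruction, realism, and temporal consistency. The sketch below illustrates that idea; the function, attribute, and metric names are hypothetical placeholders, not the actual DEVIL API.
```python
# Minimal sketch of attribute-sliced evaluation in the spirit of DEVIL.
# Attribute names and metric functions are illustrative placeholders,
# not the benchmark's actual interface.
from collections import defaultdict
from statistics import mean

def evaluate_by_attribute(examples, attribute, metric_fns):
    """Group inpainting results by a content attribute and report the
    mean of each quality metric within every slice."""
    slices = defaultdict(list)
    for ex in examples:
        slices[ex[attribute]].append(ex)  # e.g. "low" vs. "high" camera motion

    return {
        value: {
            name: mean(fn(ex["prediction"], ex["target"]) for ex in members)
            for name, fn in metric_fns.items()
        }
        for value, members in slices.items()
    }

# Hypothetical usage (psnr, realism_score, warp_error are assumed helpers):
# report = evaluate_by_attribute(
#     examples,
#     attribute="camera_motion",
#     metric_fns={
#         "reconstruction": psnr,              # frame-wise fidelity
#         "realism": realism_score,            # perceptual realism proxy
#         "temporal_consistency": warp_error,  # flow-warped frame consistency
#     },
# )
```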
Related papers
- Detecting Inpainted Video with Frequency Domain Insights [0.0]
We propose the Frequency Domain Insights Network (FDIN), which significantly enhances detection accuracy.
Evaluations on public datasets demonstrate that FDIN achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-09-21T01:51:07Z)
- VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs [64.60035916955837]
VANE-Bench is a benchmark designed to assess the proficiency of Video-LMMs in detecting anomalies and inconsistencies in videos.
Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models.
We evaluate nine existing open- and closed-source Video-LMMs on this benchmark and find that most of them struggle to identify the subtle anomalies.
arXiv Detail & Related papers (2024-06-14T17:59:01Z)
- Glitch in the Matrix: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization [20.46053083071752]
We propose and benchmark a new dataset, Localized Audio Visual DeepFake (LAV-DF).
LAV-DF consists of strategic content-driven audio, visual and audio-visual manipulations.
The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture.
arXiv Detail & Related papers (2023-05-03T08:48:45Z)
- Video Mask Transfiner for High-Quality Video Instance Segmentation [102.50936366583106]
Video Mask Transfiner (VMT) is capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.
Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction.
We compare VMT with the most recent state-of-the-art methods on HQ-YTVIS, as well as on the YouTube-VIS, OVIS, and BDD100K MOTS benchmarks.
arXiv Detail & Related papers (2022-07-28T11:13:37Z)
- Transfer of Representations to Video Label Propagation: Implementation Factors Matter [31.030799003595522]
We study the impact of important implementation factors in feature extraction and label propagation.
We show that augmenting video-based correspondence cues with still-image-based ones can further improve performance.
We hope that this study will help to improve evaluation practices and better inform future research directions in temporal correspondence.
arXiv Detail & Related papers (2022-03-10T18:58:22Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about the specified spatial or temporal augmentations, and in doing so also achieves state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
- Unified Quality Assessment of In-the-Wild Videos with Mixed Datasets Training [20.288424566444224]
We focus on automatically assessing the quality of in-the-wild videos in computer vision applications.
To improve the performance of quality assessment models, we borrow intuitions from human perception.
We propose a mixed datasets training strategy for training a single VQA model with multiple datasets.
arXiv Detail & Related papers (2020-11-09T09:22:57Z)
- Coherent Loss: A Generic Framework for Stable Video Segmentation [103.78087255807482]
We investigate how a jittering artifact degrades the visual quality of video segmentation results.
We propose a Coherent Loss with a generic framework to enhance the performance of a neural network against jittering artifacts.
arXiv Detail & Related papers (2020-10-25T10:48:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.