Related papers: EditBoard: Towards A Comprehensive Evaluation Benchmark for Text-based Video Editing Models

URL: http://arxiv.org/abs/2409.09668v1
Date: Sun, 15 Sep 2024 08:43:18 GMT
Title: EditBoard: Towards A Comprehensive Evaluation Benchmark for Text-based Video Editing Models
Authors: Yupeng Chen, Penglin Chen, Xiaoyu Zhang, Yixian Huang, Qian Xie,
Abstract summary: Text-based video editing has emerged as a promising field, enabling precise modifications to videos based on text prompts. Existing evaluations are limited and inconsistent, typically summarizing overall performance with a single score. We propose EditBoard, the first comprehensive evaluation benchmark for text-based video editing models.
Score: 16.045012576543474
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid development of diffusion models has significantly advanced AI-generated content (AIGC), particularly in Text-to-Image (T2I) and Text-to-Video (T2V) generation. Text-based video editing, leveraging these generative capabilities, has emerged as a promising field, enabling precise modifications to videos based on text prompts. Despite the proliferation of innovative video editing models, there is a conspicuous lack of comprehensive evaluation benchmarks that holistically assess these models' performance across various dimensions. Existing evaluations are limited and inconsistent, typically summarizing overall performance with a single score, which obscures models' effectiveness on individual editing tasks. To address this gap, we propose EditBoard, the first comprehensive evaluation benchmark for text-based video editing models. EditBoard encompasses nine automatic metrics across four dimensions, evaluating models on four task categories and introducing three new metrics to assess fidelity. This task-oriented benchmark facilitates objective evaluation by detailing model performance and providing insights into each model's strengths and weaknesses. By open-sourcing EditBoard, we aim to standardize evaluation and advance the development of robust video editing models.

Related papers

UniVBench: Towards Unified Evaluation for Video Foundation Models [29.73247324829126]
Video foundation models aim to integrate video understanding, generation, editing, and instruction following within a single framework. We introduce UniVBench, a benchmark for evaluating video foundation models across four core abilities. Our benchmark substantially expands the complexity of evaluation by incorporating 200 high-quality, diverse and multi-shot videos.
arXiv Detail & Related papers (2026-02-25T12:08:53Z)
ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning [57.08352504712699]
Video unified models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing. We introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. We propose ReViSE, a framework that unifies generation and evaluation within a single architecture.
arXiv Detail & Related papers (2025-12-10T18:57:09Z)
Q-Save: Towards Scoring and Attribution for Generated Video Evaluation [65.83319736145869]
We present Q-Save, a new benchmark dataset and model for holistic evaluation of AI-generated video (AIGV) quality. The dataset contains nearly 10,000 videos, each annotated with a scalar mean opinion score (MOS) and fine-grained attribution labels. We propose a unified evaluation model that jointly performs quality scoring and attribution-based explanation.
arXiv Detail & Related papers (2025-11-24T07:00:21Z)
Text-Driven Reasoning Video Editing via Reinforcement Learning on Digital Twin Representations [8.479321655643195]
We introduce reasoning video editing, a task where video editing models must interpret implicit queries through multi-hop reasoning to infer editing targets before executing modifications. RIVER decouples reasoning from generation through digital twin representations of video content that preserve spatial relationships, temporal trajectories, and semantic attributes. RIVER training uses reinforcement learning with rewards that evaluate reasoning accuracy and generation quality.
arXiv Detail & Related papers (2025-11-18T03:37:19Z)
DualEdit: Dual Editing for Knowledge Updating in Vision-Language Models [26.762431651154607]
We propose DualEdit, an editor that modifies both textual and visual modalities at their respective key layers. We evaluate DualEdit across multiple VLM backbones and benchmark datasets, demonstrating its superiority over state-of-the-art VLM editing baselines.
arXiv Detail & Related papers (2025-06-16T16:04:16Z)
TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs [54.44479359918971]
We introduce TDVE-DB, a large-scale benchmark dataset for text-driven video editing. TDVE-DB consists of 3,857 edited videos generated from 12 diverse models across 8 editing categories. We propose TDVE-Assessor, a novel VQA model specifically designed for text-driven video editing assessment.
arXiv Detail & Related papers (2025-05-26T05:47:09Z)
DAPE: Dual-Stage Parameter-Efficient Fine-Tuning for Consistent Video Editing with Diffusion Models [1.972901110298768]
We propose DAPE, a high-quality yet cost-effective two-stage parameter-efficient fine-tuning framework for video editing. In the first stage, we design an efficient norm-tuning method to enhance temporal consistency in generated videos. The second stage introduces a vision-friendly adapter to improve visual quality.
arXiv Detail & Related papers (2025-05-11T17:08:50Z)
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing [84.16442052968615]
We introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning categories: Temporal, Causal, Spatial, and Logical Reasoning. We conduct experiments evaluating nine prominent visual editing models, comprising both open-source and proprietary models.
arXiv Detail & Related papers (2025-04-03T17:59:56Z)
FiVE: A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models [17.788970036356297]
We introduce FiVE, a Fine-grained Video Editing Benchmark for evaluating emerging diffusion and rectified flow models. Our benchmark includes 74 real-world videos and 26 generated videos, featuring 6 fine-grained editing types, 420 object-level editing prompt pairs, and their corresponding masks. We evaluate five diffusion-based and two RF-based editing methods on our FiVE benchmark using 15 metrics, covering background preservation, text-video similarity, temporal consistency, video quality, and runtime.
arXiv Detail & Related papers (2025-03-17T19:47:41Z)
DreamOmni: Unified Image Generation and Editing [51.45871494724542]
We introduce DreamOmni, a unified model for image generation and editing. For training, DreamOmni jointly trains T2I generation and downstream tasks. This collaboration significantly boosts editing performance.
arXiv Detail & Related papers (2024-12-22T17:17:28Z)
SPAgent: Adaptive Task Decomposition and Model Selection for General Video Generation and Editing [50.098005973600024]
We propose a novel video generation and editing system powered by our Semantic Planning Agent (SPAgent). SPAgent bridges the gap between diverse user intents and the effective utilization of existing generative models. Experimental results demonstrate that the SPAgent effectively coordinates models to generate or edit videos.
arXiv Detail & Related papers (2024-11-28T08:07:32Z)
VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement [63.4357918830628]
VideoRepair is a model-agnostic, training-free video refinement framework. It identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback. VideoRepair substantially outperforms recent baselines across various text-video alignment metrics.
arXiv Detail & Related papers (2024-11-22T18:31:47Z)
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models [111.5892290894904]
VBench is a benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions. We provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception. VBench++ supports evaluating text-to-video and image-to-video.
arXiv Detail & Related papers (2024-11-20T17:54:41Z)
Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation. Our approach can be applied to existing datasets by automatically generating hard negative test captions. Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z)
E-Bench: Subjective-Aligned Benchmark Suite for Text-Driven Video Editing Quality Assessment [6.627422081288281]
E-Bench is a benchmark suite tailored to the assessment of text-driven video editing. It includes E-Bench DB, a video quality assessment (VQA) database for video editing. E-Bench QA focuses on the text-video alignment and the relevance modeling between source and edited videos.
arXiv Detail & Related papers (2024-08-21T09:49:32Z)
Learning Action and Reasoning-Centric Image Editing from Videos and Simulations [45.637947364341436]
The AURORA dataset is a collection of high-quality training data, human-annotated and curated from videos and simulation engines. We evaluate an AURORA-finetuned model on a new expert-curated benchmark covering 8 diverse editing tasks. Our model significantly outperforms previous editing models as judged by human raters.
arXiv Detail & Related papers (2024-07-03T19:36:33Z)
Towards A Better Metric for Text-to-Video Generation [102.16250512265995]
Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. We introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore). This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts.
arXiv Detail & Related papers (2024-01-15T15:42:39Z)
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge the large conditional generative models from the simple metrics since these models are often trained on very large datasets with multi-aspect abilities. Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation. Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z)
EditEval: An Instruction-Based Benchmark for Text Improvements [73.5918084416016]
This work presents EditEval, an instruction-based benchmark and evaluation suite for automatic evaluation of editing capabilities. We evaluate several pre-trained models, which shows that InstructGPT and PEER perform the best, but that most baselines fall below the supervised SOTA. Our analysis shows that commonly used metrics for editing tasks do not always correlate well, and that optimization for prompts with the highest performance does not necessarily entail the strongest robustness to different models.
arXiv Detail & Related papers (2022-09-27T12:26:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.