TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs
- URL: http://arxiv.org/abs/2505.19535v1
- Date: Mon, 26 May 2025 05:47:09 GMT
- Title: TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs
- Authors: Juntong Wang, Jiarui Wang, Huiyu Duan, Guangtao Zhai, Xiongkuo Min
- Abstract summary: We introduce TDVE-DB, a large-scale benchmark dataset for text-driven video editing. TDVE-DB consists of 3,857 edited videos generated from 12 diverse models across 8 editing categories. We propose TDVE-Assessor, a novel VQA model specifically designed for text-driven video editing assessment.
- Score: 54.44479359918971
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-driven video editing is rapidly advancing, yet its rigorous evaluation remains challenging due to the absence of dedicated video quality assessment (VQA) models capable of discerning the nuances of editing quality. To address this critical gap, we introduce TDVE-DB, a large-scale benchmark dataset for text-driven video editing. TDVE-DB consists of 3,857 edited videos generated from 12 diverse models across 8 editing categories, and is annotated with 173,565 human subjective ratings along three crucial dimensions, i.e., edited video quality, editing alignment, and structural consistency. Based on TDVE-DB, we first conduct a comprehensive evaluation of the 12 state-of-the-art editing models, revealing the strengths and weaknesses of current video editing techniques, and then benchmark existing VQA methods in the context of text-driven video editing evaluation. Building on these insights, we propose TDVE-Assessor, a novel VQA model specifically designed for text-driven video editing assessment. TDVE-Assessor integrates both spatial and temporal video features into a large language model (LLM) for rich contextual understanding, providing comprehensive quality assessment. Extensive experiments demonstrate that TDVE-Assessor substantially outperforms existing VQA models on TDVE-DB across all three evaluation dimensions, setting a new state of the art. Both TDVE-DB and TDVE-Assessor will be released upon publication.
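The abstract describes TDVE-Assessor as fusing spatial and temporal video features into an LLM that outputs three quality scores. A minimal sketch of that idea is below; every function here is an illustrative placeholder (simple statistics standing in for real encoders), not the authors' implementation.

```python
# Hypothetical sketch of the TDVE-Assessor idea: spatial and temporal video
# features are fused into one context vector, and a linear "LLM head" stands
# in for the language model that regresses the three TDVE-DB dimensions
# (edited video quality, editing alignment, structural consistency).
import numpy as np

def spatial_features(frames):
    # Placeholder for a vision encoder: per-frame channel means.
    return np.stack([f.mean(axis=(0, 1)) for f in frames])  # (T, C)

def temporal_features(frames):
    # Placeholder for motion features: mean absolute frame differences.
    return np.array([np.abs(frames[t] - frames[t - 1]).mean()
                     for t in range(1, len(frames))])  # (T-1,)

def assess(frames, weights):
    # Fuse spatial and temporal evidence, then map to three scores.
    ctx = np.concatenate([spatial_features(frames).mean(axis=0),
                          [temporal_features(frames).mean()]])
    scores = weights @ ctx  # (3,)
    return dict(zip(["quality", "alignment", "consistency"], scores))

rng = np.random.default_rng(0)
frames = [rng.random((8, 8, 3)) for _ in range(5)]
W = rng.random((3, 4))  # 3 output dimensions, 3 spatial + 1 temporal features
print(assess(frames, W))
```

The real model would replace the placeholder extractors with learned encoders and the linear head with an LLM; the sketch only shows the fuse-then-score structure the abstract describes.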
Related papers
- VEU-Bench: Towards Comprehensive Understanding of Video Editing [4.9254235505057835]
We introduce VEU-Bench (Video Editing Understanding Benchmark), a comprehensive benchmark that categorizes video editing components across various dimensions. Unlike previous video editing understanding benchmarks that focus mainly on editing element classification, VEU-Bench encompasses 19 fine-grained tasks across three stages: recognition, reasoning, and judging. We develop Oscars, a VEU expert model fine-tuned on the curated VEU-Bench dataset. It outperforms existing open-source Vid-LLMs on VEU-Bench by over 28.3% in accuracy and achieves performance comparable to commercial models like GPT-4o.
arXiv Detail & Related papers (2025-04-24T04:36:28Z)
- FiVE: A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models [17.788970036356297]
We introduce FiVE, a Fine-grained Video Editing Benchmark for evaluating emerging diffusion and rectified flow models. Our benchmark includes 74 real-world videos and 26 generated videos, featuring 6 fine-grained editing types, 420 object-level editing prompt pairs, and their corresponding masks. We evaluate five diffusion-based and two RF-based editing methods on our FiVE benchmark using 15 metrics, covering background preservation, text-video similarity, temporal consistency, video quality, and runtime.
arXiv Detail & Related papers (2025-03-17T19:47:41Z)
- AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM [54.44479359918971]
We first present AIGVQA-DB, a large-scale dataset comprising 36,576 AIGVs generated by 15 advanced text-to-video models using 1,048 prompts.
We then introduce AIGV-Assessor, a novel VQA model that leverages intricate quality attributes to capture precise video quality scores and pairwise video preferences.
arXiv Detail & Related papers (2024-11-26T08:43:15Z)
- StableV2V: Stablizing Shape Consistency in Video-to-Video Editing [11.09708780767668]
We present a shape-consistent video editing method, namely StableV2V, in this paper.
Our method decomposes the entire editing pipeline into several sequential procedures, where it edits the first video frame, then establishes an alignment between the delivered motions and user prompts, and eventually propagates the edited contents to all other frames based on such alignment.
Experimental results and analyses demonstrate the superior performance, visual consistency, and inference efficiency of our method compared to existing state-of-the-art studies.
arXiv Detail & Related papers (2024-11-17T11:48:01Z)
- EditBoard: Towards a Comprehensive Evaluation Benchmark for Text-Based Video Editing Models [16.045012576543474]
Text-based video editing has emerged as a promising field, enabling precise modifications to videos based on text prompts. Existing evaluations are limited and inconsistent, typically summarizing overall performance with a single score. We propose EditBoard, the first comprehensive evaluation benchmark for text-based video editing models.
arXiv Detail & Related papers (2024-09-15T08:43:18Z)
- VE-Bench: Subjective-Aligned Benchmark Suite for Text-Driven Video Editing Quality Assessment [6.627422081288281]
We introduce VE-Bench, a benchmark suite tailored to the assessment of text-driven video editing. This suite includes VE-Bench DB, a video quality assessment (VQA) database for video editing. VE-Bench QA focuses on the text-video alignment and the relevance modeling between source and edited videos.
arXiv Detail & Related papers (2024-08-21T09:49:32Z)
- Benchmarking Multi-dimensional AIGC Video Quality Assessment: A Dataset and Unified Model [56.03592388332793]
We investigate the AIGC-VQA problem, considering both subjective and objective quality assessment perspectives. For the subjective perspective, we construct the Large-scale Generated Video Quality assessment (LGVQ) dataset, consisting of 2,808 AIGC videos. We evaluate the perceptual quality of AIGC videos from three critical dimensions: spatial quality, temporal quality, and text-video alignment. We propose the Unify Generated Video Quality assessment (UGVQ) model, designed to accurately evaluate the multi-dimensional quality of AIGC videos.
arXiv Detail & Related papers (2024-07-31T07:54:26Z)
- Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [54.00254267259069]
We establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date.
The dataset is composed of 10,000 videos generated by 9 different T2V models.
We propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA).
arXiv Detail & Related papers (2024-03-18T16:52:49Z)
- Towards A Better Metric for Text-to-Video Generation [102.16250512265995]
Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos.
We introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore).
This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts.
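The T2VScore description above combines two criteria, Text-Video Alignment and Video Quality, into one metric. A minimal sketch of such a combination is shown below; the weighted blend is purely an assumption for illustration, not the paper's actual mixture-of-experts formulation.

```python
# Hypothetical illustration of blending the two T2VScore criteria.
# Both inputs are assumed to lie in [0, 1]; the linear weighting is an
# illustrative assumption, not the method from the paper.
def t2v_score(alignment: float, quality: float, w: float = 0.5) -> float:
    """Blend text-video alignment and video quality into one score."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return w * alignment + (1.0 - w) * quality

print(t2v_score(0.8, 0.6))  # equal weighting -> 0.7
```

The paper's actual Video Quality term uses a mixture of experts rather than a fixed scalar, so this sketch only conveys the two-criterion structure.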
arXiv Detail & Related papers (2024-01-15T15:42:39Z)
- StableVQA: A Deep No-Reference Quality Assessment Model for Video Stability [56.462032266188785]
Video shakiness is an unpleasant distortion of User Generated Content (UGC) videos, which is usually caused by the unstable hold of cameras.
We build a new database of 1,952 diversely shaky videos, where each video has a Mean Opinion Score (MOS) on the degree of video stability rated by 34 subjects.
We elaborately design a novel VQA-S model named StableVQA, which consists of three feature extractors to acquire the optical flow, semantic, and blur features respectively, and a regression layer to predict the final stability score.
arXiv Detail & Related papers (2023-08-09T12:04:36Z)
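The StableVQA entry above describes three feature extractors (optical flow, semantics, blur) feeding a regression layer. A rough structural sketch follows; every function is an illustrative placeholder standing in for a learned branch, not the authors' code.

```python
# Rough sketch of the StableVQA structure: three feature branches whose
# outputs are concatenated and passed through a linear regression layer
# that predicts a single stability score. All branches are placeholders.
import numpy as np

def flow_features(frames):
    # Placeholder for the optical-flow branch: mean frame-to-frame change.
    return np.array([np.abs(frames[t] - frames[t - 1]).mean()
                     for t in range(1, len(frames))]).mean(keepdims=True)

def semantic_features(frames):
    # Placeholder for the semantic branch: channel means of the last frame.
    return frames[-1].mean(axis=(0, 1))

def blur_features(frames):
    # Placeholder for the blur branch: pixel-intensity variance.
    return np.array([frames[-1].var()])

def predict_stability(frames, w, b=0.0):
    # Concatenate the three branches and apply the regression layer.
    feats = np.concatenate([flow_features(frames),
                            semantic_features(frames),
                            blur_features(frames)])
    return float(w @ feats + b)

rng = np.random.default_rng(0)
frames = [rng.random((16, 16, 3)) for _ in range(4)]
w = rng.random(5)  # 1 flow + 3 semantic + 1 blur feature
score = predict_stability(frames, w)
print(score)
```

In the real model each branch is a deep network trained on the MOS labels; the sketch only mirrors the concatenate-then-regress layout.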
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.