IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment
- URL: http://arxiv.org/abs/2510.11647v1
- Date: Mon, 13 Oct 2025 17:27:08 GMT
- Title: IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment
- Authors: Yinan Chen, Jiangning Zhang, Teng Hu, Yuxiang Zeng, Zhucun Xue, Qingdong He, Chengjie Wang, Yong Liu, Xiaobin Hu, Shuicheng Yan
- Abstract summary: IVEBench is a benchmark suite specifically designed for instruction-guided video editing assessment. It comprises a diverse database of 600 high-quality source videos, spanning seven semantic dimensions, and covering video lengths ranging from 32 to 1,024 frames. IVEBench establishes a three-dimensional evaluation protocol encompassing video quality, instruction compliance and video fidelity.
- Score: 108.8652018167452
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction-guided video editing has emerged as a rapidly advancing research direction, offering new opportunities for intuitive content transformation while also posing significant challenges for systematic evaluation. Existing video editing benchmarks fail to support the evaluation of instruction-guided video editing adequately and further suffer from limited source diversity, narrow task coverage and incomplete evaluation metrics. To address the above limitations, we introduce IVEBench, a modern benchmark suite specifically designed for instruction-guided video editing assessment. IVEBench comprises a diverse database of 600 high-quality source videos, spanning seven semantic dimensions, and covering video lengths ranging from 32 to 1,024 frames. It further includes 8 categories of editing tasks with 35 subcategories, whose prompts are generated and refined through large language models and expert review. Crucially, IVEBench establishes a three-dimensional evaluation protocol encompassing video quality, instruction compliance and video fidelity, integrating both traditional metrics and multimodal large language model-based assessments. Extensive experiments demonstrate the effectiveness of IVEBench in benchmarking state-of-the-art instruction-guided video editing methods, showing its ability to provide comprehensive and human-aligned evaluation outcomes.
Related papers
- UniVBench: Towards Unified Evaluation for Video Foundation Models [29.73247324829126]
Video foundation models aim to integrate video understanding, generation, editing, and instruction following within a single framework. We introduce UniVBench, a benchmark for evaluating video foundation models across four core abilities. Our benchmark substantially expands the complexity of evaluation by incorporating 200 high-quality, diverse and multi-shot videos.
arXiv Detail & Related papers (2026-02-25T12:08:53Z)
- ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning [57.08352504712699]
Video unified models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing. We introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. We propose ReViSE, a framework that unifies generation and evaluation within a single architecture.
arXiv Detail & Related papers (2025-12-10T18:57:09Z)
- In-Context Learning with Unpaired Clips for Instruction-based Video Editing [51.943707933717185]
We introduce a low-cost pretraining strategy for instruction-based video editing. Our framework first pretrains on approximately 1M real video clips to learn basic editing concepts. Our method surpasses existing instruction-based video editing approaches in both instruction alignment and visual fidelity.
arXiv Detail & Related papers (2025-10-16T13:02:11Z)
- VideoScore2: Think before You Score in Generative Video Evaluation [69.43069741467603]
VideoScore2 is a multi-dimensional, interpretable, and human-aligned framework that explicitly evaluates visual quality, text-to-video alignment, and physical/common-sense consistency. Our model is trained on a large-scale dataset VideoFeedback2 containing 27,168 human-annotated videos.
arXiv Detail & Related papers (2025-09-26T18:09:03Z)
- TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs [54.44479359918971]
We introduce TDVE-DB, a large-scale benchmark dataset for text-driven video editing. TDVE-DB consists of 3,857 edited videos generated from 12 diverse models across 8 editing categories. We propose TDVE-Assessor, a novel VQA model specifically designed for text-driven video editing assessment.
arXiv Detail & Related papers (2025-05-26T05:47:09Z)
- SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models [80.3895950009792]
Achieving fine-grained spatio-temporal understanding in videos remains a major challenge for current Video Large Multimodels (Video LMMs). We contribute in three core aspects: dataset, model, and benchmark. First, we introduce SAMA-239K, a large-scale dataset comprising 15K videos specifically to enable joint learning of video understanding, grounding, and multi-turn video chat. Second, we propose the SAMA model, which incorporates a versatile spatio-temporal context aggregator and a Segment Model to jointly enhance fine-grained video comprehension and precise grounding capabilities.
arXiv Detail & Related papers (2025-05-24T18:13:16Z)
- VEU-Bench: Towards Comprehensive Understanding of Video Editing [4.9254235505057835]
We introduce VEU-Bench (Video Editing Understanding Benchmark), a comprehensive benchmark that categorizes video editing components across various dimensions. Unlike previous video editing understanding benchmarks that focus mainly on editing element classification, VEU-Bench encompasses 19 fine-grained tasks across three stages: recognition, reasoning, and judging. We develop Oscars, a VEU expert model fine-tuned on the curated VEU-Bench dataset. It outperforms existing open-source Vid-LLMs on VEU-Bench by over 28.3% in accuracy and achieves performance comparable to commercial models like GPT-4o.
arXiv Detail & Related papers (2025-04-24T04:36:28Z)
- EditBoard: Towards a Comprehensive Evaluation Benchmark for Text-Based Video Editing Models [16.045012576543474]
Text-based video editing has emerged as a promising field, enabling precise modifications to videos based on text prompts. Existing evaluations are limited and inconsistent, typically summarizing overall performance with a single score. We propose EditBoard, the first comprehensive evaluation benchmark for text-based video editing models.
arXiv Detail & Related papers (2024-09-15T08:43:18Z)
- VE-Bench: Subjective-Aligned Benchmark Suite for Text-Driven Video Editing Quality Assessment [6.627422081288281]
We introduce VE-Bench, a benchmark suite tailored to the assessment of text-driven video editing. This suite includes VE-Bench DB, a video quality assessment (VQA) database for video editing. VE-Bench QA focuses on the text-video alignment and the relevance modeling between source and edited videos.
arXiv Detail & Related papers (2024-08-21T09:49:32Z)
- Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models [81.84810348214113]
Video-based large language models (Video-LLMs) have been recently introduced, targeting both fundamental improvements in perception and comprehension, and a diverse range of user inquiries.
To guide the development of such a model, the establishment of a robust and comprehensive evaluation system becomes crucial.
This paper proposes Video-Bench, a new comprehensive benchmark along with a toolkit specifically designed for evaluating Video-LLMs.
arXiv Detail & Related papers (2023-11-27T18:59:58Z)