Empowering Agentic Video Analytics Systems with Video Language Models
- URL: http://arxiv.org/abs/2505.00254v1
- Date: Thu, 01 May 2025 02:40:23 GMT
- Title: Empowering Agentic Video Analytics Systems with Video Language Models
- Authors: Yuxuan Yan, Shiqi Jiang, Ting Cao, Yifan Yang, Qianqian Yang, Yuanchao Shu, Yuqing Yang, Lili Qiu
- Abstract summary: Video-Language Models (VLMs) offer potential for enabling open-ended video understanding, reasoning, and analytics. VLMs' limited context windows present challenges when processing ultra-long video content. We introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics.
- Score: 16.962065998310223
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI-driven video analytics has become increasingly pivotal across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Video-Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively, significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%.
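The abstract names two components: near real-time Event Knowledge Graph (EKG) construction for indexing long or continuous streams, and an agentic retrieval-generation mechanism over that graph. The sketch below is only an illustrative approximation of those two ideas, not the authors' implementation; the class and function names (Event, EventKnowledgeGraph, retrieve) are hypothetical, and a real system would populate events from VLM captions of stream segments rather than handwritten examples.

```python
# A minimal sketch (not the authors' implementation) of the two ideas the
# abstract describes: (1) indexing a long video stream into an Event
# Knowledge Graph (EKG), and (2) retrieving relevant events from that graph
# before answer generation. All names here are hypothetical.
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Event:
    event_id: int
    start_s: float              # segment start time in seconds
    end_s: float                # segment end time in seconds
    description: str            # VLM-generated caption for the segment
    entities: set[str] = field(default_factory=set)


class EventKnowledgeGraph:
    """Events are nodes; edges link events that share entities or are adjacent in time."""

    def __init__(self) -> None:
        self.events: dict[int, Event] = {}
        self.edges: dict[int, set[int]] = {}

    def add_event(self, event: Event) -> None:
        self.events[event.event_id] = event
        self.edges.setdefault(event.event_id, set())
        for other in self.events.values():
            if other.event_id == event.event_id:
                continue
            shares_entity = bool(event.entities & other.entities)
            adjacent = (abs(event.start_s - other.end_s) < 1.0
                        or abs(other.start_s - event.end_s) < 1.0)
            if shares_entity or adjacent:
                self.edges[event.event_id].add(other.event_id)
                self.edges[other.event_id].add(event.event_id)

    def retrieve(self, query_terms: set[str], hops: int = 1) -> list[Event]:
        """Keyword-seeded lookup plus graph expansion; the paper's agentic
        retrieval-generation loop is far richer than this."""
        seeds = [e for e in self.events.values()
                 if query_terms & (set(e.description.lower().split()) | e.entities)]
        selected = {e.event_id for e in seeds}
        for _ in range(hops):
            selected |= {n for eid in list(selected) for n in self.edges[eid]}
        return sorted((self.events[i] for i in selected), key=lambda e: e.start_s)


if __name__ == "__main__":
    ekg = EventKnowledgeGraph()
    # Handwritten example events; a real deployment would stream these in.
    ekg.add_event(Event(0, 0.0, 30.0, "a delivery truck parks at the gate", {"truck", "gate"}))
    ekg.add_event(Event(1, 30.0, 60.0, "a worker unloads boxes from the truck", {"worker", "truck"}))
    ekg.add_event(Event(2, 3600.0, 3630.0, "the gate is closed for the night", {"gate"}))

    for event in ekg.retrieve({"truck"}):
        print(event.start_s, event.description)
```

The toy retrieval expands one hop along shared-entity and temporal-adjacency edges, which is meant only to convey why a graph index helps a query reach events far apart in a multi-hour stream.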
Related papers
- VideoGen-Eval: Agent-based System for Video Generation Evaluation [54.662739174367836]
Rapid progress in video generation has rendered existing evaluation systems inadequate for assessing state-of-the-art models.
We propose VideoGen-Eval, an agent evaluation system that integrates content structuring, MLLM-based content judgment, and patch tools for temporal-dense dimensions.
We introduce a video generation benchmark to evaluate existing cutting-edge models and verify the effectiveness of our evaluation system.
arXiv Detail & Related papers (2025-03-30T14:12:21Z)
- AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM [54.44479359918971]
We first present AIGVQA-DB, a large-scale dataset comprising 36,576 AIGVs generated by 15 advanced text-to-video models using 1,048 prompts.
We then introduce AIGV-Assessor, a novel VQA model that leverages intricate quality attributes to capture precise video quality scores and pair video preferences.
arXiv Detail & Related papers (2024-11-26T08:43:15Z)
- MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding [67.56182262082729]
We introduce MMBench-Video, a quantitative benchmark to rigorously evaluate large vision-language models (LVLMs) in video understanding.
MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases.
The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy.
arXiv Detail & Related papers (2024-06-20T17:26:01Z)
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis [118.08008540513596]
Video-MME is the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis.
We extensively evaluate various state-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as open-source image models.
Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models.
arXiv Detail & Related papers (2024-05-31T17:59:47Z)
- How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES).
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z)
- MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define its temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z)
- AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering [6.088350050879401]
We propose a novel and challenging benchmark, AutoEval-Video, to comprehensively evaluate large vision-language models in open-ended video question answering.
The comprehensiveness of AutoEval-Video is demonstrated in two aspects: 1) AutoEval-Video constructs open-ended video-questions across 9 skill dimensions, addressing capabilities of perception, comprehension, and generation.
Using instance-specific rules as prompts, GPT-4 serves as an automatic evaluator and achieves a stable evaluation accuracy of around 97.0%, comparable to the 94.9%-97.5% accuracy of a human evaluator (a toy sketch of this judging protocol follows the list below).
arXiv Detail & Related papers (2023-11-25T02:46:12Z)
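For context on the AutoEval-Video entry above, the following is a minimal, hypothetical sketch of the "instance-specific rules as prompt" judging protocol it describes: each question carries its own grading rules, and an LLM judge receives the rules, the question, and the candidate answer, returning a verdict from which accuracy is computed. The function names and the toy judge are assumptions, not the paper's code.

```python
# A toy sketch of LLM-as-judge evaluation with instance-specific grading
# rules, as described for AutoEval-Video (GPT-4 is the judge in the paper).
# `call_judge` is a stand-in for a real LLM client and is purely illustrative.
from typing import Callable


def build_judge_prompt(question: str, rules: str, model_answer: str) -> str:
    return (
        "You are grading an open-ended video QA answer.\n"
        f"Question: {question}\n"
        f"Instance-specific grading rules: {rules}\n"
        f"Candidate answer: {model_answer}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )


def evaluate(samples: list, call_judge: Callable[[str], str]) -> float:
    """Return accuracy over dicts with 'question', 'rules', and 'model_answer' keys."""
    correct = 0
    for s in samples:
        prompt = build_judge_prompt(s["question"], s["rules"], s["model_answer"])
        verdict = call_judge(prompt).strip().upper()
        correct += verdict.startswith("CORRECT")
    return correct / len(samples) if samples else 0.0


if __name__ == "__main__":
    # Toy judge; replace with an actual LLM call in practice.
    def fake_judge(prompt: str) -> str:
        return "CORRECT" if "red car" in prompt else "INCORRECT"

    demo = [{"question": "What vehicle enters last?",
             "rules": "Accept any answer that names a red car; the color must be mentioned.",
             "model_answer": "A red car pulls in at the end."}]
    print(f"accuracy = {evaluate(demo, fake_judge):.2f}")
```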