FuguReport

An Attribute-Based Measure of Video Complexity

Authors: Aditya Sarkar, Yi Li, Zihao Wang, Jiacheng Cheng, Sai Vidyaranya Nuthalapati, Aashu Singh, Shlok Kumar Mishra, David Jacobs, Nuno Vasconcelos
Affiliations: Meta / University of Maryland, College Park / Yale University / University of California, San Diego
Categories: Evaluation / Model Evaluation / Measuring video LLM failure probability; Method / Complexity Metrics / Nonparametric attribute-based video complexity; Application / Video Retrieval / Video dataset attribute analysis
License: CC BY 4.0

Abstract Overview

This paper proposes VideoABC, a non-parametric framework for estimating the complexity of a video-question pair for a given video-LLM, where complexity is defined as the probability that the model fails on that pair. The method represents each pair in a small attribute space built from interpretable video attributes such as event location, non-event proportion, scene complexity, and event speed, and then estimates complexity from the failure rates observed in quantized regions (cells) of that space. To support both in-distribution and out-of-distribution cases, the framework combines a k-means quantizer learned from real benchmark videos with a universal lattice quantizer supported by synthetic videos generated with controlled attributes. The experiments evaluate VideoABC across multiple video-LLMs and compare it with judge-based and MLP-based baselines using calibration-oriented metrics.
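The cell-based estimate described above can be illustrated with a minimal sketch. It assumes the centroids have already been learned (for example by k-means) and that each reference pair carries a binary failure label for the target model. All function names, the two-dimensional attribute vectors, and the fallback value `default` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from math import dist


def nearest_cell(x, centroids):
    """Return the index of the centroid closest to attribute vector x."""
    return min(range(len(centroids)), key=lambda i: dist(x, centroids[i]))


def cell_failure_rates(attrs, failed, centroids):
    """Empirical failure rate of the target model in each quantizer cell."""
    counts = defaultdict(lambda: [0, 0])  # cell index -> [failures, total]
    for x, f in zip(attrs, failed):
        c = nearest_cell(x, centroids)
        counts[c][0] += int(f)
        counts[c][1] += 1
    return {c: fails / total for c, (fails, total) in counts.items()}


def complexity(x, centroids, rates, default=0.5):
    """Complexity of a query = failure rate of the cell it falls into.

    `default` is an assumed fallback for cells with no reference data.
    """
    return rates.get(nearest_cell(x, centroids), default)


# Toy reference set: 2-D attribute vectors and failure labels (illustrative only).
centroids = [(0.2, 0.2), (0.8, 0.8)]
attrs = [(0.1, 0.2), (0.3, 0.1), (0.9, 0.7), (0.8, 0.9), (0.7, 0.8)]
failed = [0, 1, 1, 1, 0]

rates = cell_failure_rates(attrs, failed, centroids)
query_complexity = complexity((0.75, 0.9), centroids, rates)
```

Because the estimate is a lookup of an observed frequency rather than the output of a trained scorer, it has no learned parameters beyond the quantizer itself.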

Novelty

The distinctive aspect of the work is its explicit, attribute-based and non-parametric definition of video complexity as failure probability, rather than relying on a holistic parametric scorer or an external LLM judge. It also introduces a hybrid quantization strategy (combining in-distribution k-means cells with a universal lattice quantizer) and a psychophysics-inspired synthetic video generation procedure to populate the attribute space when real reference data are limited.
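One way to picture the hybrid quantizer is the sketch below. The summary does not state how the two quantizers are combined, so the switching rule here (use the k-means cell when the query lies within `radius` of its nearest centroid, otherwise fall back to the lattice cell) is purely an assumption, as are the uniform cubic lattice, the `step` size, and all names.

```python
from math import dist, floor


def lattice_cell(x, step=0.25):
    """Map an attribute vector to a cell of a uniform cubic lattice."""
    return tuple(floor(v / step) for v in x)


def hybrid_complexity(x, centroids, kmeans_rates, lattice_rates,
                      radius=0.3, step=0.25, default=0.5):
    """Look up a failure rate from the k-means cells or the lattice cells.

    Hypothetical rule: trust the in-distribution (k-means) estimate only
    when x is close to a centroid learned from real data; otherwise use the
    lattice cell, whose rate would come from synthetic videos.
    """
    i = min(range(len(centroids)), key=lambda j: dist(x, centroids[j]))
    if dist(x, centroids[i]) <= radius and i in kmeans_rates:
        return kmeans_rates[i]
    return lattice_rates.get(lattice_cell(x, step), default)


# Toy inputs (illustrative values only).
centroids = [(0.2, 0.2)]
kmeans_rates = {0: 0.4}          # failure rate per k-means cell
lattice_rates = {(3, 3): 0.9}    # failure rate per lattice cell

in_dist = hybrid_complexity((0.25, 0.2), centroids, kmeans_rates, lattice_rates)
out_dist = hybrid_complexity((0.9, 0.8), centroids, kmeans_rates, lattice_rates)
```

In this toy setup the first query falls back on real-data statistics, while the second, far from any centroid, is served by the lattice cell.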

Results

Across six target video-LLMs, VideoABC achieves the lowest expected calibration error (ECE) among the compared methods; for example, on Qwen-3.5-VL 9B it reports 0.087 versus 0.171 for Judge, and on LLaVA-OV 7B it reports 0.058 versus 0.148 for Judge. The method also shows a favorable efficiency-performance trade-off: its reported inference latency is 226 ms, roughly one-eighth of the 1802 ms reported for a 72B judge model, while yielding better calibration. Ablations further indicate that the combined quantizer gives the best practical trade-off, with the universal quantizer helping out-of-distribution generalization and the in-distribution quantizer helping in-distribution performance.
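For readers unfamiliar with the metric, the sketch below computes expected calibration error in its common equal-width-bin form: predictions are grouped by predicted failure probability, and the gap between mean prediction and observed failure rate is averaged with weights proportional to bin size. The number of bins and the binning scheme used in the paper are not given in this summary, so `n_bins=10` and the function name are assumptions.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Equal-width-bin ECE between predicted failure probabilities and
    observed binary failures (1 = the model failed)."""
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        mean_p = sum(p for p, _ in b) / len(b)
        fail_rate = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(mean_p - fail_rate)
    return ece


# Toy example: two pairs predicted at 0.2 (one fails), two at 0.8 (both fail).
toy_ece = expected_calibration_error([0.2, 0.2, 0.8, 0.8], [0, 1, 1, 1])
# Each bin holds half the data; the gaps are 0.3 and 0.2, so ECE is about 0.25.
```

A lower value means the predicted failure probabilities track the observed failure frequencies more closely, which is the sense in which the reported 0.087 improves on 0.171.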

Key Points

  1. VideoABC estimates video-question complexity from interpretable attributes and expected failure rates of quantized attribute cells rather than from a parametric judge model.
  2. The framework combines real-data k-means quantization with a universal lattice quantizer supported by synthetic videos, addressing both in-distribution accuracy and out-of-distribution coverage.
  3. Empirically, VideoABC is better calibrated than judge, direct-attribute, and MLP baselines across several video-LLMs, while retaining relatively low inference latency.

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.