An Attribute-Based Measure of Video Complexity
Abstract Overview
This paper proposes VideoABC, a non-parametric framework for estimating the complexity of a video-question pair for a video-LLM, where complexity is defined as the probability that the model fails on that pair. The method represents each pair in a small attribute space built from interpretable video attributes such as event location, non-event proportion, scene complexity, and event speed. It then quantizes that space into cells and estimates the complexity of a pair as the expected failure rate of the cell it falls into. To support both in-distribution and out-of-distribution cases, the framework combines a k-means quantizer learned from real benchmark videos with a universal lattice quantizer supported by synthetic videos generated with controlled attributes. The experiments evaluate VideoABC across multiple video-LLMs and compare it with judge-based and MLP-based baselines using calibration-oriented metrics.
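The cell-based estimate described above can be illustrated with a minimal sketch. It assumes the centroids are already available (for example from a k-means fit) and that each reference pair is labeled with whether the target model failed on it. The function names, the two-dimensional toy attributes, and the Euclidean nearest-centroid rule are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict
from math import dist


def nearest_cell(x, centroids):
    """Index of the centroid closest to attribute vector x (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: dist(x, centroids[i]))


def cell_failure_rates(attrs, failed, centroids):
    """Failure rate of each cell, computed from labeled reference pairs."""
    counts = defaultdict(lambda: [0, 0])  # cell index -> [failures, total]
    for x, f in zip(attrs, failed):
        cell = nearest_cell(x, centroids)
        counts[cell][0] += int(f)
        counts[cell][1] += 1
    return {cell: fails / total for cell, (fails, total) in counts.items()}


def estimate_complexity(x, centroids, rates, default=None):
    """Complexity of a new pair: the failure rate of the cell it falls into."""
    return rates.get(nearest_cell(x, centroids), default)


# Toy data (hypothetical): two attributes per pair, two cells.
centroids = [(0.2, 0.2), (0.8, 0.8)]
attrs = [(0.1, 0.3), (0.25, 0.15), (0.9, 0.7), (0.75, 0.85), (0.8, 0.9), (0.7, 0.8)]
failed = [False, False, True, True, False, True]

rates = cell_failure_rates(attrs, failed, centroids)
print(rates)
print(estimate_complexity((0.85, 0.75), centroids, rates))
```

Because the estimate is a lookup of a per-cell frequency, no scoring model is trained; the only learned objects in this sketch are the centroids and the per-cell counts.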
Novelty
The distinctive aspect of the work is its explicit, attribute-based and non-parametric definition of video complexity as failure probability, rather than relying on a holistic parametric scorer or an external LLM judge. It also introduces a hybrid quantization strategy, which combines in-distribution k-means cells with a universal lattice quantizer, and a psychophysics-inspired synthetic video generation procedure to populate the attribute space when real reference data are limited.
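One way such a hybrid quantizer could behave is sketched below. The summary does not state how the two quantizers are combined, so the rule used here (use the k-means cell when the query lies within a fixed radius of its nearest centroid, otherwise fall back to the lattice cell) is purely an assumption for illustration. The cubic lattice with a fixed `step`, the `radius` threshold, and all names are likewise hypothetical.

```python
from math import dist, floor


def lattice_cell(x, step=0.25):
    """Cell index on a uniform cubic lattice: floor of each coordinate over step."""
    return tuple(floor(v / step) for v in x)


def hybrid_estimate(x, centroids, kmeans_rates, lattice_rates, radius, step=0.25):
    """Assumed combination rule: in-distribution cell if x is near a centroid,
    otherwise the universal lattice cell (whose rate would come from synthetic videos)."""
    i = min(range(len(centroids)), key=lambda j: dist(x, centroids[j]))
    if dist(x, centroids[i]) <= radius and i in kmeans_rates:
        return kmeans_rates[i]
    return lattice_rates.get(lattice_cell(x, step))


# Toy values (hypothetical): per-cell failure rates for both quantizers.
centroids = [(0.2, 0.2), (0.8, 0.8)]
kmeans_rates = {0: 0.0, 1: 0.75}
lattice_rates = {(0, 3): 0.5}

in_dist = hybrid_estimate((0.85, 0.75), centroids, kmeans_rates, lattice_rates, radius=0.2)
out_dist = hybrid_estimate((0.1, 0.9), centroids, kmeans_rates, lattice_rates, radius=0.2)
print(in_dist, out_dist)
```

The lattice covers the whole attribute space regardless of where the benchmark videos fall, which is why it can serve queries far from every k-means centroid.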
Results
Across six target video-LLMs, VideoABC achieves the lowest expected calibration error (ECE) among the compared methods; for example, on Qwen-3.5-VL 9B it reports 0.087 versus 0.171 for Judge, and on LLaVA-OV 7B it reports 0.058 versus 0.148 for Judge. The method also shows a favorable efficiency-performance trade-off: its reported inference latency is 226 ms, roughly one-eighth of the 1802 ms reported for a 72B judge model, while yielding better calibration. Ablations further indicate that the combined quantizer gives the best practical trade-off, with the universal quantizer helping out-of-distribution generalization and the in-distribution quantizer helping in-distribution performance.
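For readers unfamiliar with the metric, the following sketch shows the standard equal-width-bin form of expected calibration error, applied to predicted failure probabilities. The paper's exact binning scheme is not given in this summary, so the ten equal-width bins and the function name are assumptions.

```python
def expected_calibration_error(pred, failed, n_bins=10):
    """ECE with equal-width bins over predicted failure probability in [0, 1].

    Each bin contributes |observed failure rate - mean prediction|,
    weighted by the fraction of samples that fall in the bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, f in zip(pred, failed):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[b].append((p, float(f)))
    n = len(pred)
    ece = 0.0
    for items in bins:
        if not items:
            continue  # empty bins contribute nothing
        mean_pred = sum(p for p, _ in items) / len(items)
        fail_rate = sum(f for _, f in items) / len(items)
        ece += len(items) / n * abs(fail_rate - mean_pred)
    return ece


# Toy check (hypothetical data): predictions of 0.1 and 0.9 that are each off by 0.1.
ece = expected_calibration_error([0.1, 0.1, 0.9, 0.9], [False, False, True, True])
print(round(ece, 3))
```

A lower value means the predicted complexity tracks the observed failure frequency more closely, which is the sense in which the reported 0.087 is better than 0.171.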
Key Points
- VideoABC estimates video-question complexity from interpretable attributes and expected failure rates of quantized attribute cells rather than from a parametric judge model.
- The framework combines real-data k-means quantization with a universal lattice quantizer supported by synthetic videos, addressing both in-distribution accuracy and out-of-distribution coverage.
- Empirically, VideoABC is better calibrated than judge, direct-attribute, and MLP baselines across the six target video-LLMs, while keeping inference latency low (226 ms reported).