VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
Abstract Overview
VideoNet is a domain-specific action recognition benchmark covering 1,000 actions across 37 domains, designed to evaluate modern vision-language models (VLMs) on fine-grained action understanding. The benchmark comprises 5,000 clips, with expert verification indicating approximately 97% label accuracy, and introduces multiple-choice, binary, and few-shot evaluation settings. Experiments reveal that current VLMs, especially open-weight models, struggle on domain-specific actions and benefit far less from in-context video examples than humans do. The authors also construct a large-scale training dataset of approximately 160,000 clips (yielding nearly 500K video QA pairs) using automated pipelines, and fine-tune a Molmo2-4B model that substantially outperforms its base version and all evaluated open 8B models on VideoNet.
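As a rough illustration of how the multiple-choice and binary settings might be scored, the sketch below computes plain accuracy over predictions. The function name `accuracy`, the answer encoding (option indices for multiple-choice, booleans for binary), and the toy predictions are assumptions made for illustration; they are not taken from the VideoNet release.

```python
from typing import Hashable, Sequence


def accuracy(predictions: Sequence[Hashable], answers: Sequence[Hashable]) -> float:
    """Return the percentage of predictions that exactly match the answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical multiple-choice items: each answer is the index of the correct
# action among the candidate options (the distractors being hard negatives).
mc_answers = [0, 2, 1, 3]
mc_predictions = [0, 2, 1, 0]
mc_accuracy = accuracy(mc_predictions, mc_answers)

# Hypothetical binary items: "does this clip show action X?" answered yes/no.
binary_answers = [True, False, True, True]
binary_predictions = [True, True, True, False]
binary_accuracy = accuracy(binary_predictions, binary_answers)

print(f"multiple-choice: {mc_accuracy:.1f}%  binary: {binary_accuracy:.1f}%")
```

In a few-shot variant, the same scoring would apply; only the model input changes, since labeled example clips are supplied in context before the query clip.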
Novelty
The main novelty is a large-scale benchmark specifically targeting domain-specific action recognition across 37 diverse domains, offering broader coverage than prior narrowly scoped fine-grained or coarse-grained datasets. The work is also distinctive in pairing this benchmark with the first large-scale, automatically collected training dataset for domain-specific actions, built with pipelines that avoid reliance on domain experts.
Results
On the multiple-choice benchmark, the best proprietary model (Gemini 3.1 Pro) reaches 69.9% accuracy, while the best open-weight 8B model (Qwen3-VL-8B) reaches only 45.0%. The fine-tuned Molmo2-4B achieves 53.5% on multiple-choice (+11.5 points over base) and 66.6% on binary 0-shot (+11.3 points over base), surpassing all evaluated open 8B models. Few-shot visual examples help humans far more than models: humans improve by 13.6 points in the 3-shot setting, whereas average model gains are only about 3 points and some models (e.g., Gemini 3.1 Pro) decline.
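The reported figures imply a few numbers that are not stated directly. The short sketch below derives them; the variable names are illustrative, and the derived values are simple arithmetic on the numbers quoted above, not additional results from the paper.

```python
# Scores and gains as quoted in the Results paragraph (percent accuracy).
finetuned = {"multiple_choice": 53.5, "binary_0shot": 66.6}
gain_over_base = {"multiple_choice": 11.5, "binary_0shot": 11.3}

# Implied base Molmo2-4B scores: fine-tuned score minus the reported gain.
implied_base = {
    setting: round(finetuned[setting] - gain_over_base[setting], 1)
    for setting in finetuned
}

best_proprietary_mc = 69.9  # Gemini 3.1 Pro, multiple-choice
best_open_8b_mc = 45.0      # Qwen3-VL-8B, multiple-choice

# Gap between the best proprietary and the best open-weight 8B model.
proprietary_open_gap = round(best_proprietary_mc - best_open_8b_mc, 1)

# Margin of the fine-tuned 4B model over the best open-weight 8B model.
finetuned_margin = round(finetuned["multiple_choice"] - best_open_8b_mc, 1)

# Remaining distance from the fine-tuned 4B model to the best proprietary model.
remaining_gap = round(best_proprietary_mc - finetuned["multiple_choice"], 1)

print(implied_base, proprietary_open_gap, finetuned_margin, remaining_gap)
```

Under these figures, the base model would sit at about 42.0% (multiple-choice) and 55.3% (binary 0-shot); fine-tuning puts the 4B model 8.5 points ahead of the best open 8B model on multiple-choice but still 16.4 points behind the best proprietary model.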
Key Points
- VideoNet benchmarks 1,000 actions from 37 domains using carefully curated clips with hard negatives; expert verification indicates that the labels are approximately 97% accurate.
- Current VLMs show limited domain-specific action understanding and weak utilization of in-context video examples, with open models performing substantially worse than proprietary systems and far below human few-shot learning gains.
- Training on automatically collected domain-specific action data (162K clips with strict filtering) is more effective than relying on test-time few-shot examples, enabling a fine-tuned 4B model to outperform all evaluated open 8B models on the benchmark.