2026-05-04 Daily Report: VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

Authors Tanush Yadav, Mohammadreza Salehi, Jae Sung Park, Vivek Ramanujan, Hannaneh Hajishirzi, Yejin Choi, Ali Farhadi, Rohun Tripathi, Ranjay Krishna

Affiliations University of Washington / Stanford University / Allen Institute for AI

Categories Evaluation / Action Recognition Benchmark / Domain-specific benchmark with 37 domains; Application / Video Understanding / Training and evaluation on large-scale video QA data; Method / Vision-Language Modeling / In-context learning challenges in VLMs

License CC BY 4.0

Abstract Overview

VideoNet is a domain-specific action recognition benchmark covering 1,000 actions across 37 domains, designed to evaluate modern vision-language models (VLMs) on fine-grained action understanding. The benchmark comprises 5,000 clips, with expert verification indicating approximately 97% label accuracy, and introduces multiple-choice, binary, and few-shot evaluation settings. Experiments reveal that current VLMs, especially open-weight models, struggle on domain-specific actions and benefit far less from in-context video examples than humans do. The authors also construct a large-scale training dataset of approximately 160,000 clips (yielding nearly 500K video QA pairs) using automated pipelines, and fine-tune a Molmo2-4B model that substantially outperforms its base version and all evaluated open 8B models on VideoNet.
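As a rough illustration of how a multiple-choice setting like the one described above could be scored, the sketch below computes overall and per-domain accuracy. It is not the authors' evaluation code: the record layout, the function name `accuracy_by_domain`, and the domain and action labels are all hypothetical placeholders.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Score (domain, predicted_label, gold_label) tuples.

    Returns (overall_accuracy, {domain: accuracy}).
    The tuple layout is an assumption made for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, predicted, gold in records:
        total[domain] += 1
        correct[domain] += int(predicted == gold)
    per_domain = {d: correct[d] / total[d] for d in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_domain


# Toy records invented for illustration; not taken from VideoNet.
toy_records = [
    ("domain_a", "action_1", "action_1"),
    ("domain_a", "action_2", "action_3"),
    ("domain_b", "action_4", "action_4"),
    ("domain_b", "action_5", "action_5"),
]

overall, per_domain = accuracy_by_domain(toy_records)
print(f"overall={overall:.2f}", per_domain)
```

Reporting a per-domain breakdown alongside the overall number can show whether a model's errors are concentrated in a few domains, which a single aggregate score hides.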

Novelty

The main novelty is a large-scale benchmark specifically targeting domain-specific action recognition across 37 diverse domains, offering broader coverage than prior datasets, which are either narrowly scoped and fine-grained or coarse-grained. The work is also distinctive in pairing this benchmark with the first large-scale, automatically collected training dataset for domain-specific actions, built with pipelines that avoid reliance on domain experts.

Results

On the multiple-choice benchmark, the best proprietary model (Gemini 3.1 Pro) reaches 69.9% accuracy while the best open-weight 8B model (Qwen3-VL-8B) reaches only 45.0%. The fine-tuned Molmo2-4B achieves 53.5% on multiple-choice (+11.5 points over base) and 66.6% on binary 0-shot (+11.3 points over base), surpassing all evaluated open 8B models. Few-shot visual examples help humans far more than models, with humans improving by 13.6 points in the 3-shot setting while average model gains are only about 3 points and some models (e.g., Gemini 3.1 Pro) decline.
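The figures above imply a few numbers that are not stated directly, such as the base Molmo2-4B scores and the gap between the best proprietary and open-weight models. The short calculation below derives them from the reported values only; the variable names are illustrative, and the "about 3 points" average model gain is treated as exactly 3.0 for the sake of the arithmetic.

```python
# Values reported in the summary (accuracy in percent, gains in points).
GEMINI_MC = 69.9          # best proprietary model, multiple-choice
QWEN_MC = 45.0            # best open-weight 8B model, multiple-choice
FINETUNED_MC = 53.5       # fine-tuned Molmo2-4B, multiple-choice
FINETUNED_MC_GAIN = 11.5
FINETUNED_BINARY = 66.6   # fine-tuned Molmo2-4B, binary 0-shot
FINETUNED_BINARY_GAIN = 11.3
HUMAN_3SHOT_GAIN = 13.6
MODEL_3SHOT_GAIN = 3.0    # "about 3 points"; assumed exact here


def diff(a, b):
    """Difference rounded to one decimal to avoid float noise."""
    return round(a - b, 1)


base_mc = diff(FINETUNED_MC, FINETUNED_MC_GAIN)              # implied base, multiple-choice
base_binary = diff(FINETUNED_BINARY, FINETUNED_BINARY_GAIN)  # implied base, binary 0-shot
proprietary_gap = diff(GEMINI_MC, QWEN_MC)                   # proprietary vs. open 8B
finetuned_margin = diff(FINETUNED_MC, QWEN_MC)               # fine-tuned 4B vs. best open 8B
few_shot_gap = diff(HUMAN_3SHOT_GAIN, MODEL_3SHOT_GAIN)      # human vs. average model gain

print(base_mc, base_binary, proprietary_gap, finetuned_margin, few_shot_gap)
```

Under these inputs, the implied base Molmo2-4B scores are 42.0% (multiple-choice) and 55.3% (binary 0-shot), the fine-tuned model leads the best open 8B model by 8.5 points on multiple-choice, and the best proprietary model still leads that open 8B model by 24.9 points.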

Key Points

VideoNet benchmarks 1,000 actions from 37 domains using carefully curated clips with hard negatives; expert verification indicates that the labels are approximately 97% accurate.
Current VLMs show limited domain-specific action understanding and weak utilization of in-context video examples, with open models performing substantially worse than proprietary systems and far below human few-shot learning gains.
Training on automatically collected domain-specific action data (162K clips with strict filtering) is more effective than relying on test-time few-shot examples, enabling a fine-tuned 4B model to outperform all evaluated open 8B models on the benchmark.

References

arXiv: https://arxiv.org/abs/2605.02834v1
Fugu-MT: https://fugumt.com/fugumt/paper_check/2605.02834v1
Project: https://tanu.sh/research/videonet
