IF-VidCap: Can Video Caption Models Follow Instructions?
- URL: http://arxiv.org/abs/2510.18726v1
- Date: Tue, 21 Oct 2025 15:25:08 GMT
- Title: IF-VidCap: Can Video Caption Models Follow Instructions?
- Authors: Shihao Li, Yuanxing Zhang, Jiangtao Wu, Zhide Lei, Yiwen He, Runzhe Wen, Chenxi Liao, Chengkang Jiang, An Ping, Shuo Gao, Suhan Wang, Zhaozhou Bian, Zijun Zhou, Jingyi Xie, Jiayi Zhou, Jing Wang, Yifan Yao, Weihao Xie, Yingshui Tan, Yanghai Wang, Qianqian Xie, Zhaoxiang Zhang, Jiaheng Liu
- Abstract summary: We introduce IF-VidCap, a new benchmark for evaluating controllable video captioning. IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness.
- Score: 44.2412700621584
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.
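To make the two evaluation dimensions concrete, below is a minimal, illustrative sketch of how format correctness (rule-checkable constraints such as word limits, required keywords, or structured output) can be separated from content correctness (commonly scored with an LLM-as-judge rubric against reference annotations). The constraint names, prompt wording, and helper functions are hypothetical assumptions and do not reproduce IF-VidCap's actual evaluation pipeline.

```python
# Illustrative sketch only: a hypothetical two-dimension caption check in the
# spirit of "format correctness" vs. "content correctness". The constraint
# names and judge prompt are assumptions, not IF-VidCap's actual evaluator.
import json
import re


def check_format(caption: str, constraints: dict) -> dict:
    """Rule-checkable format constraints (hypothetical examples)."""
    results = {}
    if "max_words" in constraints:
        results["max_words"] = len(caption.split()) <= constraints["max_words"]
    if "must_include" in constraints:
        results["must_include"] = all(
            kw.lower() in caption.lower() for kw in constraints["must_include"]
        )
    if constraints.get("json_output"):
        try:
            json.loads(caption)
            results["json_output"] = True
        except json.JSONDecodeError:
            results["json_output"] = False
    if "bullet_points" in constraints:
        bullets = re.findall(r"^\s*[-*]\s+", caption, flags=re.MULTILINE)
        results["bullet_points"] = len(bullets) == constraints["bullet_points"]
    return results


def content_judge_prompt(caption: str, reference: str) -> str:
    """Content correctness is harder to rule-check; one common approach is an
    LLM-as-judge rubric comparing the caption against reference annotations."""
    return (
        "Rate whether the caption faithfully describes the reference events.\n"
        f"Reference: {reference}\nCaption: {caption}\n"
        "Answer with a score from 1 to 5 and a one-sentence justification."
    )


if __name__ == "__main__":
    caption = "- A chef dices onions.\n- The pan catches fire briefly."
    constraints = {"max_words": 30, "must_include": ["chef"], "bullet_points": 2}
    print(check_format(caption, constraints))  # e.g. {'max_words': True, ...}
```

In this sketch, format checks are deterministic and can be scripted per instruction, while content scoring is delegated to a judge model; keeping the two scores separate makes it possible to distinguish a caption that follows the requested format but misdescribes the video from one that is accurate but ignores the instruction.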
Related papers
- Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions [74.27249614046309]
  ASID-1M is an open-source collection of one million structured, fine-grained audiovisual instruction annotations. ASID-Verify is a scalable data curation pipeline for annotation. ASID-Captioner is a video understanding model trained via Supervised Fine-Tuning.
  arXiv Detail & Related papers (2026-02-13T15:20:54Z)
- GLaVE-Cap: Global-Local Aligned Video Captioning with Vision Expert Integration [57.5390432800788]
  We propose GLaVE-Cap, a Global-Local aligned framework with Vision Expert integration for Captioning. We construct GLaVE-Bench, a comprehensive video captioning benchmark featuring 5X more queries per video than existing benchmarks. We also provide a training dataset, GLaVE-1.2M, containing 16K high-quality fine-grained video captions and 1.2M related question-answer pairs.
  arXiv Detail & Related papers (2025-09-14T17:25:55Z)
- Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation [118.5096631571738]
  We present Any2Caption, a novel framework for controllable video generation under any condition. By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs--text, images, videos, and specialized cues such as region, motion, and camera poses--into dense, structured captions. Comprehensive evaluations demonstrate significant improvements of our system in controllability and video quality across various aspects of existing video generation models.
  arXiv Detail & Related papers (2025-03-31T17:59:01Z)
- CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness [30.44039177018447]
  CAPability is a comprehensive benchmark for evaluating visual captioning across 12 dimensions spanning six critical views. We curate nearly 11K human-annotated images and videos with visual element annotations to evaluate the generated captions.
  arXiv Detail & Related papers (2025-02-19T07:55:51Z)
- CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval [24.203328970223527]
  We present CaReBench, a testing benchmark for fine-grained video captioning and retrieval. Uniquely, it provides manually separated spatial and temporal annotations for each video. Based on this design, we introduce two evaluation metrics, ReBias and CapST, specifically tailored for video retrieval and video captioning tasks.
  arXiv Detail & Related papers (2024-12-31T15:53:50Z)
- Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage [50.84150600032693]
  Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. We propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions. Our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V.
  arXiv Detail & Related papers (2024-12-20T01:37:22Z)
- Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
  CapFSAR is a framework to exploit knowledge of multimodal models without manually annotating text. A Transformer-based visual-text aggregation module is further designed to incorporate cross-modal-temporal complementary information. Experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
  arXiv Detail & Related papers (2023-10-16T07:08:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.