OwlCap: Harmonizing Motion-Detail for Video Captioning via HMD-270K and Caption Set Equivalence Reward
- URL: http://arxiv.org/abs/2508.18634v2
- Date: Wed, 27 Aug 2025 03:53:24 GMT
- Title: OwlCap: Harmonizing Motion-Detail for Video Captioning via HMD-270K and Caption Set Equivalence Reward
- Authors: Chunlin Zhong, Qiuxia Hou, Zhangjun Zhou, Shuang Hao, Haonan Lu, Yanhao Zhang, He Tang, Xiang Bai
- Abstract summary: Video captioning methods often suffer from motion-detail imbalance. OwlCap is a powerful video captioning multi-modal large language model (MLLM) with motion-detail balance.
- Score: 47.36044825976202
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video captioning aims to generate comprehensive and coherent descriptions of video content, contributing to the advancement of both video understanding and generation. However, existing methods often suffer from motion-detail imbalance, as models tend to overemphasize one aspect while neglecting the other. This imbalance results in incomplete captions, which in turn lead to a lack of consistency in video understanding and generation. To address this issue, we propose solutions from two aspects: 1) Data aspect: We construct the Harmonizing Motion-Detail 270K (HMD-270K) dataset through a two-stage pipeline: Motion-Detail Fusion (MDF) and Fine-Grained Examination (FGE). 2) Optimization aspect: We introduce the Caption Set Equivalence Reward (CSER) based on Group Relative Policy Optimization (GRPO). CSER enhances completeness and accuracy in capturing both motion and details through unit-to-set matching and bidirectional validation. Based on HMD-270K supervised fine-tuning and GRPO post-training with CSER, we develop OwlCap, a powerful video captioning multi-modal large language model (MLLM) with motion-detail balance. Experimental results demonstrate that OwlCap achieves significant improvements over baseline models on two benchmarks: the detail-focused VDC (+4.2 Acc) and the motion-focused DREAM-1K (+4.6 F1). The HMD-270K dataset and OwlCap model will be publicly released to facilitate advances in the video captioning research community.
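The abstract describes CSER only at a high level: captions are decomposed into units, matched unit-to-set in both directions, and the result is used as a reward inside GRPO. The sketch below is a minimal, assumption-laden illustration of that idea rather than the paper's implementation; the caption-unit decomposition, the `toy_match` matcher, the equal weighting of the two directions, and the example captions are all hypothetical stand-ins.

```python
# Hypothetical sketch of a CSER-style reward plugged into GRPO-style
# group-relative advantages. Unit extraction, matching, and weighting
# are illustrative assumptions, not the paper's released implementation.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CaptionUnits:
    """A caption decomposed into atomic motion/detail units (events, attributes)."""
    units: List[str]


def unit_to_set_coverage(src: CaptionUnits, dst: CaptionUnits,
                         match: Callable[[str, str], bool]) -> float:
    """Fraction of src units matched by at least one unit in dst."""
    if not src.units:
        return 1.0
    hits = sum(any(match(u, v) for v in dst.units) for u in src.units)
    return hits / len(src.units)


def cser_reward(pred: CaptionUnits, ref: CaptionUnits,
                match: Callable[[str, str], bool]) -> float:
    """Bidirectional validation: completeness (reference covered by prediction)
    and accuracy (prediction supported by reference), equally weighted (assumption)."""
    completeness = unit_to_set_coverage(ref, pred, match)
    accuracy = unit_to_set_coverage(pred, ref, match)
    return 0.5 * (completeness + accuracy)


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantage: standardize each sampled caption's reward within its group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]


def toy_match(a: str, b: str) -> bool:
    """Toy token-overlap matcher; a real system would use an LLM or embedding judge."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) >= 0.5


if __name__ == "__main__":
    ref = CaptionUnits(["a man runs across the street", "the man wears a red jacket"])
    group = [
        CaptionUnits(["a man runs across the street"]),                                # motion only
        CaptionUnits(["a man runs across the street", "the man wears a red jacket"]),  # balanced
    ]
    rewards = [cser_reward(pred, ref, toy_match) for pred in group]
    print(rewards)                              # [0.75, 1.0]
    print(group_relative_advantages(rewards))   # [-1.0, 1.0]
```

In a full pipeline the token-overlap matcher would typically be replaced by an LLM- or embedding-based judge, and the group-relative advantages would drive a GRPO policy-gradient update of the captioning MLLM.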
Related papers
- IF-VidCap: Can Video Caption Models Follow Instructions? [44.2412700621584]
We introduce IF-VidCap, a new benchmark for evaluating controllable video captioning. IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness.
arXiv Detail & Related papers (2025-10-21T15:25:08Z)
- AVC-DPO: Aligned Video Captioning via Direct Preference Optimization [50.08618093204503]
Video multimodal large language models (video MLLMs) have achieved substantial progress in video captioning tasks. We propose Aligned Video Captioning via Direct Preference Optimization (AVC-DPO), a post-training framework designed to enhance captioning capabilities in video MLLMs through preference alignment. AVC-DPO achieved first place on the Video Detailed Captioning benchmark in the LOVE@PRCV'25 Workshop Track 1A: Video Detailed Captioning Challenge.
arXiv Detail & Related papers (2025-07-02T08:51:45Z)
- video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models [33.70837005629285]
We present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimised using direct preference optimisation (DPO). Experimental results show that MrDPO significantly enhances video-SALMONN 2's captioning accuracy, reducing the captioning error rate by 28%.
arXiv Detail & Related papers (2025-06-18T07:58:41Z)
- VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing in video sequences. A novel temporal mask fusion module employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z)
- The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation [31.44879457190659]
We propose a simple and effective inference optimization method to fully unleash the potential of LMMs in referring video segmentation. Our solution achieved 61.98% J&F on the MeViS test set and ranked 1st in the 4th PVUW Challenge MeViS Track at CVPR 2025.
arXiv Detail & Related papers (2025-04-07T15:24:54Z)
- Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation [118.5096631571738]
We present Any2Caption, a novel framework for controllable video generation under any condition. By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs (text, images, videos, and specialized cues such as region, motion, and camera poses) into dense, structured captions. Comprehensive evaluations demonstrate significant improvements in controllability and video quality across various existing video generation models.
arXiv Detail & Related papers (2025-03-31T17:59:01Z)
- MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models [30.139277087078764]
MotionBench is an evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. It includes data collected from diverse sources, ensuring a broad representation of real-world video content. Our benchmark aims to guide and motivate the development of more capable video understanding models.
arXiv Detail & Related papers (2025-01-06T11:57:38Z)
- VidTwin: Video VAE with Decoupled Structure and Dynamics [24.51768013474122]
VidTwin is a compact video autoencoder that decouples video into two distinct latent spaces. Structure latent vectors capture overall content and global movement, and Dynamics latent vectors represent fine-grained details and rapid movements. Experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality.
arXiv Detail & Related papers (2024-12-23T17:16:58Z) - Video-Teller: Enhancing Cross-Modal Generation with Fusion and
Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z)
- Understanding Road Layout from Videos as a Whole [82.30800791500869]
We formulate road layout understanding as a top-view road-attribute prediction problem, with the goal of predicting these attributes for each frame both accurately and consistently.
We exploit the following three novel aspects: leveraging camera motions in videos, including context cues, and incorporating long-term video information.
arXiv Detail & Related papers (2020-07-02T00:59:15Z)