LongCaptioning: Unlocking the Power of Long Video Caption Generation in Large Multimodal Models
- URL: http://arxiv.org/abs/2502.15393v2
- Date: Sat, 01 Mar 2025 02:06:59 GMT
- Title: LongCaptioning: Unlocking the Power of Long Video Caption Generation in Large Multimodal Models
- Authors: Hongchen Wei, Zhihong Tan, Yaosi Hu, Chang Wen Chen, Zhenzhong Chen
- Abstract summary: Large Multimodal Models (LMMs) have demonstrated exceptional performance in video captioning tasks. In this paper, we investigate the limitations of LMMs in generating long captions for long videos. We propose the LongCaption-Agent, a framework that synthesizes long caption data by hierarchical semantic aggregation.
- Score: 52.05596926411973
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Multimodal Models (LMMs) have demonstrated exceptional performance in video captioning tasks, particularly for short videos. However, as the length of the video increases, generating long, detailed captions becomes a significant challenge. In this paper, we investigate the limitations of LMMs in generating long captions for long videos. Our analysis reveals that open-source LMMs struggle to consistently produce outputs exceeding 300 words, leading to incomplete or overly concise descriptions of the visual content. This limitation hinders the ability of LMMs to provide comprehensive and detailed captions for long videos, ultimately missing important visual information. Through controlled experiments, we find that the scarcity of paired examples with long captions during training is the primary factor limiting the model's output length. However, manually annotating long-caption examples for long-form videos is time-consuming and expensive. To overcome the annotation bottleneck, we propose the LongCaption-Agent, a framework that synthesizes long caption data by hierarchical semantic aggregation. Using LongCaption-Agent, we curated a new long-caption dataset, LongCaption-10K. We also develop LongCaption-Bench, a benchmark designed to comprehensively evaluate the quality of long captions generated by LMMs. By incorporating LongCaption-10K into training, we enable LMMs to generate captions exceeding 1,000 words for long-form videos, while maintaining high output quality. On LongCaption-Bench, our model achieves state-of-the-art performance, even surpassing larger proprietary models like GPT-4o.
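The abstract gives no implementation details; the snippet below is only a minimal sketch of how hierarchical semantic aggregation could be organized, assuming hypothetical `caption_clip` and `merge_captions` helpers that stand in for the LMM calls LongCaption-Agent would make.

```python
# Minimal sketch of hierarchical semantic aggregation (illustrative only).
# `caption_clip` and `merge_captions` are hypothetical stand-ins for the
# LMM calls LongCaption-Agent would make; they are not from the paper.

from typing import Callable, List


def hierarchical_caption(
    clips: List[str],
    caption_clip: Callable[[str], str],
    merge_captions: Callable[[List[str]], str],
    group_size: int = 4,
) -> str:
    """Caption each short clip, then merge the captions level by level
    until a single long caption covering the whole video remains."""
    if not clips:
        return ""
    captions = [caption_clip(c) for c in clips]          # level 0: clip-level captions
    while len(captions) > 1:
        merged = []
        for i in range(0, len(captions), group_size):    # aggregate neighboring segments
            merged.append(merge_captions(captions[i:i + group_size]))
        captions = merged                                 # next, coarser semantic level
    return captions[0]                                    # full-video long caption
```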
Related papers
- LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models [60.79418872734049]
LongWriter-V-22k is a dataset of 22,158 examples with multiple input images, an instruction, and corresponding outputs ranging from 0 to 10,000 words.
We propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs.
Our 7B parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on a benchmark.
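As a rough illustration of the segment-then-correct idea, the sketch below shows how preference pairs might be assembled; the `correct` callable and the character-level segmentation are assumptions, not details from the paper.

```python
# Rough sketch of assembling IterDPO-style preference pairs (illustrative only).

from typing import Callable, List, Tuple


def build_preference_pairs(
    long_output: str,
    correct: Callable[[str], str],
    seg_len: int = 1000,
) -> List[Tuple[str, str]]:
    """Split a long output into segments and pair each original segment
    (rejected) with an iteratively corrected version (chosen)."""
    segments = [long_output[i:i + seg_len] for i in range(0, len(long_output), seg_len)]
    pairs = []
    for seg in segments:
        improved = correct(seg)          # e.g., a model- or human-edited revision
        pairs.append((improved, seg))    # (chosen, rejected) pair for DPO-style training
    return pairs
```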
arXiv Detail & Related papers (2025-02-20T18:47:36Z)
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
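A rough sketch of the similarity-based frame pruning idea: embed each frame with DINOv2, then drop frames that are too similar to the last kept frame. The threshold and function names are illustrative, not LongVU's actual implementation.

```python
# Illustrative sketch of similarity-based frame pruning (not LongVU's code).

import numpy as np


def prune_redundant_frames(features: np.ndarray, threshold: float = 0.95) -> list:
    """features: (num_frames, dim) array of per-frame DINOv2 embeddings.
    Returns indices of frames to keep."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    keep = [0]                                      # always keep the first frame
    for i in range(1, len(normed)):
        sim = float(normed[i] @ normed[keep[-1]])   # cosine similarity to last kept frame
        if sim < threshold:                         # sufficiently different -> keep it
            keep.append(i)
    return keep
```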
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
- LVD-2M: A Long-take Video Dataset with Temporally Dense Captions [68.88624389174026]
We introduce a new pipeline for selecting high-quality long-take videos and generating temporally dense captions.
Specifically, we define a set of metrics to quantitatively assess video quality including scene cuts, dynamic degrees, and semantic-level quality.
We curate the first long-take video dataset, LVD-2M, comprising 2 million long-take videos, each covering more than 10 seconds and annotated with temporally dense captions.
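The sketch below illustrates how such quality metrics could drive the selection step; the dataclass, thresholds, and field names are assumptions rather than LVD-2M's exact criteria.

```python
# Hedged sketch of metric-based filtering for long-take videos (assumed thresholds).

from dataclasses import dataclass


@dataclass
class VideoMetrics:
    num_scene_cuts: int       # detected hard cuts in the clip
    dynamic_degree: float     # e.g., mean optical-flow magnitude
    semantic_quality: float   # e.g., an image-text alignment score
    duration_s: float         # clip duration in seconds


def is_high_quality_long_take(m: VideoMetrics) -> bool:
    """Keep only single-shot clips that are long enough, dynamic, and describable."""
    return (
        m.num_scene_cuts == 0          # a long take contains no scene cuts
        and m.duration_s > 10.0        # abstract: each video covers more than 10 seconds
        and m.dynamic_degree > 0.5     # assumed threshold: enough motion
        and m.semantic_quality > 0.6   # assumed threshold: content worth captioning
    )
```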
arXiv Detail & Related papers (2024-10-14T17:59:56Z)
- Visual Context Window Extension: A New Perspective for Long Video Understanding [45.134271969594614]
We tackle the challenge of long video understanding from the perspective of context windows.
We propose to adapt LMMs for long video understanding tasks by extending the visual context window.
Our method consistently improves the performance as the number of video frames increases.
arXiv Detail & Related papers (2024-09-30T07:25:16Z)
- Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding [25.61734041983714]
Video-XL is a novel approach that leverages MLLMs' inherent key-value sparsification capacity to condense the visual input.
Video-XL's effectiveness is verified from three aspects. First, it achieves a superior long-video understanding capability, outperforming state-of-the-art models of comparable sizes.
arXiv Detail & Related papers (2024-09-22T15:13:31Z)
- Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input [34.50993235961505]
Kangaroo is a powerful Video LMM aimed at addressing the challenges of processing long videos.
It includes a data curation system that builds a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning.
It adopts a curriculum training pipeline that gradually increases the resolution and number of input frames to accommodate long videos.
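The snippet below is an illustrative curriculum schedule in that spirit; the stage count, resolutions, and frame budgets are invented for the example and are not Kangaroo's actual settings.

```python
# Stage counts, resolutions, and frame budgets below are invented for the example.
CURRICULUM_STAGES = [
    {"resolution": 224, "num_frames": 8},    # stage 1: short clips, low resolution
    {"resolution": 336, "num_frames": 32},   # stage 2: longer clips, higher resolution
    {"resolution": 448, "num_frames": 64},   # stage 3: long videos, full budget
]


def stage_config(step: int, steps_per_stage: int = 10_000) -> dict:
    """Pick the curriculum stage for a given optimization step."""
    idx = min(step // steps_per_stage, len(CURRICULUM_STAGES) - 1)
    return CURRICULUM_STAGES[idx]
```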
arXiv Detail & Related papers (2024-08-28T05:34:14Z)
- LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs [57.23637303451716]
Long context large language models (LLMs) can process inputs up to 100,000 tokens, yet struggle to generate outputs exceeding even a modest length of 2,000 words.
We introduce AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks.
We construct LongWriter-6k, a dataset of 6,000 SFT examples with output lengths ranging from 2k to 32k words.
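A minimal sketch of the plan-then-write decomposition behind an AgentWrite-style pipeline is shown below; the `llm` callable and the prompts are hypothetical stand-ins, not the paper's implementation.

```python
from typing import Callable, List


def agent_write(task: str, llm: Callable[[str], str], num_sections: int = 8) -> str:
    """Plan-then-write: decompose an ultra-long writing task into section-level
    subtasks, then generate the sections one by one."""
    # Step 1: ask the model for a section-level plan of the whole piece.
    plan = llm(
        f"Break the writing task below into {num_sections} section titles, "
        f"one per line, each with a target word count.\n\nTask: {task}"
    )
    sections: List[str] = [line for line in plan.splitlines() if line.strip()]

    # Step 2: write each section, conditioning on what has been written so far.
    draft: List[str] = []
    for section in sections:
        context = "\n".join(draft)[-2000:]   # keep only a recent window of the draft
        draft.append(llm(
            f"Task: {task}\nAlready written (truncated):\n{context}\n"
            f"Now write the section: {section}"
        ))
    return "\n\n".join(draft)
```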
arXiv Detail & Related papers (2024-08-13T17:46:12Z)
- Long Context Transfer from Language to Vision [74.78422371545716]
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos.
In this paper, we approach this problem from the perspective of the language model.
By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training.
arXiv Detail & Related papers (2024-06-24T17:58:06Z)
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
- LongAlign: A Recipe for Long Context Alignment of Large Language Models [61.85923382850057]
LongAlign is a recipe covering instruction data construction, training, and evaluation for long context alignment.
We construct a long instruction-following dataset using Self-Instruct.
We adopt packing and sorted batching strategies to speed up supervised fine-tuning on data with varied length distributions.
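The sketch below shows one way the sorted and packing strategies could be combined for batching; the greedy bin-packing and the `max_tokens` budget are assumptions, not LongAlign's exact recipe.

```python
from typing import List


def pack_sorted(lengths: List[int], max_tokens: int = 32_768) -> List[List[int]]:
    """Sort sample indices by token length, then greedily pack them into
    batches whose total length stays under `max_tokens`."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])   # sorted strategy
    batches: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for i in order:
        if current and current_tokens + lengths[i] > max_tokens:    # close the current pack
            batches.append(current)
            current, current_tokens = [], 0
        current.append(i)                                           # packing strategy
        current_tokens += lengths[i]
    if current:
        batches.append(current)
    return batches
```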
arXiv Detail & Related papers (2024-01-31T18:29:39Z)