Towards Long Video Understanding via Fine-detailed Video Story Generation
- URL: http://arxiv.org/abs/2412.06182v2
- Date: Wed, 11 Dec 2024 11:07:35 GMT
- Title: Towards Long Video Understanding via Fine-detailed Video Story Generation
- Authors: Zeng You, Zhiquan Wen, Yaofo Chen, Xin Li, Runhao Zeng, Yaowei Wang, Mingkui Tan
- Abstract summary: Long video understanding has become a critical task in computer vision, driving advancements across numerous applications from surveillance to content retrieval.
Existing video understanding methods face two challenges when dealing with long videos: intricate long-context relationship modeling and interference from redundancy.
We introduce Fine-Detailed Video Story generation (FDVS), which interprets long videos into detailed textual representations.
- Score: 58.31050916006673
- License:
- Abstract: Long video understanding has become a critical task in computer vision, driving advancements across numerous applications from surveillance to content retrieval. Existing video understanding methods face two challenges when dealing with long videos: intricate long-context relationship modeling and interference from redundancy. To tackle these challenges, we introduce Fine-Detailed Video Story generation (FDVS), which interprets long videos into detailed textual representations. Specifically, to achieve fine-grained modeling of long-temporal content, we propose a Bottom-up Video Interpretation Mechanism that progressively interprets video content from clips to the whole video. To avoid interference from redundant information in videos, we introduce a Semantic Redundancy Reduction mechanism that removes redundancy at both the visual and textual levels. Our method transforms long videos into hierarchical textual representations that contain multi-granularity information about the video. With these representations, FDVS is applicable to various tasks without any fine-tuning. We evaluate the proposed method across eight datasets spanning three tasks, and the results demonstrate the effectiveness and versatility of our method.
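A minimal sketch of how such a pipeline could be wired together, based only on the abstract above: clips are captioned bottom-up, semantically redundant captions are dropped, and the remainder is aggregated into segment-level and video-level summaries. The helper callables (`caption_clip`, `embed_text`, `summarize_texts`), the grouping size, and the redundancy threshold are assumptions for illustration, not the authors' released interface.

```python
# Illustrative sketch of an FDVS-style pipeline (assumptions, not the official code).
# caption_clip, embed_text, and summarize_texts are user-supplied stand-ins for a
# vision-language captioner, a text encoder, and an LLM summarizer, respectively.
from typing import Callable, List
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def build_video_story(
    clips: List[object],                          # short clips cut from the long video
    caption_clip: Callable[[object], str],        # VLM: clip -> caption (assumed interface)
    embed_text: Callable[[str], np.ndarray],      # text encoder for redundancy checks
    summarize_texts: Callable[[List[str]], str],  # LLM: captions -> higher-level summary
    group_size: int = 8,
    redundancy_threshold: float = 0.9,
) -> dict:
    """Bottom-up interpretation: clips -> segment summaries -> video story,
    dropping captions that are semantically redundant with ones already kept."""
    # 1) Clip-level captions (fine-grained interpretation).
    captions = [caption_clip(c) for c in clips]

    # 2) Textual-level redundancy reduction: keep a caption only if it is not
    #    too similar to any caption retained so far.
    kept: List[str] = []
    kept_embs: List[np.ndarray] = []
    for cap in captions:
        emb = embed_text(cap)
        if all(cosine(emb, e) < redundancy_threshold for e in kept_embs):
            kept.append(cap)
            kept_embs.append(emb)

    # 3) Aggregate kept captions into segment summaries, then a video-level story.
    segments = [
        summarize_texts(kept[i:i + group_size])
        for i in range(0, len(kept), group_size)
    ]
    story = summarize_texts(segments)

    # Hierarchical textual representation: multiple granularities of the same video.
    return {"clip_captions": kept, "segment_summaries": segments, "video_story": story}
```

Downstream tasks (e.g., question answering or retrieval) would then operate on the returned hierarchical text rather than on raw frames, which is what allows the representation to be reused across tasks without fine-tuning.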
Related papers
- SEAL: Semantic Attention Learning for Long Video Representation [31.994155533019843]
This paper introduces SEmantic Attention Learning (SEAL), a novel unified representation for long videos.
To reduce computational complexity, long videos are decomposed into three distinct types of semantic entities.
Our representation is versatile, enabling applications across various long video understanding tasks.
arXiv Detail & Related papers (2024-12-02T18:46:12Z) - SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content.
We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context.
Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries.
arXiv Detail & Related papers (2024-11-25T08:04:47Z) - From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding [52.696422425058245]
MultiModal Large Language Models (LLMs) with visual encoders have recently shown promising performance in visual understanding tasks.
Our paper focuses on the substantial differences and unique challenges posed by long video understanding compared to static image and short video understanding.
arXiv Detail & Related papers (2024-09-27T17:38:36Z) - OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer [14.503628667535425]
Processing extensive videos presents significant challenges due to the vast data and processing demands.
We develop OmAgent, which efficiently stores and retrieves relevant video frames for specific queries.
It features a Divide-and-Conquer Loop capable of autonomous reasoning.
We have endowed it with greater autonomy and a robust tool-calling system, enabling it to accomplish even more intricate tasks.
arXiv Detail & Related papers (2024-06-24T13:05:39Z) - Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos [58.53311308617818]
We present a new multi-shot video understanding benchmark Shot2Story with detailed shot-level captions, comprehensive video summaries and question-answering pairs.
Preliminary experiments show the challenges of generating long and comprehensive summaries for multi-shot videos.
Even these imperfect generated summaries already achieve competitive performance on existing video understanding tasks.
arXiv Detail & Related papers (2023-12-16T03:17:30Z) - Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation [74.51546366251753]
Video topic segmentation unveils the coarse-grained semantic structure underlying videos.
We introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames.
Our proposed solution significantly surpasses baseline methods in terms of both accuracy and transferability.
arXiv Detail & Related papers (2023-11-30T21:59:05Z) - Query-aware Long Video Localization and Relation Discrimination for Deep Video Understanding [15.697251303126874]
The Deep Video Understanding (DVU) Challenge aims to push the boundaries of multimodal extraction, fusion, and analytics.
This paper introduces a query-aware method for long video localization and relation discrimination, leveraging an image-language pretrained model.
Our approach achieved first and fourth positions for two groups of movie-level queries.
arXiv Detail & Related papers (2023-10-19T13:26:02Z) - A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story understanding benchmarks, we publicly release the first dataset for a crucial task in computational social science: persuasion strategy identification.
arXiv Detail & Related papers (2023-05-16T19:13:11Z) - Contextual Explainable Video Representation: Human Perception-based Understanding [10.172332586182792]
We discuss approaches that incorporate the human perception process into modeling actors, objects, and the environment.
We choose video paragraph captioning and temporal action detection to illustrate the effectiveness of human perception-based contextual representation in video understanding.
arXiv Detail & Related papers (2022-12-12T19:29:07Z)