Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
- URL: http://arxiv.org/abs/2304.04227v3
- Date: Wed, 24 May 2023 14:01:54 GMT
- Title: Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
- Authors: Jun Chen, Deyao Zhu, Kilichbek Haydarov, Xiang Li, Mohamed Elhoseiny
- Abstract summary: Video captioning conveys dynamic scenes from videos using natural language, advancing the understanding of our environment.
In this work, we introduce Video ChatCaptioner, an innovative approach for creating more comprehensive video descriptions.
- Score: 30.650879247687747
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video captioning aims to convey dynamic scenes from videos using natural
language, facilitating the understanding of spatiotemporal information within
our environment. Although there have been recent advances, generating detailed
and enriched video descriptions continues to be a substantial challenge. In
this work, we introduce Video ChatCaptioner, an innovative approach for
creating more comprehensive spatiotemporal video descriptions. Our method
employs a ChatGPT model as a controller, specifically designed to select frames
for posing video content-driven questions. Subsequently, a robust algorithm is
utilized to answer these visual queries. This question-answer framework
effectively uncovers intricate video details and shows promise as a method for
enhancing video content. Following multiple conversational rounds, ChatGPT can
summarize enriched video content based on previous conversations. We
qualitatively demonstrate that our Video ChatCaptioner can generate captions
containing more visual details about the videos. The code is publicly available
at https://github.com/Vision-CAIR/ChatCaptioner
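
The abstract describes a simple control loop: a ChatGPT-style controller picks frames and poses questions, a visual question-answering model answers them, and after several rounds the controller summarizes the dialogue into a caption. Below is a minimal, hypothetical Python sketch of that loop; the `ControllerLLM` and `VQAModel` interfaces, their method names, and the fixed round count are illustrative assumptions rather than the paper's actual implementation, which lives at the linked repository.

```python
# Minimal sketch of the question-answer captioning loop described above.
# The ControllerLLM / VQAModel interfaces and all method names here are
# hypothetical stand-ins, not the paper's actual API.
from typing import Protocol


class ControllerLLM(Protocol):
    """ChatGPT-style controller: selects frames, asks, and summarizes."""

    def next_question(
        self, num_frames: int, history: list[tuple[int, str, str]]
    ) -> tuple[int, str]: ...

    def summarize(self, history: list[tuple[int, str, str]]) -> str: ...


class VQAModel(Protocol):
    """Vision-language answerer (e.g., a BLIP-2-style VQA model)."""

    def answer(self, frame, question: str) -> str: ...


def caption_video(
    frames: list, controller: ControllerLLM, vqa: VQAModel, rounds: int = 10
) -> str:
    """Run several question-answer rounds, then summarize the dialogue."""
    history: list[tuple[int, str, str]] = []  # (frame_id, question, answer)
    for _ in range(rounds):
        # The controller selects a frame and poses a content-driven
        # question, conditioned on the conversation so far.
        frame_id, question = controller.next_question(len(frames), history)
        # The VQA model answers the question about the chosen frame.
        answer = vqa.answer(frames[frame_id], question)
        history.append((frame_id, question, answer))
    # After multiple conversational rounds, the controller summarizes the
    # enriched dialogue into a single spatiotemporal caption.
    return controller.summarize(history)
```

The abstract identifies the answering component only as "a robust algorithm"; the BLIP-2-style answerer above is an assumption for illustration.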
Related papers
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions [93.29360532845062]
We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions.
The series comprises ShareGPT4Video, 40K GPT-4V-annotated dense captions of videos of various lengths and sources, developed through a carefully designed data filtering and annotation strategy.
We further develop ShareCaptioner-Video, a superior captioner capable of efficiently generating high-quality captions for arbitrary videos.
arXiv Detail & Related papers (2024-06-06T17:58:54Z)
- Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos [58.13927287437394]
We present a new multi-shot video understanding benchmark Shot2Story20K with detailed shot-level captions and comprehensive video summaries.
Preliminary experiments show that generating a long and comprehensive video summary remains challenging.
arXiv Detail & Related papers (2023-12-16T03:17:30Z)
- VideoChat: Chat-Centric Video Understanding [80.63932941216129]
We develop an end-to-end chat-centric video understanding system, coined as VideoChat.
It integrates video foundation models and large language models via a learnable neural interface.
Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications.
arXiv Detail & Related papers (2023-05-10T17:59:04Z)
- Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? [131.300931102986]
In real-world scenarios, online videos are often accompanied by relevant text information such as titles, tags, and even subtitles.
We propose a novel approach to text-video retrieval, where we directly generate associated captions from videos using zero-shot video captioning.
We conduct comprehensive ablation studies to demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-12-31T11:50:32Z)
- Visual Commonsense-aware Representation Network for Video Captioning [84.67432867555044]
We propose a simple yet effective method, called Visual Commonsense-aware Representation Network (VCRN) for video captioning.
Our method reaches state-of-the-art performance, indicating its effectiveness.
arXiv Detail & Related papers (2022-11-17T11:27:15Z)
- Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, and IVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.