VideoChat: Chat-Centric Video Understanding
- URL: http://arxiv.org/abs/2305.06355v2
- Date: Thu, 4 Jan 2024 02:06:07 GMT
- Title: VideoChat: Chat-Centric Video Understanding
- Authors: KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali
Wang, Limin Wang, Yu Qiao
- Abstract summary: We develop an end-to-end chat-centric video understanding system, coined as VideoChat.
It integrates video foundation models and large language models via a learnable neural interface.
Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications.
- Score: 80.63932941216129
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we initiate an attempt to develop an end-to-end
chat-centric video understanding system, coined as VideoChat. It integrates
video foundation models and large language models via a learnable neural
interface, excelling in spatiotemporal reasoning, event localization, and
causal relationship inference. To instructively tune this system, we build a
video-centric instruction dataset, composed of thousands of videos associated
with detailed descriptions and conversations. This dataset emphasizes
spatiotemporal reasoning and captures causal relationships, providing a
valuable asset for training our chat-centric video understanding system.
Preliminary qualitative experiments demonstrate the potential of our system
across a broad spectrum of video applications, which could serve as a simple
prototype system for future research on chat-centric video understanding.
Access our code and data at https://github.com/OpenGVLab/Ask-Anything
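
To make the "learnable neural interface" idea concrete, the snippet below is a minimal, hypothetical sketch: a small set of learnable query tokens cross-attends to features from a frozen video encoder and is projected into the LLM's embedding space, so the LLM can consume the result as extra tokens during instruction tuning. All names and dimensions here (VideoLLMBridge, video_dim=1024, llm_dim=4096, 32 queries) are illustrative assumptions, not the released VideoChat architecture.

```python
# Minimal sketch (not the authors' implementation) of a learnable neural
# interface between a frozen video encoder and an LLM: learnable query tokens
# cross-attend to video features and are projected to the LLM embedding size.
import torch
import torch.nn as nn

class VideoLLMBridge(nn.Module):
    def __init__(self, video_dim=1024, llm_dim=4096, num_queries=32, num_heads=8):
        super().__init__()
        # Learnable queries that summarize the video features.
        self.queries = nn.Parameter(torch.randn(num_queries, video_dim) * 0.02)
        # Cross-attention: queries attend to the frozen encoder's video tokens.
        self.cross_attn = nn.MultiheadAttention(video_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(video_dim)
        # Linear projection into the LLM's token-embedding space.
        self.proj = nn.Linear(video_dim, llm_dim)

    def forward(self, video_feats):
        # video_feats: (batch, num_video_tokens, video_dim) from a frozen video encoder.
        b = video_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        attended, _ = self.cross_attn(q, video_feats, video_feats)
        attended = self.norm(attended + q)
        # Output: (batch, num_queries, llm_dim) "video tokens" that are concatenated
        # with the text-instruction embeddings fed to the LLM.
        return self.proj(attended)

if __name__ == "__main__":
    bridge = VideoLLMBridge()
    fake_video = torch.randn(2, 256, 1024)   # e.g. 16 frames x 16 patch tokens
    print(bridge(fake_video).shape)           # torch.Size([2, 32, 4096])
```

In this kind of design only the bridge (and optionally the LLM via lightweight tuning) is trained on the video-instruction data, which keeps the approach end-to-end while reusing pretrained video and language models.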
Related papers
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models [59.525108086957296]
Video-ChatGPT is a multimodal model that merges a video-adapted visual encoder with an LLM.
It is capable of understanding and generating detailed conversations about videos.
We introduce a new dataset of 100,000 video-instruction pairs used to train Video-ChatGPT.
arXiv Detail & Related papers (2023-06-08T17:59:56Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for persuasion strategy identification, a crucial task in computational social science.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System [119.51012668709502]
We present our vision for multimodal and versatile video understanding and propose a prototype system, ChatVideo.
Our system is built upon a tracklet-centric paradigm, which treats tracklets as the basic video unit.
All the detected tracklets are stored in a database and interact with the user through a database manager (a toy sketch of this tracklet-record idea appears after this related-papers list).
arXiv Detail & Related papers (2023-04-27T17:59:58Z)
- Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions [30.650879247687747]
Video captioning aims to convey dynamic scenes from videos using natural language, facilitating the understanding of spatiotemporal information.
In this work, we introduce Video ChatCaptioner, an innovative approach for creating more comprehensive video descriptions.
arXiv Detail & Related papers (2023-04-09T12:46:18Z)
- HLVU: A New Challenge to Test Deep Understanding of Movies the Way Humans do [3.423039905282442]
We propose a new evaluation challenge and direction in the area of High-level Video Understanding.
The challenge we are proposing is designed to test automatic video analysis and understanding, and how accurately systems can comprehend a movie in terms of actors, entities, events and their relationship to each other.
A pilot High-Level Video Understanding dataset of open-source movies was collected for human assessors to build a knowledge graph representing each of them.
A set of queries will be derived from the knowledge graph to test systems on retrieving relationships among actors, as well as reasoning and retrieving non-visual concepts.
arXiv Detail & Related papers (2020-05-01T15:58:13Z)
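
As a closing illustration, here is a minimal, hypothetical sketch of the tracklet-centric idea referenced in the ChatVideo entry above: each detected track becomes a record in a small database that a chat front-end can filter before handing captions to an LLM. The Tracklet fields and the TrackletDB query API are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of a tracklet-centric store (assumed names, not ChatVideo's code):
# each detected object/person track is a record; a chat front-end queries the
# store and passes the matching captions to an LLM to answer the user.
from dataclasses import dataclass, field

@dataclass
class Tracklet:
    track_id: int
    category: str                 # e.g. "person", "car"
    start_frame: int
    end_frame: int
    caption: str = ""             # per-tracklet description from a captioner
    attributes: dict = field(default_factory=dict)

class TrackletDB:
    """Toy database manager: stores tracklets and answers simple filters."""
    def __init__(self):
        self._tracklets = []

    def add(self, t: Tracklet):
        self._tracklets.append(t)

    def query(self, category=None, active_at=None):
        results = self._tracklets
        if category is not None:
            results = [t for t in results if t.category == category]
        if active_at is not None:
            results = [t for t in results if t.start_frame <= active_at <= t.end_frame]
        return results

if __name__ == "__main__":
    db = TrackletDB()
    db.add(Tracklet(0, "person", 10, 180, caption="a man in a red jacket walking"))
    db.add(Tracklet(1, "dog", 40, 120, caption="a small dog running"))
    # "Who or what is visible around frame 100?" -> filter, then let an LLM phrase the answer.
    print([t.caption for t in db.query(active_at=100)])
```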