ChatVideo: A Tracklet-centric Multimodal and Versatile Video
Understanding System
- URL: http://arxiv.org/abs/2304.14407v2
- Date: Sat, 29 Apr 2023 03:48:26 GMT
- Title: ChatVideo: A Tracklet-centric Multimodal and Versatile Video
Understanding System
- Authors: Junke Wang and Dongdong Chen and Chong Luo and Xiyang Dai and Lu Yuan
and Zuxuan Wu and Yu-Gang Jiang
- Abstract summary: We present our vision for multimodal and versatile video understanding and propose a prototype system, ChatVideo.
Our system is built upon a tracklet-centric paradigm, which treats tracklets as the basic video unit.
All the detected tracklets are stored in a database and interact with the user through a database manager.
- Score: 119.51012668709502
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing deep video models are limited by specific tasks, fixed input-output
spaces, and poor generalization capabilities, making it difficult to deploy
them in real-world scenarios. In this paper, we present our vision for
multimodal and versatile video understanding and propose a prototype system,
ChatVideo. Our system is built upon a tracklet-centric paradigm, which treats
tracklets as the basic video unit and employs various Video Foundation Models
(ViFMs) to annotate their properties, e.g., appearance, motion, etc. All the
detected tracklets are stored in a database and interact with the user through
a database manager. We have conducted extensive case studies on different types
of in-the-wild videos, which demonstrate the effectiveness of our method in
answering various video-related problems. Our project is available at
https://www.wangjunke.info/ChatVideo/
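To make the tracklet-centric paradigm concrete, below is a minimal Python sketch of how a tracklet record and its database could be organized: each tracklet carries annotations produced by ViFMs and is stored in a table that user queries run against. The class names (Tracklet, TrackletDB) and fields are illustrative assumptions, not the authors' actual implementation or schema.

```python
# Illustrative sketch only: a hypothetical tracklet record and database
# for the tracklet-centric paradigm described above (not the authors' code).
from dataclasses import dataclass


@dataclass
class Tracklet:
    """One tracked object instance, annotated by (hypothetical) ViFM outputs."""
    tracklet_id: int
    category: str          # detector label, e.g. "person", "car"
    start_frame: int
    end_frame: int
    appearance: str = ""   # caption from an appearance model
    motion: str = ""       # caption from a motion/action model
    audio: str = ""        # transcript or sound tag, if present


class TrackletDB:
    """In-memory stand-in for the tracklet database and its manager."""

    def __init__(self) -> None:
        self._rows: list[Tracklet] = []

    def insert(self, tracklet: Tracklet) -> None:
        self._rows.append(tracklet)

    def query(self, keyword: str) -> list[Tracklet]:
        """Return tracklets whose annotations mention the keyword."""
        kw = keyword.lower()
        return [
            t for t in self._rows
            if kw in t.category.lower()
            or kw in t.appearance.lower()
            or kw in t.motion.lower()
            or kw in t.audio.lower()
        ]


if __name__ == "__main__":
    db = TrackletDB()
    db.insert(Tracklet(1, "person", 0, 120,
                       appearance="man in a red jacket",
                       motion="running left to right"))
    db.insert(Tracklet(2, "dog", 30, 150,
                       appearance="small brown dog",
                       motion="chasing the person"))
    # A question such as "who is running?" would be reduced to a query like this.
    for hit in db.query("running"):
        print(hit.tracklet_id, hit.category, hit.motion)
```

In the full system, a database manager (e.g., an LLM translating free-form user questions into database queries) would presumably stand between the user and this table, rather than the simple keyword matching shown here.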
Related papers
- Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos [35.974750867072345]
This paper considers the problem of Multi-Hop Video Question Answering (MH-VidQA) in long-form egocentric videos.
We develop an automated pipeline to create multi-hop question-answering pairs with associated temporal evidence.
We then propose a novel architecture, termed Grounding Scattered Evidence with Large Language Model (GeLM), that enhances multi-modal large language models.
arXiv Detail & Related papers (2024-08-26T17:58:47Z)
- MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z)
- VideoChat: Chat-Centric Video Understanding [80.63932941216129]
We develop an end-to-end chat-centric video understanding system, coined as VideoChat.
It integrates video foundation models and large language models via a learnable neural interface.
Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications.
arXiv Detail & Related papers (2023-05-10T17:59:04Z)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study of advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- Highlight Timestamp Detection Model for Comedy Videos via Multimodal Sentiment Analysis [1.6181085766811525]
We propose a multimodal structure to obtain state-of-the-art performance in this field.
We select several benchmarks for multimodal video understanding and apply the most suitable model to find the best performance.
arXiv Detail & Related papers (2021-05-28T08:39:19Z)
- Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.