Video Timeline Modeling For News Story Understanding
- URL: http://arxiv.org/abs/2309.13446v2
- Date: Fri, 27 Oct 2023 18:38:38 GMT
- Title: Video Timeline Modeling For News Story Understanding
- Authors: Meng Liu, Mingda Zhang, Jialu Liu, Hanjun Dai, Ming-Hsuan Yang,
Shuiwang Ji, Zheyun Feng, Boqing Gong
- Abstract summary: We present a novel problem, namely video timeline modeling.
Our objective is to create a video-associated timeline from a set of videos related to a specific topic, thereby facilitating understanding of the content and structure of the story being told.
This problem has significant potential in various real-world applications, for instance, news story summarization.
- Score: 123.03394373132353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a novel problem, namely video timeline modeling.
Our objective is to create a video-associated timeline from a set of videos
related to a specific topic, thereby facilitating understanding of the content
and structure of the story being told. This problem has significant potential
in various real-world applications, for instance, news story summarization. To
bootstrap research in this area, we curate a realistic benchmark dataset,
YouTube-News-Timeline, consisting of over 12k timelines and 300k YouTube
news videos. Additionally, we propose a set of quantitative metrics to
comprehensively evaluate and compare methodologies. With such a testbed, we
further develop and benchmark several deep learning approaches to tackling this
problem. We anticipate that this exploratory work will pave the way for further
research in video timeline modeling. The assets are available via
https://github.com/google-research/google-research/tree/master/video_timeline_modeling.
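To make the task concrete, below is a minimal sketch of how a timeline instance could be represented and how a trivial baseline might populate it. The class and field names (Timeline, TimelineEvent, build_timeline) are illustrative assumptions, not the dataset's actual schema; the toy baseline simply emits one event per video ordered by upload date, whereas a real model would cluster related videos into shared events.

```python
from dataclasses import dataclass, field

# Hypothetical schema: names and fields are illustrative, not the dataset's format.
@dataclass
class TimelineEvent:
    """One node on a story timeline, with the videos attached to it."""
    description: str                                      # short text for the event
    video_ids: list[str] = field(default_factory=list)   # YouTube IDs assigned here

@dataclass
class Timeline:
    """An ordered sequence of events for one news topic."""
    topic: str
    events: list[TimelineEvent]

def build_timeline(topic: str, videos: dict[str, str]) -> Timeline:
    """Toy baseline: one event per video, ordered by upload date.

    A real model would instead cluster related videos into shared
    events and order the clusters chronologically.
    """
    ordered = sorted(videos.items(), key=lambda kv: kv[1])  # kv[1] = upload date
    events = [TimelineEvent(description=f"Event for {vid}", video_ids=[vid])
              for vid, _date in ordered]
    return Timeline(topic=topic, events=events)

timeline = build_timeline(
    "hurricane landfall",
    {"abc123": "2023-08-01", "def456": "2023-08-03"},
)
print([e.video_ids for e in timeline.events])  # [['abc123'], ['def456']]
```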
Related papers
- Manipulating a Tetris-Inspired 3D Video Representation [0.0]
Video synopsis is a technique that performs video compression in a way that preserves the activity in the video.
We discuss different object-temporal data representations suitable for different applications.
We explore the application of a packing algorithm to solve the problem of video synopsis.
arXiv Detail & Related papers (2024-07-11T22:41:14Z)
- Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs).
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z)
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FMs) that achieves state-of-the-art results in video recognition, video-speech, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer [66.56167074658697]
We present a method that builds on 3D-VQGAN and transformers to generate videos with thousands of frames.
Our evaluation shows that our model trained on 16-frame video clips can generate diverse, coherent, and high-quality long videos.
We also showcase conditional extensions of our approach for generating meaningful long videos by incorporating temporal information with text and audio.
arXiv Detail & Related papers (2022-04-07T17:59:02Z)
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
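As a rough illustration of that three-part annotation, a per video-query record might look like the following sketch; the field names and the (start, end)-in-seconds moment format are assumptions, not the dataset's released schema.

```python
from dataclasses import dataclass

# Illustrative record mirroring the three annotation types described above;
# field names and moment encoding are assumptions, not QVHighlights' schema.
@dataclass
class QVHighlightsAnnotation:
    video_id: str
    query: str                          # (1) human-written free-form NL query
    moments: list[tuple[float, float]]  # (2) relevant (start, end) spans, seconds
    saliency: list[int]                 # (3) 1-5 score per query-relevant clip

ann = QVHighlightsAnnotation(
    video_id="dQw4w9WgXcQ",
    query="a crowd cheers at an outdoor concert",
    moments=[(12.0, 34.0)],
    saliency=[4, 5, 3],
)
```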
arXiv Detail & Related papers (2021-07-20T16:42:58Z)
- Highlight Timestamp Detection Model for Comedy Videos via Multimodal Sentiment Analysis [1.6181085766811525]
We propose a multimodal architecture that obtains state-of-the-art performance in this field.
We select several benchmarks for multimodal video understanding and apply the most suitable model to find the best performance.
arXiv Detail & Related papers (2021-05-28T08:39:19Z)
- What is More Likely to Happen Next? Video-and-Language Future Event Prediction [111.93601253692165]
Given a video with aligned dialogue, people can often infer what is more likely to happen next.
In this work, we explore whether AI models are able to learn to make such multimodal commonsense next-event predictions.
We collect a new dataset, named Video-and-Language Event Prediction, with 28,726 future event prediction examples.
arXiv Detail & Related papers (2020-10-15T19:56:47Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass more relevant information to the classifier.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
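As a minimal sketch of the gating idea in that description (not the paper's exact architecture), a learned per-feature gate can down-weight a source's less relevant features before fusion; the dimensions and wiring below are assumed.

```python
import torch
import torch.nn as nn

class SourceGate(nn.Module):
    """Sketch of a learned gate that re-weights features from one source
    (e.g. video vs. dense captions) before fusion; illustrative only."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Element-wise gate in [0, 1] suppresses less relevant features.
        return features * self.gate(features)

gate = SourceGate(dim=256)
video_feats = torch.randn(2, 20, 256)  # (batch, frames, dim)
fused = gate(video_feats)              # same shape, gated
```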
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
- STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos [17.232631075144592]
Methods for instance segmentation in videos typically follow the tracking-by-detection paradigm.
We propose a novel approach that segments and tracks instances across space and time in a single stage.
Our method achieves state-of-the-art results across multiple datasets and tasks.
arXiv Detail & Related papers (2020-03-18T18:40:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.