Query-aware Long Video Localization and Relation Discrimination for Deep
Video Understanding
- URL: http://arxiv.org/abs/2310.12724v1
- Date: Thu, 19 Oct 2023 13:26:02 GMT
- Title: Query-aware Long Video Localization and Relation Discrimination for Deep
Video Understanding
- Authors: Yuanxing Xu, Yuting Wei and Bin Wu
- Abstract summary: Deep Video Understanding (DVU) Challenge aims to push the boundaries of multimodal extraction, fusion, and analytics.
This paper introduces a query-aware method for long video localization and relation discrimination, leveraging an imagelanguage pretrained model.
Our approach achieved first and fourth positions for two groups of movie-level queries.
- Score: 15.697251303126874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The surge in video and social media content underscores the need for a deeper
understanding of multimedia data. Most of the existing mature video
understanding techniques perform well with short formats and content that
requires only shallow understanding, but do not perform well with long format
videos that require deep understanding and reasoning. Deep Video Understanding
(DVU) Challenge aims to push the boundaries of multimodal extraction, fusion,
and analytics to address the problem of holistically analyzing long videos and
extract useful knowledge to solve different types of queries. This paper
introduces a query-aware method for long video localization and relation
discrimination, leveraging an imagelanguage pretrained model. This model
adeptly selects frames pertinent to queries, obviating the need for a complete
movie-level knowledge graph. Our approach achieved first and fourth positions
for two groups of movie-level queries. Sufficient experiments and final
rankings demonstrate its effectiveness and robustness.
Related papers
- OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer [14.503628667535425]
processing extensive videos presents significant challenges due to the vast data and processing demands.
We develop OmAgent, efficiently stores and retrieves relevant video frames for specific queries.
It features an Divide-and-Conquer Loop capable of autonomous reasoning.
We have endowed it with greater autonomy and a robust tool-calling system, enabling it to accomplish even more intricate tasks.
arXiv Detail & Related papers (2024-06-24T13:05:39Z) - DrVideo: Document Retrieval Based Long Video Understanding [44.34473173458403]
DrVideo is a document-retrieval-based system designed for long video understanding.
It transforms a long video into a text-based long document to retrieve key frames and augment the information of these frames.
It then employs an agent-based iterative loop to continuously search for missing information, augment relevant data, and provide final predictions.
arXiv Detail & Related papers (2024-06-18T17:59:03Z) - Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z) - MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie
Understanding [69.04413943858584]
We introduce MoVQA, a long-form movie question-answering dataset.
We also benchmark to assess the diverse cognitive capabilities of multimodal systems.
arXiv Detail & Related papers (2023-12-08T03:33:38Z) - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding
and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions of total 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z) - A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been dedicated to video segmentation and delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z) - VALUE: A Multi-Task Benchmark for Video-and-Language Understanding
Evaluation [124.02278735049235]
VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study for advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z) - Highlight Timestamp Detection Model for Comedy Videos via Multimodal
Sentiment Analysis [1.6181085766811525]
We propose a multimodal structure to obtain state-of-the-art performance in this field.
We select several benchmarks for multimodal video understanding and apply the most suitable model to find the best performance.
arXiv Detail & Related papers (2021-05-28T08:39:19Z) - A Hierarchical Multi-Modal Encoder for Moment Localization in Video
Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-trimmed frame level.
We conduct extensive experiments to evaluate our model on moment localization in video corpus on ActivityNet Captions and TVR datasets.
arXiv Detail & Related papers (2020-11-18T02:42:36Z) - VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with subtitles aligned as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.