A Unified Model for Video Understanding and Knowledge Embedding with
Heterogeneous Knowledge Graph Dataset
- URL: http://arxiv.org/abs/2211.10624v2
- Date: Sun, 2 Apr 2023 03:10:21 GMT
- Title: A Unified Model for Video Understanding and Knowledge Embedding with
Heterogeneous Knowledge Graph Dataset
- Authors: Jiaxin Deng, Dong Shen, Haojie Pan, Xiangyu Wu, Ximan Liu, Gaofeng
Meng, Fan Yang, Size Li, Ruiji Fu, Zhongyuan Wang
- Abstract summary: We propose a heterogeneous dataset that contains multi-modal video entities and rich common sense relations.
Experiments indicate that combining video understanding embeddings with factual knowledge benefits content-based video retrieval performance.
It also helps the model generate better knowledge graph embeddings, which outperform traditional KGE-based methods on the VRT and VRV tasks.
- Score: 47.805378137676605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video understanding is an important task on short video business
platforms, with wide application in video recommendation and classification.
Most existing video understanding work focuses only on information that
appears within the video content itself, including the video frames, audio,
and text. However, introducing common sense knowledge from an external
Knowledge Graph (KG) dataset is essential for video understanding when the
reference is to content that is less directly related to the video itself.
Owing to the lack of video knowledge graph datasets, work that integrates
video understanding and KGs is rare. In this paper, we propose a heterogeneous
dataset that contains multi-modal video entities and rich common sense
relations. This dataset also provides multiple novel video inference tasks,
such as the Video-Relation-Tag (VRT) and Video-Relation-Video (VRV) tasks.
Furthermore, based on this dataset, we propose an end-to-end model that
jointly optimizes the video understanding objective with knowledge graph
embedding, which can not only better inject factual knowledge into video
understanding but also generate effective multi-modal entity embeddings for
the KG. Comprehensive experiments indicate that combining video understanding
embeddings with factual knowledge benefits content-based video retrieval
performance. It also helps the model generate better knowledge graph
embeddings, which outperform traditional KGE-based methods on the VRT and VRV
tasks with improvements of at least 42.36% and 17.73% in HITS@10,
respectively.
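The abstract does not spell out the exact training objectives, so the following is a
minimal sketch of jointly optimizing a video understanding term with a knowledge graph
embedding term. It assumes an InfoNCE-style video-tag contrastive loss and a TransE-style
margin loss over KG triples; the class names, the temperature value, and the HITS@10
helper are hypothetical and not the paper's published implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointVideoKGE(nn.Module):
    def __init__(self, video_dim, num_entities, num_relations, dim=256, margin=1.0):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, dim)        # map fused video features into the entity space
        self.entity_emb = nn.Embedding(num_entities, dim)  # tags and other KG entities
        self.rel_emb = nn.Embedding(num_relations, dim)    # common sense relations
        self.margin = margin

    def video_tag_loss(self, video_feats, tag_ids):
        # Video understanding term: align each video with its ground-truth tag entity
        # via an in-batch contrastive (InfoNCE-style) loss.
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.entity_emb(tag_ids), dim=-1)
        logits = v @ t.t() / 0.07                          # temperature 0.07 is an assumption
        labels = torch.arange(v.size(0), device=v.device)
        return F.cross_entropy(logits, labels)

    def kge_loss(self, heads, rels, tails, neg_tails):
        # KGE term: TransE-style margin ranking, ||h + r - t|| small for true triples.
        h, r = self.entity_emb(heads), self.rel_emb(rels)
        pos = torch.norm(h + r - self.entity_emb(tails), dim=-1)
        neg = torch.norm(h + r - self.entity_emb(neg_tails), dim=-1)
        return F.relu(self.margin + pos - neg).mean()

def hits_at_k(scores, true_index, k=10):
    # HITS@k for one query: 1.0 if the ground-truth candidate ranks in the top k by score.
    topk = torch.topk(scores, k).indices
    return float(true_index in topk)

# Joint training step (sketch): total loss = video understanding term + KGE term.
# model = JointVideoKGE(video_dim=2048, num_entities=100_000, num_relations=50)
# loss = model.video_tag_loss(video_feats, tag_ids) + model.kge_loss(h, r, t, t_neg)

Under these assumptions, the VRT task would score candidate tag entities for each video,
and HITS@10 would measure how often the ground-truth tag appears in the top ten.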
Related papers
- Composed Video Retrieval via Enriched Context and Discriminative Embeddings [118.66322242183249]
Composed video retrieval (CoVR) is a challenging problem in computer vision.
We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information.
Our approach achieves gains of up to around 7% in recall@K=1.
arXiv Detail & Related papers (2024-03-25T17:59:03Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets across a wide range of tasks, including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
- VRAG: Region Attention Graphs for Content-Based Video Retrieval [85.54923500208041]
Region Attention Graph Networks (VRAG) improve on state-of-the-art video-level methods.
VRAG represents videos at a finer granularity via region-level features and encodes video-temporal dynamics through region-level relations.
We show that the performance gap between video-level and frame-level methods can be reduced by segmenting videos into shots and using shot embeddings for video retrieval.
arXiv Detail & Related papers (2022-05-18T16:50:45Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future research on advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- VLEngagement: A Dataset of Scientific Video Lectures for Evaluating Population-based Engagement [23.078055803229912]
Video lectures have become one of the primary modalities to impart knowledge to the masses in the current digital age.
There is still an important need for data and research aimed at understanding learner engagement with scientific video lectures.
This paper introduces VLEngagement, a novel dataset that consists of content-based and video-specific features extracted from publicly available scientific video lectures.
arXiv Detail & Related papers (2020-11-02T14:20:19Z)
- Knowledge-Based Visual Question Answering in Videos [36.23723122336639]
We introduce KnowIT VQA, a video dataset with 24,282 human-generated question-answer pairs about a popular sitcom.
The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions.
Our main findings are: (i) the incorporation of knowledge produces outstanding improvements for VQA in video, and (ii) the performance on KnowIT VQA still lags well behind human accuracy.
arXiv Detail & Related papers (2020-04-17T02:06:26Z)
- Feature Re-Learning with Data Augmentation for Video Relevance Prediction [35.87597969685573]
Re-learning is realized by projecting a given deep feature into a new space via an affine transformation (a minimal sketch follows this entry).
We propose a new data augmentation strategy that works directly on frame-level and video-level features.
arXiv Detail & Related papers (2020-04-08T05:22:41Z)
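For the feature re-learning entry above, here is a minimal sketch of the described idea,
assuming a learned affine projection y = W x + b applied to a fixed deep feature, together
with a feature-level augmentation. The module names, the frame-dropping augmentation, and
the mean pooling are assumptions for illustration, not the paper's exact method.

import torch
import torch.nn as nn

class AffineReLearning(nn.Module):
    # Re-learning as an affine projection of a fixed deep feature: y = W x + b.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.affine = nn.Linear(in_dim, out_dim, bias=True)

    def forward(self, feat):
        return self.affine(feat)

def augment_video_feature(frame_feats, drop_prob=0.3):
    # Feature-level augmentation sketch: randomly drop frame-level features,
    # then mean-pool the surviving frames into a video-level feature.
    keep = torch.rand(frame_feats.size(0)) > drop_prob
    if keep.sum() == 0:                                  # always keep at least one frame
        keep[torch.randint(0, frame_feats.size(0), (1,))] = True
    return frame_feats[keep].mean(dim=0)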
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.