Detection-Fusion for Knowledge Graph Extraction from Videos
- URL: http://arxiv.org/abs/2501.00136v1
- Date: Mon, 30 Dec 2024 20:26:11 GMT
- Title: Detection-Fusion for Knowledge Graph Extraction from Videos
- Authors: Taniya Das, Louis Mahon, Thomas Lukasiewicz
- Abstract summary: We propose a method to annotate videos with knowledge graphs.
Specifically, we propose a deep-learning-based model for this task.
We also propose an extension of our model for the inclusion of background knowledge in the construction of knowledge graphs.
- Score: 49.1574468325115
- Abstract: One of the challenging tasks in the field of video understanding is extracting semantic content from video inputs. Most existing systems use language models to describe videos in natural language sentences, but this has several major shortcomings. Such systems can rely too heavily on the language model component and base their output on statistical regularities in natural language text rather than on the visual contents of the video. Additionally, natural language annotations cannot be readily processed by a computer, are difficult to evaluate with performance metrics and cannot be easily translated into a different natural language. In this paper, we propose a method to annotate videos with knowledge graphs, and so avoid these problems. Specifically, we propose a deep-learning-based model for this task that first predicts pairs of individuals and then the relations between them. Additionally, we propose an extension of our model for the inclusion of background knowledge in the construction of knowledge graphs.
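The two-stage design described in the abstract (first predict pairs of individuals, then the relation between each pair) can be made concrete with a short sketch. This is not the authors' implementation: the video encoder is assumed to be external, and all module names, dimensions, and the fixed number of candidate pairs are illustrative assumptions.

```python
# Minimal sketch of a pair-then-relation predictor for knowledge graph
# extraction from videos. Not the paper's code; shapes and heads are assumptions.
import torch
import torch.nn as nn

class PairThenRelationModel(nn.Module):
    def __init__(self, video_dim=512, n_individuals=100, n_relations=50, max_pairs=8):
        super().__init__()
        self.max_pairs = max_pairs
        self.n_individuals = n_individuals
        # Stage 1: from a pooled clip feature, score candidate (subject, object)
        # identity pairs (max_pairs of them, two slots each).
        self.pair_head = nn.Linear(video_dim, max_pairs * 2 * n_individuals)
        # Stage 2: classify the relation for each pair, conditioned on the clip
        # feature and embeddings of the two predicted individuals.
        self.ind_embed = nn.Embedding(n_individuals, 64)
        self.rel_head = nn.Linear(video_dim + 2 * 64, n_relations)

    def forward(self, clip_feat):  # clip_feat: (B, video_dim), from some video encoder
        B = clip_feat.size(0)
        pair_logits = self.pair_head(clip_feat).view(B, self.max_pairs, 2, self.n_individuals)
        ids = pair_logits.argmax(-1)                 # (B, max_pairs, 2) predicted individuals
        subj = self.ind_embed(ids[..., 0])           # (B, max_pairs, 64)
        obj = self.ind_embed(ids[..., 1])
        ctx = clip_feat.unsqueeze(1).expand(-1, self.max_pairs, -1)
        rel_logits = self.rel_head(torch.cat([ctx, subj, obj], dim=-1))
        return ids, rel_logits                       # -> (subject, relation, object) triples
```

At inference, each (subject id, relation, object id) tuple becomes one edge of the output knowledge graph; during training the relation stage would more plausibly be conditioned on ground-truth pairs, since the argmax above is not differentiable.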
Related papers
- Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset [4.452729255042396]
A more robust and holistic language-video representation is the key to pushing video understanding forward.
The current plain and simple text descriptions and the visual-only focus for the language-video tasks result in a limited capacity in real-world natural language video retrieval tasks.
This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware.
arXiv Detail & Related papers (2024-06-19T20:16:17Z) - Exploring External Knowledge for Accurate modeling of Visual and Language Problems [2.7190267444272056]
This dissertation focuses on visual and language understanding which involves many challenging tasks.
The state-of-the-art methods for solving these problems usually involve only two parts: source data and target labels.
We developed a methodology by which we first extract external knowledge and then integrate it with the original models.
arXiv Detail & Related papers (2023-01-27T02:01:50Z) - Language-free Training for Zero-shot Video Grounding [50.701372436100684]
Video grounding aims to localize the time interval by understanding the text and video simultaneously.
One of the most challenging issues is the extremely time- and cost-consuming collection of annotations.
We present a simple yet novel training framework for video grounding in the zero-shot setting.
arXiv Detail & Related papers (2022-10-24T06:55:29Z) - Learning a Grammar Inducer from Massive Uncurated Instructional Videos [118.7279072358029]
Video-aided grammar induction aims to leverage video information for finding more accurate syntactic grammars for accompanying text.
We build a new model that can better learn video-span correlation without manually designed features.
Our model yields higher F1 scores than the previous state-of-the-art systems trained on in-domain data.
arXiv Detail & Related papers (2022-10-22T00:22:55Z) - VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
arXiv Detail & Related papers (2021-07-06T15:41:32Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z) - Knowledge Graph Extraction from Videos [46.31652453979874]
We propose a new task of knowledge graph extraction from videos, producing a description of the contents of a given video in the form of a knowledge graph.
Since no datasets exist for this task, we also include a method to automatically generate them, starting from datasets where videos are annotated with natural language.
We report results on MSVD* and MSR-VTT*, two datasets obtained from MSVD and MSR-VTT using our method.
arXiv Detail & Related papers (2020-07-20T12:23:39Z)
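The last related paper above notes that its knowledge-graph datasets (MSVD* and MSR-VTT*) were generated automatically from caption-annotated video datasets. The details of that pipeline are not given here; the following is only a hedged illustration of one way captions could be reduced to subject-relation-object triples, using spaCy with deliberately simplified rules.

```python
# Hedged illustration (not the released pipeline) of turning a natural-language
# caption into knowledge-graph triples by keeping subject-verb-object dependencies.
import spacy

nlp = spacy.load("en_core_web_sm")

def caption_to_triples(caption):
    """Return (subject, relation, object) triples extracted from one caption."""
    doc = nlp(caption)
    triples = []
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
            for s in subjects:
                for o in objects:
                    triples.append((s.lemma_, token.lemma_, o.lemma_))
    return triples

print(caption_to_triples("A man is playing a guitar on stage."))
# -> [('man', 'play', 'guitar')]
```

A real pipeline would also need to canonicalize individuals and relations into a fixed vocabulary and handle captions with prepositional or clausal arguments, which the rules above deliberately ignore.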
This list is automatically generated from the titles and abstracts of the papers on this site.