Event and Entity Extraction from Generated Video Captions
- URL: http://arxiv.org/abs/2211.02982v3
- Date: Wed, 13 Sep 2023 14:49:23 GMT
- Title: Event and Entity Extraction from Generated Video Captions
- Authors: Johannes Scherer and Ansgar Scherp and Deepayan Bhowmik
- Abstract summary: We propose a framework to extract semantic metadata from automatically generated video captions.
As metadata, we consider entities, the entities' properties, relations between entities, and the video category.
We employ two state-of-the-art dense video captioning models to generate captions for videos of the ActivityNet Captions dataset.
- Score: 4.987670632802288
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Annotation of multimedia data by humans is time-consuming and costly, while
reliable automatic generation of semantic metadata is a major challenge. We
propose a framework to extract semantic metadata from automatically generated
video captions. As metadata, we consider entities, the entities' properties,
relations between entities, and the video category. We employ two
state-of-the-art dense video captioning models with masked transformer (MT) and
parallel decoding (PVDC) to generate captions for videos of the ActivityNet
Captions dataset. Our experiments show that it is possible to extract entities,
their properties, relations between entities, and the video category from the
generated captions. We observe that the quality of the extracted information is
mainly influenced by the quality of the event localization in the video as well
as the performance of the event caption generation.
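To make the framework's output concrete, below is a minimal sketch of how entities, their properties, and relations could be pulled from a single generated caption using a generic spaCy dependency parse. This is an illustrative assumption for exposition only; the paper does not specify this exact tooling, and the dependency labels and helper function used here are not taken from the authors' pipeline.

```python
# Minimal sketch of caption-based metadata extraction (assumed spaCy pipeline;
# not the paper's actual implementation).
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model, assumed installed

def extract_metadata(caption: str) -> dict:
    """Extract entities, their properties, and subject-verb-object relations."""
    doc = nlp(caption)
    # Entities: head nouns of noun chunks; properties: their adjectival modifiers.
    entities = {}
    for chunk in doc.noun_chunks:
        props = [tok.text for tok in chunk.root.children if tok.dep_ == "amod"]
        entities[chunk.root.text] = props
    # Relations: (subject, verb, object) triples read off the dependency parse.
    relations = []
    for tok in doc:
        if tok.pos_ == "VERB":
            subs = [c.text for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
            objs = [c.text for c in tok.children if c.dep_ in ("dobj", "obj")]
            relations += [(s, tok.lemma_, o) for s in subs for o in objs]
    return {"entities": entities, "relations": relations}

# Example on a caption in the style of ActivityNet Captions output; expected roughly:
# {'entities': {'woman': ['young'], 'horse': ['brown'], 'beach': []},
#  'relations': [('woman', 'ride', 'horse')]}
print(extract_metadata("A young woman rides a brown horse along the beach."))
```

As the abstract notes, the usefulness of such extracted metadata ultimately hinges on how well the upstream captioning model localizes and describes events, not on the extraction step itself.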
Related papers
- TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors [40.48528326378281]
Dense video captioning aims to interpret and describe all temporally localized events throughout an input video. Recent state-of-the-art methods leverage large language models (LLMs) to provide detailed moment descriptions for video data. We propose TA-Prompting, which enhances VideoLLMs via Temporal Anchors that learn to precisely localize events and prompt the VideoLLMs to perform temporal-aware video event understanding.
arXiv Detail & Related papers (2026-01-06T10:45:53Z) - VoCap: Video Object Captioning and Segmentation from Any Prompt [78.90048335805047]
VoCap is a flexible model that consumes a video and a prompt of various modalities. It addresses promptable video object segmentation, referring expression segmentation, and object captioning. Our model yields state-of-the-art results on referring expression video object segmentation.
arXiv Detail & Related papers (2025-08-29T17:43:58Z) - Controllable Hybrid Captioner for Improved Long-form Video Understanding [0.24578723416255746]
Video data is extremely dense and high-dimensional. Text-based summaries of video content offer a way to represent content in a much more compact manner than raw video. We introduce Vision Language Models (VLMs) to enrich the memory with static scene descriptions.
arXiv Detail & Related papers (2025-07-22T22:09:00Z) - Grounded Video Caption Generation [74.23767687855279]
We propose a new task, dataset and model for grounded video caption generation.
This task unifies captioning and object grounding in video, where the objects in the caption are grounded in the video via temporally consistent bounding boxes.
We introduce a new grounded video caption generation model, called VideoGround, and train the model on the new automatically annotated HowToGround dataset.
arXiv Detail & Related papers (2024-11-12T06:44:24Z) - SOVC: Subject-Oriented Video Captioning [59.04029220586337]
We propose a new video captioning task, Subject-Oriented Video Captioning (SOVC), which aims to allow users to specify the describing target via a bounding box.
To support this task, we construct two subject-oriented video captioning datasets based on two widely used video captioning datasets.
arXiv Detail & Related papers (2023-12-20T17:44:32Z) - Video Summarization: Towards Entity-Aware Captions [73.28063602552741]
We propose the task of summarizing news video directly to entity-aware captions.
We show that our approach generalizes to an existing news image captioning dataset.
arXiv Detail & Related papers (2023-12-01T23:56:00Z) - VideoCon: Robust Video-Language Alignment via Contrast Captions [80.08882631838914]
Video-language alignment models are not robust to semantically-plausible contrastive changes in the video captions.
Our work identifies a broad spectrum of contrast misalignments, such as replaced entities, replaced actions, and flipped event order.
Our model sets new state-of-the-art zero-shot performance on temporally extensive video-language tasks.
arXiv Detail & Related papers (2023-11-15T19:51:57Z) - CLIP Meets Video Captioners: Attribute-Aware Representation Learning
Promotes Accurate Captioning [34.46948978082648]
ImageNet Pre-training (INP) is usually used to help encode the video content, and a task-oriented network is fine-tuned from scratch to cope with caption generation.
This paper investigates the potential deficiencies of INP for video captioning and explores the key to generating accurate descriptions.
We introduce Dual Attribute Prediction, an auxiliary task requiring a video caption model to learn the correspondence between video content and attributes.
arXiv Detail & Related papers (2021-11-30T06:37:44Z) - Spoken Moments: Learning Joint Audio-Visual Representations from Video
Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z) - Referring Segmentation in Images and Videos with Cross-Modal
Self-Attention Network [27.792054915363106]
A cross-modal self-attention (CMSA) module utilizes fine details of individual words and the input image or video.
A gated multi-level fusion (GMLF) module selectively integrates self-attentive cross-modal features.
A cross-frame self-attention (CFSA) module effectively integrates temporal information in consecutive frames.
arXiv Detail & Related papers (2021-02-09T11:27:59Z) - Exploration of Visual Features and their weighted-additive fusion for
Video Captioning [0.7388859384645263]
Video captioning is a popular task that challenges models to describe events in videos using natural language.
In this work, we investigate the ability of various visual feature representations derived from state-of-the-art convolutional neural networks to capture high-level semantic context.
arXiv Detail & Related papers (2021-01-14T07:21:13Z) - QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z) - Enriching Video Captions With Contextual Text [9.994985014558383]
We propose an end-to-end sequence-to-sequence model which generates video captions based on visual input.
We do not preprocess the text further, and let the model directly learn to attend over it.
arXiv Detail & Related papers (2020-07-29T08:58:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.