Subject-Oriented Video Captioning
- URL: http://arxiv.org/abs/2312.13330v1
- Date: Wed, 20 Dec 2023 17:44:32 GMT
- Title: Subject-Oriented Video Captioning
- Authors: Yunchuan Ma, Chang Teng, Yuankai Qi, Guorong Li, Laiyun Qing, Qi Wu,
and Qingming Huang
- Abstract summary: We propose a new video captioning task, subject-oriented video captioning, which allows users to specify the target to be described via a bounding box.
We construct two subject-oriented video captioning datasets based on two widely used video captioning datasets: MSVD and MSRVTT.
- As the first attempt, we evaluate four state-of-the-art general video captioning models and observe a large performance drop.
- Score: 64.08594243670296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Describing video content according to users' needs is a long-held goal.
Although existing video captioning methods have made significant progress, the
generated captions may not focus on the entity that users are particularly
interested in. To address this problem, we propose a new video captioning task,
subject-oriented video captioning, which allows users to specify the target to
be described via a bounding box. To support this task, we construct two
subject-oriented video captioning datasets based on two widely used video
captioning datasets: MSVD and MSRVTT, by annotating subjects in each video for
each caption. These datasets pave the way for future technique development. As
the first attempt, we evaluate four state-of-the-art general video captioning
models and observe a large performance drop. We then explore several strategies
to enable them to describe the desired target. Experimental results show clear
improvements, but there is still ample room for further exploration in this
field.
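To make the task's input/output contract concrete, here is a minimal sketch of what a subject-oriented captioning interface and dataset record could look like, assuming the user draws a box on one reference frame. All names (`SubjectQuery`, `SubjectAnnotation`, `caption_subject`) and field layouts are hypothetical illustrations, not the paper's actual datasets or API.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical types for subject-oriented video captioning; the names and
# fields are assumptions for illustration, not the paper's actual interface.

@dataclass
class SubjectQuery:
    video_path: str                               # the input video
    frame_index: int                              # frame on which the user drew the box
    box_xyxy: Tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels

@dataclass
class SubjectAnnotation:
    """One dataset record: a ground-truth caption tied to the subject it describes."""
    query: SubjectQuery
    caption: str

def caption_subject(model, query: SubjectQuery) -> str:
    """Run a (hypothetical) subject-oriented captioner on one query."""
    # A real system would track the boxed subject across frames and condition
    # the caption decoder on its features; here we simply delegate to `model`.
    return model.generate(video=query.video_path,
                          frame=query.frame_index,
                          box=query.box_xyxy)

# Example record in the spirit of the MSVD/MSR-VTT-based annotations
# described in the abstract (all values are made up):
example = SubjectAnnotation(
    query=SubjectQuery("video0001.mp4", frame_index=12,
                       box_xyxy=(34.0, 50.0, 180.0, 220.0)),
    caption="a man in a red shirt is playing the guitar",
)
print(example.caption)
```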
Related papers
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- Video Summarization: Towards Entity-Aware Captions [75.71891605682931]
We propose the task of summarizing news video directly to entity-aware captions.
We show that our approach generalizes to existing news image captioning datasets.
arXiv Detail & Related papers (2023-12-01T23:56:00Z)
- Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning [93.6842670770983]
Vid2Seq is a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale.
We show that it is possible to leverage unlabeled narrated videos for dense video captioning by reformulating the sentence boundaries of transcribed speech as pseudo event boundaries (a toy sketch of this reformulation follows this entry).
The resulting Vid2Seq model pretrained on the YT-Temporal-1B dataset improves the state of the art on a variety of dense video captioning benchmarks.
arXiv Detail & Related papers (2023-02-27T19:53:49Z)
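The pseudo-labeling idea behind Vid2Seq can be illustrated with a small sketch: timed, transcribed sentences are reinterpreted directly as (start, end, caption) targets for dense captioning pretraining. The function name, segment format, and example values below are assumptions for exposition, not Vid2Seq's actual pipeline.

```python
from typing import List, Tuple

def speech_to_pseudo_events(
    asr_segments: List[Tuple[float, float, str]],  # (start_sec, end_sec, sentence)
) -> List[dict]:
    """Treat each timed, transcribed sentence as one pseudo event with its caption."""
    return [
        {"start": start, "end": end, "caption": sentence}
        for start, end, sentence in asr_segments
    ]

# Toy example: three narrated sentences become three pseudo dense-captioning targets.
print(speech_to_pseudo_events([
    (0.0, 4.2, "first, whisk the eggs"),
    (4.2, 9.8, "then pour the batter into the pan"),
    (9.8, 15.0, "finally, flip the pancake"),
]))
```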
- Visual Subtitle Feature Enhanced Video Outline Generation [23.831220964676973]
We introduce a novel video understanding task, namely video outline generation (VOG).
To learn and evaluate VOG, we annotate a 10k+ dataset, called DuVOG.
We propose a Visual Subtitle feature Enhanced video outline generation model (VSENet).
arXiv Detail & Related papers (2022-08-24T05:26:26Z)
- O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning [41.14313691818424]
We propose an Object-Oriented Non-Autoregressive approach (O2NA) for video captioning.
O2NA performs caption generation in three steps: 1) identify the focused objects and predict their locations in the target caption; 2) generate the related attribute words and relation words of these focused objects to form a draft caption; and 3) combine video information to refine the draft caption into a fluent final caption (a toy sketch of these steps follows this entry).
Experiments on two benchmark datasets, MSR-VTT and MSVD, demonstrate the effectiveness of O2NA.
arXiv Detail & Related papers (2021-08-05T04:17:20Z)
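As a rough illustration of the three-step scheme summarized above, the toy sketch below walks a list of detected objects through placement, drafting, and refinement. Every function name and all placeholder logic are assumptions made for exposition only; they do not reproduce O2NA's model.

```python
from typing import Dict, List

def select_focused_objects(objects: List[str], k: int = 2) -> List[str]:
    """Step 1a: choose which detected objects the caption should focus on (toy: first k)."""
    return objects[:k]

def predict_positions(focused: List[str], length: int = 8) -> Dict[int, str]:
    """Step 1b: assign each focused object word a slot in the target caption (toy spacing)."""
    step = max(length // (len(focused) + 1), 1)
    return {(i + 1) * step: obj for i, obj in enumerate(focused)}

def draft_caption(positions: Dict[int, str], length: int = 8) -> List[str]:
    """Step 2: fill the remaining slots with attribute/relation placeholders (toy)."""
    return [positions.get(i, "<attr>") for i in range(length)]

def refine(draft: List[str]) -> str:
    """Step 3: refine the draft into a fluent caption (toy: drop the placeholders)."""
    return " ".join(tok for tok in draft if tok != "<attr>")

# Toy run: focus on the first two detected objects and build a caption around them.
focused = select_focused_objects(["man", "guitar", "stage"])
print(refine(draft_caption(predict_positions(focused))))  # -> "man guitar"
```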
- QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)
- Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)
- Enriching Video Captions With Contextual Text [9.994985014558383]
We propose an end-to-end sequence-to-sequence model which generates video captions based on visual input and contextual text.
We do not preprocess the contextual text further, and let the model directly learn to attend over it.
arXiv Detail & Related papers (2020-07-29T08:58:52Z)