O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable
Video Captioning
- URL: http://arxiv.org/abs/2108.02359v1
- Date: Thu, 5 Aug 2021 04:17:20 GMT
- Title: O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable
Video Captioning
- Authors: Fenglin Liu, Xuancheng Ren, Xian Wu, Bang Yang, Shen Ge, Xu Sun
- Abstract summary: We propose an Object-Oriented Non-Autoregressive approach (O2NA) for video captioning.
O2NA performs caption generation in three steps: 1) identify the focused objects and predict their locations in the target caption; 2) generate the related attribute words and relation words of these focused objects to form a draft caption; and 3) combine video information to refine the draft caption to a fluent final caption.
Experiments on two benchmark datasets, MSR-VTT and MSVD, demonstrate the effectiveness of O2NA.
- Score: 41.14313691818424
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video captioning combines video understanding and language generation.
Unlike image captioning, which describes a static image with details of
almost every object, video captioning usually considers a sequence of frames
and is biased towards focused objects, e.g., the objects that stay in focus
regardless of the changing background. Therefore, detecting and properly
accommodating focused objects is critical in video captioning. To enforce the
description of focused objects and achieve controllable video captioning, we
propose an Object-Oriented Non-Autoregressive approach (O2NA), which performs
caption generation in three steps: 1) identify the focused objects and predict
their locations in the target caption; 2) generate the related attribute words
and relation words of these focused objects to form a draft caption; and 3)
combine video information to refine the draft caption to a fluent final
caption. Since the focused objects are generated and located ahead of other
words, it is difficult to apply the word-by-word autoregressive generation
process; instead, we adopt a non-autoregressive approach. Experiments on
two benchmark datasets, i.e., MSR-VTT and MSVD, demonstrate the effectiveness
of O2NA, which achieves results competitive with the state of the art while
offering both higher diversity and higher inference speed.
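Since the three-step decoding order is the core of the approach, a toy sketch may help make it concrete. The Python below is a minimal, heavily stubbed illustration of that order (object words placed first at their predicted positions, the remaining slots filled in parallel, then iteratively refined); the vocabulary, stub predictors, and caption length are hypothetical placeholders, not the authors' model or code.

```python
import random

random.seed(0)

# Hypothetical toy vocabulary and caption length, for illustration only.
VOCAB = ["a", "man", "dog", "is", "running", "brown", "quickly", "park", "in", "the"]
MASK = "[MASK]"
CAPTION_LEN = 8


def identify_focused_objects(video_features):
    """Step 1 (stub): choose focused objects and predict their caption positions."""
    # In the paper this would be driven by visual object features;
    # here the output is hard-coded for illustration.
    return {1: "man", 4: "dog"}  # position -> object word


def fill_remaining_slots(draft, video_features):
    """Step 2 (stub): predict attribute/relation words for every [MASK] slot
    in a single parallel pass (no left-to-right loop)."""
    return [w if w != MASK else random.choice(VOCAB) for w in draft]


def refine_caption(draft, object_positions, video_features, num_iterations=2):
    """Step 3 (stub): iteratively re-predict non-object slots, a common
    refinement scheme for non-autoregressive decoders."""
    free_slots = [i for i in range(len(draft)) if i not in object_positions]
    for _ in range(num_iterations):
        idx = random.choice(free_slots)  # stand-in for a confidence-based choice
        draft[idx] = random.choice(VOCAB)
    return draft


def o2na_style_decode(video_features):
    # Object words are generated and located before all other words.
    objects = identify_focused_objects(video_features)
    draft = [MASK] * CAPTION_LEN
    for pos, word in objects.items():
        draft[pos] = word
    draft = fill_remaining_slots(draft, video_features)          # draft caption
    final = refine_caption(draft, set(objects), video_features)  # final caption
    return " ".join(final)


print(o2na_style_decode(video_features=None))
```

Because the object positions are fixed before any other word is produced, the remaining slots can be filled and refined in parallel passes rather than word by word, which is the source of the higher inference speed claimed above.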
Related papers
- Bi-directional Contextual Attention for 3D Dense Captioning [38.022425401910894]
3D dense captioning is a task involving the localization of objects and the generation of descriptions for each object in a 3D scene.
Recent approaches have attempted to incorporate contextual information by modeling relationships with object pairs or aggregating the nearest neighbor features of an object.
We introduce BiCA, a transformer encoder-decoder pipeline that performs 3D dense captioning for each object with Bi-directional Contextual Attention.
arXiv Detail & Related papers (2024-08-13T06:25:54Z)
- SOVC: Subject-Oriented Video Captioning [59.04029220586337]
We propose a new video captioning task, Subject-Oriented Video Captioning (SOVC), which aims to allow users to specify the describing target via a bounding box.
To support this task, we construct two subject-oriented video captioning datasets based on two widely used video captioning datasets.
arXiv Detail & Related papers (2023-12-20T17:44:32Z)
- MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions [93.35942025232943]
We propose a large-scale dataset called MeViS, which contains numerous motion expressions to indicate target objects in complex environments.
The goal of our benchmark is to provide a platform that enables the development of effective language-guided video segmentation algorithms.
arXiv Detail & Related papers (2023-08-16T17:58:34Z)
- Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- Hierarchical Modular Network for Video Captioning [162.70349114104107]
We propose a hierarchical modular network to bridge video representations and linguistic semantics from three levels before generating captions.
The proposed method performs favorably against state-of-the-art models on two widely used benchmarks, reaching a CIDEr score of 104.0% on MSVD and 51.5% on MSR-VTT.
arXiv Detail & Related papers (2021-11-24T13:07:05Z)
- Discriminative Latent Semantic Graph for Video Captioning [24.15455227330031]
Video captioning aims to automatically generate natural language sentences that describe the visual contents of a given video.
Our main contribution is to identify three key problems in a joint framework for future video summarization tasks.
arXiv Detail & Related papers (2021-08-08T15:11:20Z)
- MOC-GAN: Mixing Objects and Captions to Generate Realistic Images [21.240099965546637]
We introduce a more rational setting, generating a realistic image from the objects and captions.
Under this setting, objects explicitly define the critical roles in the targeted images and captions implicitly describe their rich attributes and connections.
A MOC-GAN is proposed to mix the inputs of two modalities to generate realistic images.
arXiv Detail & Related papers (2021-06-06T14:04:07Z)
- OVC-Net: Object-Oriented Video Captioning with Temporal Graph and Detail Enhancement [44.228748086927375]
We introduce the object-oriented video captioning network (OVC-Net), built on a temporal graph and detail enhancement.
To demonstrate its effectiveness, we conduct experiments on the new dataset and compare against state-of-the-art video captioning methods.
arXiv Detail & Related papers (2020-03-08T04:34:58Z)