Edit As You Wish: Video Caption Editing with Multi-grained User Control
- URL: http://arxiv.org/abs/2305.08389v2
- Date: Mon, 3 Jun 2024 07:47:36 GMT
- Title: Edit As You Wish: Video Caption Editing with Multi-grained User Control
- Authors: Linli Yao, Yuanmeng Zhang, Ziheng Wang, Xinglin Hou, Tiezheng Ge, Yuning Jiang, Xu Sun, Qin Jin
- Abstract summary: We propose a novel Video Caption Editing (VCE) task to automatically revise an existing video description guided by multi-grained user requests.
Inspired by human writing-revision habits, we design the user command as a pivotal triplet {operation, position, attribute} to cover diverse user needs from coarse-grained to fine-grained.
- Score: 61.76233268900959
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatically narrating videos in natural language that complies with user requests, i.e., the Controllable Video Captioning task, can help people manage massive videos with desired intentions. However, existing works suffer from two shortcomings: 1) the control signal is single-grained, which cannot satisfy diverse user intentions; 2) the video description is generated in a single round, so it cannot be further edited to meet dynamic needs. In this paper, we propose a novel Video Caption Editing (VCE) task to automatically revise an existing video description guided by multi-grained user requests. Inspired by human writing-revision habits, we design the user command as a pivotal triplet {operation, position, attribute} to cover diverse user needs from coarse-grained to fine-grained. To facilitate the VCE task, we automatically construct an open-domain benchmark dataset named VATEX-EDIT and manually collect an e-commerce dataset called EMMAD-EDIT. We further propose a specialized small-scale model (OPA) and compare it with two generalist Large Multi-modal Models to perform an exhaustive analysis of the novel task. For evaluation, we adopt comprehensive metrics considering caption fluency, command-caption consistency, and video-caption alignment. Experiments reveal the challenges the task poses for fine-grained multi-modal semantic understanding and processing. Our datasets, code, and evaluation tools are ready to be open-sourced.
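To make the command format concrete, the sketch below shows one way the {operation, position, attribute} triplet could be represented and applied to a caption. This is a minimal Python illustration under our own assumptions: the `EditCommand` fields, the operation names, and the toy `apply_command` editor are hypothetical, not the paper's interface; in the actual VCE task a model generates the revised caption conditioned on the video and the command rather than applying string edits.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical encoding of the paper's {operation, position, attribute}
# command triplet; field names and value conventions are assumptions,
# not the authors' specification.
@dataclass
class EditCommand:
    operation: str            # e.g. "insert" or "delete"; coarse-grained control
    position: Optional[int]   # token index in the caption; None = unspecified
    attribute: Optional[str]  # content to add; None for content-free operations

def apply_command(tokens: List[str], cmd: EditCommand) -> List[str]:
    """Toy rule-based editor, for illustration only: in the real VCE task a
    model rewrites the whole caption conditioned on video and command."""
    out = list(tokens)
    if cmd.operation == "insert" and cmd.attribute is not None:
        pos = len(out) if cmd.position is None else cmd.position
        out[pos:pos] = cmd.attribute.split()   # splice the new words in place
    elif cmd.operation == "delete" and cmd.position is not None:
        del out[cmd.position]
    return out

caption = "a man plays guitar".split()
cmd = EditCommand(operation="insert", position=3, attribute="acoustic")
print(" ".join(apply_command(caption, cmd)))   # -> a man plays acoustic guitar
```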
Related papers
- OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition [79.852642726105]
We propose a unified paradigm for parsing visually-situated text across diverse scenarios.
Specifically, we devise a universal model, called Omni, which can simultaneously handle three typical visually-situated text parsing tasks.
In Omni, all tasks share a unified encoder-decoder architecture, a unified objective (point-conditioned text generation), and a unified input representation.
arXiv Detail & Related papers (2024-03-28T03:51:14Z)
- Subject-Oriented Video Captioning [64.08594243670296]
We propose a new video captioning task, subject-oriented video captioning, which allows users to specify the target to describe via a bounding box.
We construct two subject-oriented video captioning datasets based on two widely used video captioning datasets: MSVD and MSRVTT.
As a first attempt, we evaluate four state-of-the-art general video captioning models and observe a large performance drop.
arXiv Detail & Related papers (2023-12-20T17:44:32Z)
- DocFormerv2: Local Features for Document Understanding [15.669112678509522]
We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU).
The VDU domain entails understanding documents beyond mere OCR predictions, e.g., extracting information from a form.
Our approach, termed DocFormerv2, is an encoder-decoder transformer that takes vision, language, and spatial features as input.
arXiv Detail & Related papers (2023-06-02T17:58:03Z)
- VideoXum: Cross-modal Visual and Textural Summarization of Videos [54.0985975755278]
We propose a new joint video and text summarization task.
The goal is to generate a shortened video clip along with the corresponding textual summary from a long video.
The generated shortened video clip and text narratives should be semantically well aligned.
arXiv Detail & Related papers (2023-03-21T17:51:23Z)
- IntentVizor: Towards Generic Query Guided Interactive Video Summarization Using Slow-Fast Graph Convolutional Networks [2.5234156040689233]
IntentVizor is an interactive video summarization framework guided by generic multi-modality queries.
We use a set of intents to represent user inputs and design our new interactive visual analytic interface around them.
arXiv Detail & Related papers (2021-09-30T03:44:02Z)
- Identity-Aware Multi-Sentence Video Description [105.13845996039277]
We introduce an auxiliary task, Fill-in the Identity, which aims to predict persons' IDs consistently within a set of clips.
One of the key components is a gender-aware textual representation, as well as an additional gender prediction objective in the main model.
Experiments show that our proposed Fill-in the Identity model is superior to several baselines and recent works.
arXiv Detail & Related papers (2020-08-22T09:50:43Z)
- YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos [60.62475495522428]
The goal of the YouMakeup VQA Challenge 2020 is to provide a common benchmark for fine-grained action understanding in domain-specific videos.
We propose two novel question-answering tasks to evaluate models' fine-grained action understanding abilities.
arXiv Detail & Related papers (2020-04-12T09:25:36Z)