Edit As You Wish: Video Caption Editing with Multi-grained User Control
- URL: http://arxiv.org/abs/2305.08389v2
- Date: Mon, 3 Jun 2024 07:47:36 GMT
- Title: Edit As You Wish: Video Caption Editing with Multi-grained User Control
- Authors: Linli Yao, Yuanmeng Zhang, Ziheng Wang, Xinglin Hou, Tiezheng Ge, Yuning Jiang, Xu Sun, Qin Jin
- Abstract summary: We propose a novel Video Caption Editing (VCE) task to automatically revise an existing video description guided by multi-grained user requests.
Inspired by human writing-revision habits, we design the user command as a pivotal triplet {operation, position, attribute} to cover diverse user needs from coarse-grained to fine-grained.
- Score: 61.76233268900959
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatically narrating videos in natural language that complies with user requests, i.e., the Controllable Video Captioning task, can help people manage massive videos with desired intentions. However, existing works suffer from two shortcomings: 1) the control signal is single-grained, which cannot satisfy diverse user intentions; 2) the video description is generated in a single round, so it cannot be further edited to meet dynamic needs. In this paper, we propose a novel Video Caption Editing (VCE) task to automatically revise an existing video description guided by multi-grained user requests. Inspired by human writing-revision habits, we design the user command as a pivotal triplet {operation, position, attribute} to cover diverse user needs from coarse-grained to fine-grained. To facilitate the VCE task, we automatically construct an open-domain benchmark dataset named VATEX-EDIT and manually collect an e-commerce dataset called EMMAD-EDIT. We further propose a specialized small-scale model (OPA) and compare it with two generalist Large Multi-modal Models to perform an exhaustive analysis of the novel task. For evaluation, we adopt comprehensive metrics considering caption fluency, command-caption consistency, and video-caption alignment. Experiments reveal the challenges the task poses for fine-grained multi-modal semantic understanding and processing. Our datasets, code, and evaluation tools are ready to be open-sourced.
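To make the command format concrete, the sketch below shows one way the {operation, position, attribute} triplet could be represented and applied to a caption. This is a minimal Python illustration under our own assumptions: the `EditCommand` fields, the operation names, and the toy `apply_command` editor are hypothetical, not the paper's interface; in the actual VCE task a model generates the revised caption conditioned on the video and the command rather than applying string edits.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical encoding of the paper's {operation, position, attribute}
# command triplet; field names and value conventions are assumptions,
# not the authors' specification.
@dataclass
class EditCommand:
    operation: str            # e.g. "insert" or "delete"; coarse-grained control
    position: Optional[int]   # token index in the caption; None = unspecified
    attribute: Optional[str]  # content to add; None for content-free operations

def apply_command(tokens: List[str], cmd: EditCommand) -> List[str]:
    """Toy rule-based editor, for illustration only: in the real VCE task a
    model rewrites the whole caption conditioned on video and command."""
    out = list(tokens)
    if cmd.operation == "insert" and cmd.attribute is not None:
        pos = len(out) if cmd.position is None else cmd.position
        out[pos:pos] = cmd.attribute.split()   # splice the new words in place
    elif cmd.operation == "delete" and cmd.position is not None:
        del out[cmd.position]
    return out

caption = "a man plays guitar".split()
cmd = EditCommand(operation="insert", position=3, attribute="acoustic")
print(" ".join(apply_command(caption, cmd)))   # -> a man plays acoustic guitar
```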
Related papers
- OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition [79.852642726105]
We propose a unified paradigm for parsing visually-situated text across diverse scenarios.
Specifically, we devise a universal model, called Omni, which can simultaneously handle three typical visually-situated text parsing tasks.
In Omni, all tasks share a unified encoder-decoder architecture, a unified objective (point-conditioned text generation), and a unified input representation.
arXiv Detail & Related papers (2024-03-28T03:51:14Z)
- Subject-Oriented Video Captioning [64.08594243670296]
We propose a new video captioning task, subject-oriented video captioning, which allows users to specify the target to describe via a bounding box.
We construct two subject-oriented video captioning datasets based on two widely used video captioning datasets: MSVD and MSRVTT.
As a first attempt, we evaluate four state-of-the-art general video captioning models and observe a large performance drop.
arXiv Detail & Related papers (2023-12-20T17:44:32Z)
- DocFormerv2: Local Features for Document Understanding [15.669112678509522]
We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU).
The VDU domain entails understanding documents beyond mere OCR predictions, e.g., extracting information from a form.
Our approach, termed DocFormerv2, is an encoder-decoder transformer that takes vision, language, and spatial features as input.
arXiv Detail & Related papers (2023-06-02T17:58:03Z)
- VideoXum: Cross-modal Visual and Textural Summarization of Videos [54.0985975755278]
We propose a new joint video and text summarization task.
The goal is to generate a shortened video clip along with the corresponding textual summary from a long video.
The generated shortened video clip and text narratives should be semantically well aligned.
arXiv Detail & Related papers (2023-03-21T17:51:23Z)
- IntentVizor: Towards Generic Query Guided Interactive Video Summarization Using Slow-Fast Graph Convolutional Networks [2.5234156040689233]
IntentVizor is an interactive video summarization framework guided by generic multi-modality queries.
We use a set of intents to represent user inputs and design our new interactive visual analytic interface around them.
arXiv Detail & Related papers (2021-09-30T03:44:02Z)
- Identity-Aware Multi-Sentence Video Description [105.13845996039277]
We introduce an auxiliary task, Fill-in the Identity, which aims to predict persons' IDs consistently within a set of clips.
One of the key components is a gender-aware textual representation, as well as an additional gender prediction objective in the main model.
Experiments show that our proposed Fill-in the Identity model is superior to several baselines and recent works.
arXiv Detail & Related papers (2020-08-22T09:50:43Z)
- YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos [60.62475495522428]
The goal of the YouMakeup VQA Challenge 2020 is to provide a common benchmark for fine-grained action understanding in domain-specific videos.
We propose two novel question-answering tasks to evaluate models' fine-grained action understanding abilities.
arXiv Detail & Related papers (2020-04-12T09:25:36Z)