IntentVizor: Towards Generic Query Guided Interactive Video
Summarization Using Slow-Fast Graph Convolutional Networks
- URL: http://arxiv.org/abs/2109.14834v1
- Date: Thu, 30 Sep 2021 03:44:02 GMT
- Title: IntentVizor: Towards Generic Query Guided Interactive Video
Summarization Using Slow-Fast Graph Convolutional Networks
- Authors: Guande Wu and Jianzhe Lin and Claudio T. Silva
- Abstract summary: IntentVizor is an interactive video summarization framework guided by generic multi-modality queries. A set of intents represents the user's inputs and drives the design of a new interactive visual analytic interface.
- Score: 2.5234156040689233
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The target of automatic video summarization is to create a short skim of the
original long video while preserving the major content/events. There is
growing interest in integrating users' queries into video summarization,
i.e., query-driven video summarization. This method predicts a
concise synopsis of the original video based on the user query, which is
commonly represented by the input text. However, two inherent problems exist in
this query-driven way. First, the query text might not be enough to describe
the exact and diverse needs of the user. Second, the user cannot edit once the
summaries are produced, limiting this summarization technique's practical
value. We argue that the user's needs are subtle and should be adjustable
interactively. To solve these two problems, we propose IntentVizor, an
interactive video summarization framework guided by generic multi-modality
queries. The input query that describes the user's needs is not limited to
text but can also include video snippets. We further abstract these
fine-grained multi-modality queries as a user `intent', a concept newly
proposed in this paper. An intent is interpretable, interactive, and better
quantifies and describes the user's needs. Specifically, we use a set of
intents to represent the user's inputs and to design our new interactive
visual analytic interface. Users can interactively control and adjust these
mixed-initiative intents through this newly proposed interface to obtain a
more satisfying summary. On the algorithm side, where networks help users
achieve their summarization goal via video understanding, we propose two
novel intent/scoring networks based on slow-fast features. We conduct our
experiments on two benchmark datasets. The comparison with the state-of-the-art
methods verifies the effectiveness of the proposed framework.
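The abstract names two intent/scoring networks built on slow-fast features but gives no architectural detail here. Purely as a hedged illustration of the slow-fast scoring idea (a dense "fast" pathway over all shot features, a temporally strided "slow" pathway for context, and an intent-conditioned scoring head), a PyTorch-style sketch might look as follows. Every module name, dimension, and the fusion scheme below are assumptions rather than the authors' implementation, and plain linear layers stand in for the paper's graph convolutions.

```python
import torch
import torch.nn as nn

class SlowFastShotScorer(nn.Module):
    """Hypothetical sketch: score video shots against a user-intent vector
    using a dense (fast) and a temporally strided (slow) feature pathway.
    The real IntentVizor networks use graph convolutions; linear layers
    stand in here to keep the sketch short."""

    def __init__(self, feat_dim=1024, intent_dim=128, stride=4):
        super().__init__()
        self.stride = stride                    # temporal stride of the slow pathway
        self.fast_proj = nn.Linear(feat_dim, 256)
        self.slow_proj = nn.Linear(feat_dim, 256)
        self.score_head = nn.Sequential(
            nn.Linear(256 + 256 + intent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, frame_feats, intent):
        # frame_feats: (T, feat_dim) per-shot features; intent: (intent_dim,)
        fast = self.fast_proj(frame_feats)                          # (T, 256)
        slow = self.slow_proj(frame_feats[:: self.stride])          # (T//stride, 256)
        slow_ctx = slow.mean(dim=0, keepdim=True).expand_as(fast)   # broadcast slow context
        intent_ctx = intent.unsqueeze(0).expand(fast.size(0), -1)   # condition on the intent
        scores = self.score_head(torch.cat([fast, slow_ctx, intent_ctx], dim=-1))
        return scores.squeeze(-1)                                   # (T,) relevance per shot

# Usage: rank shots for one user-adjusted intent, keep the top-k as the skim.
scorer = SlowFastShotScorer()
feats = torch.randn(120, 1024)     # e.g. 120 shots of precomputed features
intent = torch.randn(128)          # one intent vector (adjusted via the interface)
summary_idx = scorer(feats, intent).topk(15).indices
```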
Related papers
- Your Interest, Your Summaries: Query-Focused Long Video Summarization [0.6041235048439966]
This paper introduces an approach for query-focused video summarization, aiming to align video summaries closely with user queries.
We propose the Fully Convolutional Sequence Network with Attention (FCSNA-QFVS), a novel approach designed for this task.
arXiv Detail & Related papers (2024-10-17T23:37:58Z) - Query-based Video Summarization with Pseudo Label Supervision [19.229722872058055]
Existing manually labelled datasets for query-based video summarization are costly to create and thus small.
Self-supervision can address the data sparsity challenge by using a pretext task and defining a method to acquire extra data with pseudo labels.
Experimental results show that the proposed video summarization algorithm achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-04T22:28:17Z) - Edit As You Wish: Video Caption Editing with Multi-grained User Control [61.76233268900959]
We propose a novel Video Caption Editing (VCE) task to automatically revise an existing video description guided by multi-grained user requests.
Inspired by human writing-revision habits, we design the user command as a pivotal triplet ⟨operation, position, attribute⟩ to cover diverse user needs from coarse-grained to fine-grained (a minimal sketch of this triplet follows the list below).
arXiv Detail & Related papers (2023-05-15T07:12:19Z) - VideoXum: Cross-modal Visual and Textural Summarization of Videos [54.0985975755278]
We propose a new joint video and text summarization task.
The goal is to generate both a shortened video clip along with the corresponding textual summary from a long video.
The generated shortened video clip and text narratives should be semantically well aligned.
arXiv Detail & Related papers (2023-03-21T17:51:23Z) - Learning to Retrieve Videos by Asking Questions [29.046045230398708]
We propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog.
The key contribution of our framework is a novel multimodal question generator that learns to ask questions that maximize the subsequent video retrieval performance.
We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems.
arXiv Detail & Related papers (2022-05-11T19:14:39Z) - Text Summarization with Latent Queries [60.468323530248945]
We introduce LaQSum, the first unified text summarization system that learns Latent Queries from documents for abstractive summarization with any existing query forms.
Under a deep generative framework, our system jointly optimizes a latent query model and a conditional language model, allowing users to plug-and-play queries of any type at test time.
Our system robustly outperforms strong comparison systems across summarization benchmarks with different query types, document settings, and target domains.
arXiv Detail & Related papers (2021-05-31T21:14:58Z) - Temporal Query Networks for Fine-grained Video Understanding [88.9877174286279]
We cast this into a query-response mechanism, where each query addresses a particular question, and has its own response label set.
We evaluate the method extensively on the FineGym and Diving48 benchmarks for fine-grained action classification and surpass the state-of-the-art using only RGB features.
arXiv Detail & Related papers (2021-04-19T17:58:48Z) - Fill-in-the-blank as a Challenging Video Understanding Evaluation
Framework [19.031957183047048]
We introduce a novel dataset consisting of 28,000 videos and fill-in-the-blank tests.
We show that both a multimodal model and a strong language model have a large gap with human performance.
arXiv Detail & Related papers (2021-04-09T04:00:10Z) - Fine-grained Iterative Attention Network for Temporal Language
Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z) - Convolutional Hierarchical Attention Network for Query-Focused Video
Summarization [74.48782934264094]
This paper addresses the task of query-focused video summarization, which takes a user's query and a long video as inputs.
We propose a method, named Convolutional Hierarchical Attention Network (CHAN), which consists of two parts: feature encoding network and query-relevance computing module.
In the encoding network, we employ a convolutional network with a local self-attention mechanism and a query-aware global attention mechanism to learn the visual information of each shot.
arXiv Detail & Related papers (2020-01-31T04:30:14Z)
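As noted in the VCE entry above, the ⟨operation, position, attribute⟩ command triplet lends itself to a small data-structure sketch. The operation vocabulary, field types, and the toy executor below are illustrative guesses, not the VCE authors' schema; a real VCE model generates the revised caption rather than applying the edit mechanically.

```python
from dataclasses import dataclass
from enum import Enum

class EditOp(Enum):
    # Illustrative operation set; the VCE paper defines its own vocabulary.
    INSERT = "insert"
    DELETE = "delete"
    REPLACE = "replace"

@dataclass
class CaptionEdit:
    """Hypothetical encoding of a multi-grained user command as the
    <operation, position, attribute> triplet described in the VCE entry."""
    operation: EditOp    # what to do
    position: slice      # which span of the existing caption to touch
    attribute: str       # the content/aspect the user wants expressed

def apply_edit(caption: list[str], cmd: CaptionEdit) -> list[str]:
    # Toy executor: shows how the triplet pins down an edit unambiguously.
    tokens = caption.copy()
    if cmd.operation is EditOp.DELETE:
        del tokens[cmd.position]
    elif cmd.operation is EditOp.INSERT:
        tokens[cmd.position.start:cmd.position.start] = cmd.attribute.split()
    else:  # EditOp.REPLACE
        tokens[cmd.position] = cmd.attribute.split()
    return tokens

# Usage: refine "a man cooks food" to name the dish.
print(apply_edit("a man cooks food".split(),
                 CaptionEdit(EditOp.REPLACE, slice(3, 4), "a pasta dish")))
# -> ['a', 'man', 'cooks', 'a', 'pasta', 'dish']
```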