Your Interest, Your Summaries: Query-Focused Long Video Summarization
- URL: http://arxiv.org/abs/2410.14087v1
- Date: Thu, 17 Oct 2024 23:37:58 GMT
- Title: Your Interest, Your Summaries: Query-Focused Long Video Summarization
- Authors: Nirav Patel, Payal Prajapati, Maitrik Shah
- Abstract summary: This paper introduces an approach for query-focused video summarization, aiming to align video summaries closely with user queries.
We propose the Fully Convolutional Sequence Network with Attention (FCSNA-QFVS), a novel approach designed for this task.
- Abstract: Generating a concise and informative video summary from a long video is important, yet subjective due to varying scene importance. Users' ability to specify scene importance through text queries enhances the relevance of such summaries. This paper introduces an approach for query-focused video summarization, aiming to align video summaries closely with user queries. To this end, we propose the Fully Convolutional Sequence Network with Attention (FCSNA-QFVS), a novel approach designed for this task. Leveraging temporal convolutional and attention mechanisms, our model effectively extracts and highlights relevant content based on user-specified queries. Experimental validation on a benchmark dataset for query-focused video summarization demonstrates the effectiveness of our approach.
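The listing gives only a high-level description of FCSNA-QFVS (temporal convolution plus attention over user queries). As a rough, purely illustrative sketch of that idea, assuming nothing about the paper's actual architecture, the following NumPy code scores frames by first mixing each frame's features with its temporal neighbours and then attending to a query embedding; all function names, shapes, and the selection step are hypothetical:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def score_frames(frame_feats, query_emb, conv_kernel):
    """Score each frame against the query.

    frame_feats: (T, D) per-frame visual features
    query_emb:   (D,)   embedding of the user's text query
    conv_kernel: (K,)   temporal smoothing weights (K odd)
    """
    T, D = frame_feats.shape
    K = len(conv_kernel)
    pad = K // 2
    padded = np.pad(frame_feats, ((pad, pad), (0, 0)), mode="edge")
    # Temporal convolution: blend each frame with its neighbours.
    conv = np.stack([
        sum(conv_kernel[k] * padded[t + k] for k in range(K))
        for t in range(T)
    ])
    # Attention: relevance of each temporally contextualised frame to the query.
    logits = conv @ query_emb / np.sqrt(D)
    return softmax(logits)

def summarize(frame_feats, query_emb, conv_kernel, budget):
    """Keep the `budget` highest-scoring frames, returned in time order."""
    scores = score_frames(frame_feats, query_emb, conv_kernel)
    keep = np.argsort(scores)[-budget:]
    return np.sort(keep)
```

A query whose embedding aligns with certain frames concentrates the attention weights on those frames, so the selected subset tracks the user's stated interest rather than a fixed notion of importance.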
Related papers
- Personalized Video Summarization using Text-Based Queries and Conditional Modeling [3.4447129363520337]
This thesis explores enhancing video summarization by integrating text-based queries and conditional modeling.
Evaluation metrics such as accuracy and F1-score assess the quality of the generated summaries.
arXiv Detail & Related papers (2024-08-27T02:43:40Z)
- GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval [56.610806615527885]
This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video.
By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions.
GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
- Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1,200 long videos, each with a high-quality summary annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z)
- Query-based Video Summarization with Pseudo Label Supervision [19.229722872058055]
Existing manually labelled datasets for query-based video summarization are costly to produce and thus small.
Self-supervision can address the data sparsity challenge by using a pretext task and defining a method to acquire extra data with pseudo labels.
Experimental results show that the proposed video summarization algorithm achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-04T22:28:17Z)
- VideoXum: Cross-modal Visual and Textural Summarization of Videos [54.0985975755278]
We propose a new joint video and text summarization task.
The goal is to generate both a shortened video clip along with the corresponding textual summary from a long video.
The generated shortened video clip and text narratives should be semantically well aligned.
arXiv Detail & Related papers (2023-03-21T17:51:23Z)
- CLIP-It! Language-Guided Video Summarization [96.69415453447166]
This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization.
We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another.
Our model can be extended to the unsupervised setting by training without ground-truth supervision.
arXiv Detail & Related papers (2021-07-01T17:59:27Z)
- Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)
- Convolutional Hierarchical Attention Network for Query-Focused Video Summarization [74.48782934264094]
This paper addresses the task of query-focused video summarization, which takes a user's query and a long video as inputs.
We propose a method, named Convolutional Hierarchical Attention Network (CHAN), which consists of two parts: a feature encoding network and a query-relevance computing module.
In the encoding network, we employ a convolutional network with a local self-attention mechanism and a query-aware global attention mechanism to learn the visual information of each shot.
arXiv Detail & Related papers (2020-01-31T04:30:14Z)
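The CHAN entry above mentions two attention stages (local self-attention within a shot, then query-aware global attention across shots) without further detail. As a hedged illustration of that two-level pattern only, not of CHAN's actual implementation, the NumPy sketch below applies self-attention inside each shot, pools each shot to a single vector, and then weights shots by their similarity to a query embedding; all names and shapes are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_self_attention(shot_feats):
    """Self-attention among the frames of one shot.

    Uses the raw features as queries, keys, and values (no learned projections).
    shot_feats: (F, D) features for the F frames of the shot.
    """
    d = shot_feats.shape[-1]
    attn = softmax(shot_feats @ shot_feats.T / np.sqrt(d))
    return attn @ shot_feats

def encode_shots(shots):
    """Encode each shot: local self-attention, then mean-pool frames to one vector."""
    return np.stack([local_self_attention(s).mean(axis=0) for s in shots])

def query_relevance(shot_vecs, query_emb):
    """Query-aware global attention: one relevance weight per shot."""
    d = shot_vecs.shape[-1]
    return softmax(shot_vecs @ query_emb / np.sqrt(d))
```

The local stage contextualises frames within a shot, while the global stage decides how much each shot matters for the given query; the per-shot weights could then drive shot selection for the summary.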
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.