Hierarchical Video-Moment Retrieval and Step-Captioning
- URL: http://arxiv.org/abs/2303.16406v1
- Date: Wed, 29 Mar 2023 02:33:54 GMT
- Title: Hierarchical Video-Moment Retrieval and Step-Captioning
- Authors: Abhay Zala, Jaemin Cho, Satwik Kottur, Xilun Chen, Barlas Oğuz, Yashar Mehdad, Mohit Bansal
- Abstract summary: HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is growing interest in searching for information from large video
corpora. Prior works have studied relevant tasks, such as text-based video
retrieval, moment retrieval, video summarization, and video captioning in
isolation, without an end-to-end setup that can jointly search from video
corpora and generate summaries. Such an end-to-end setup would allow for many
interesting applications, e.g., a text-based search that finds a relevant video
from a video corpus, extracts the most relevant moment from that video, and
segments the moment into important steps with captions. To address this, we
present the HiREST (HIerarchical REtrieval and STep-captioning) dataset and
propose a new benchmark that covers hierarchical information retrieval and
visual/textual stepwise summarization from an instructional video corpus.
HiREST consists of 3.4K text-video pairs from an instructional video dataset,
where 1.1K videos are annotated with moment spans relevant to the text query and
a breakdown of each moment into key instruction steps with captions and
timestamps (totaling 8.6K step captions). Our hierarchical benchmark consists of
video retrieval, moment retrieval, and two novel tasks: moment segmentation and
step captioning. In moment segmentation, models break a video moment down into
instruction steps and identify their start-end boundaries. In step captioning,
models generate a textual summary for each step. We also present task-specific
and end-to-end joint baseline models as starting points for our new benchmark.
While the baseline models show some promising results, there is still large room
for improvement by the community. Project website:
https://hirest-cvpr2023.github.io
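
To make the task hierarchy concrete, here is a minimal Python sketch of the end-to-end setup the abstract describes: a text query drives video retrieval, then moment retrieval, then moment segmentation, then step captioning. The function names, placeholder logic, and data fields are illustrative assumptions for this sketch, not the released HiREST code or its data schema; in practice each helper would be a learned model (or one head of a joint model).

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Step:
    caption: str               # textual summary of one instruction step
    span: Tuple[float, float]  # (start_sec, end_sec) of the step

@dataclass
class SearchResult:
    video_id: str
    moment: Tuple[float, float]  # moment span relevant to the query
    steps: List[Step]            # moment broken down into captioned steps

# Placeholders for learned models; each corresponds to one benchmark task.

def retrieve_video(query: str, corpus: List[str]) -> str:
    return corpus[0]  # placeholder: rank corpus videos by query relevance

def retrieve_moment(query: str, video_id: str) -> Tuple[float, float]:
    return (30.0, 150.0)  # placeholder: predict the query-relevant span

def segment_moment(query: str, video_id: str,
                   moment: Tuple[float, float]) -> List[Tuple[float, float]]:
    start, end = moment
    mid = (start + end) / 2
    return [(start, mid), (mid, end)]  # placeholder: predict step boundaries

def caption_step(query: str, video_id: str, span: Tuple[float, float]) -> str:
    return f"step covering {span[0]:.0f}-{span[1]:.0f}s"  # placeholder caption

def hierarchical_search(query: str, corpus: List[str]) -> SearchResult:
    """Text query -> video retrieval -> moment retrieval
    -> moment segmentation -> step captioning."""
    video_id = retrieve_video(query, corpus)
    moment = retrieve_moment(query, video_id)
    steps = [Step(caption_step(query, video_id, span), span)
             for span in segment_moment(query, video_id, moment)]
    return SearchResult(video_id, moment, steps)

if __name__ == "__main__":
    print(hierarchical_search("how to make butter biscuits", ["vid_001", "vid_002"]))
```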
Related papers
- StoryBench: A Multifaceted Benchmark for Continuous Story Visualization (arXiv, 2023-08-22)
We introduce StoryBench: a new, challenging multi-task benchmark to reliably evaluate text-to-video models.
Our benchmark includes three video generation tasks of increasing difficulty: action execution, story continuation, and story generation.
We evaluate small yet strong text-to-video baselines, and show the benefits of training on story-like data algorithmically generated from existing video captions.
- VideoXum: Cross-modal Visual and Textural Summarization of Videos (arXiv, 2023-03-21)
We propose a new joint video and text summarization task.
The goal is to generate both a shortened video clip and the corresponding textual summary from a long video.
The generated shortened video clip and text narratives should be semantically well aligned.
- Temporal Perceiving Video-Language Pre-training (arXiv, 2023-01-18)
This work introduces a novel text-video localization pretext task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
- HierVL: Learning Hierarchical Video-Language Embeddings (arXiv, 2023-01-05)
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and the video level.
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA.
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency (arXiv, 2022-08-14)
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries (arXiv, 2021-07-20)
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
- A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus (arXiv, 2020-11-18)
We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER), which encodes a video at both the coarse-grained clip level and the fine-grained frame level.
We conduct extensive experiments to evaluate our model on moment localization in video corpus on the ActivityNet Captions and TVR datasets.
- Text-based Localization of Moments in a Video Corpus (arXiv, 2020-08-20)
We address the task of temporal localization of moments in a corpus of videos for a given sentence query.
We propose the Hierarchical Moment Alignment Network (HMAN), which learns an effective joint embedding space for moments and sentences.
In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries.