Detours for Navigating Instructional Videos
- URL: http://arxiv.org/abs/2401.01823v2
- Date: Sat, 4 May 2024 16:44:32 GMT
- Title: Detours for Navigating Instructional Videos
- Authors: Kumar Ashutosh, Zihui Xue, Tushar Nagarajan, Kristen Grauman
- Abstract summary: We propose VidDetours, a video-language approach that learns to retrieve the targeted temporal segments from a large repository of how-to's.
We show our model's significant improvements over the best available methods for video retrieval and question answering, with recall rates exceeding the state of the art by 35%.
- Score: 58.1645668396789
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the video detours problem for navigating instructional videos. Given a source video and a natural language query asking to alter the how-to video's current path of execution in a certain way, the goal is to find a related "detour video" that satisfies the requested alteration. To address this challenge, we propose VidDetours, a novel video-language approach that learns to retrieve the targeted temporal segments from a large repository of how-to's using video-and-text conditioned queries. Furthermore, we devise a language-based pipeline that exploits how-to video narration text to create weakly supervised training data. We demonstrate our idea applied to the domain of how-to cooking videos, where a user can detour from their current recipe to find steps with alternate ingredients, tools, and techniques. Validating on a ground-truth annotated dataset of 16K samples, we show our model's significant improvements over the best available methods for video retrieval and question answering, with recall rates exceeding the state of the art by 35%.
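To make the retrieval formulation concrete, here is a minimal, hypothetical sketch (not the authors' VidDetours code): the current video segment and the natural-language detour request are embedded jointly, each candidate segment from the how-to repository is embedded separately, and candidates are ranked by cosine similarity. All encoder names, data, and scores below are illustrative placeholders.

```python
# Illustrative sketch only: a hypothetical video-and-text conditioned retriever,
# NOT the authors' VidDetours implementation. The encoders here are stand-ins
# that return random vectors; in practice they would be learned models.
import numpy as np

rng = np.random.default_rng(0)
DIM = 256  # embedding dimension (placeholder)

def encode_source_and_query(source_clip: str, query: str) -> np.ndarray:
    """Hypothetical joint encoder for the current video segment plus the
    natural-language detour request (e.g. "do this without a stand mixer")."""
    return rng.normal(size=DIM)

def encode_candidate_segment(segment: str) -> np.ndarray:
    """Hypothetical encoder for one temporal segment from the how-to repository."""
    return rng.normal(size=DIM)

def rank_detours(source_clip, query, candidate_segments, top_k=3):
    # Embed the (source video, detour query) pair once.
    q = encode_source_and_query(source_clip, query)
    q /= np.linalg.norm(q)
    scores = []
    for seg in candidate_segments:
        v = encode_candidate_segment(seg)
        v /= np.linalg.norm(v)
        scores.append(float(q @ v))  # cosine similarity
    order = np.argsort(scores)[::-1][:top_k]
    return [(candidate_segments[i], scores[i]) for i in order]

if __name__ == "__main__":
    candidates = ["whisk by hand 0:40-1:10",
                  "use a food processor 2:05-2:50",
                  "knead dough manually 3:15-4:00"]
    print(rank_detours("mixing batter with a stand mixer",
                       "how do I do this without a stand mixer?", candidates))
```

Per the abstract, the actual system learns video-language encoders and retrieves temporal segments from a large repository of how-to videos, with weakly supervised training data built from narration text; the sketch only illustrates the conditioning-and-ranking structure.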
Related papers
- VideoRAG: Retrieval-Augmented Generation over Video Corpus [57.68536380621672]
VideoRAG is a novel framework that dynamically retrieves videos from a corpus based on their relevance to the query.
We experimentally validate the effectiveness of VideoRAG, showing that it outperforms relevant baselines.
arXiv Detail & Related papers (2025-01-10T11:17:15Z)
- Agent-based Video Trimming [17.519404251018308]
We introduce a novel task called Video Trimming (VT).
VT focuses on detecting wasted footage, selecting valuable segments, and composing them into a final video with a coherent story.
AVT (Agent-based Video Trimming) received more favorable evaluations in user studies and demonstrated superior mAP and precision on the YouTube Highlights, TVSum, and our own dataset for the highlight detection task.
arXiv Detail & Related papers (2024-12-12T17:59:28Z)
- In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated & unpaired data, in which training uses only text queries together with uncurated web videos.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z)
- GOAL: A Challenging Knowledge-grounded Video Captioning Benchmark for Real-time Soccer Commentary Generation [75.60413443783953]
We present GOAL, a benchmark of over 8.9k soccer video clips, 22k sentences, and 42k knowledge triples, supporting a challenging new task setting: Knowledge-grounded Video Captioning (KGVC).
Our data and code are available at https://github.com/THU-KEG/goal.
arXiv Detail & Related papers (2023-03-26T08:43:36Z)
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-grained frame level.
We conduct extensive experiments to evaluate our model on moment localization in video corpus on ActivityNet Captions and TVR datasets.
arXiv Detail & Related papers (2020-11-18T02:42:36Z)
- Straight to the Point: Fast-forwarding Videos via Reinforcement Learning Using Textual Data [1.004766879203303]
We present a novel methodology based on a reinforcement learning formulation to accelerate instructional videos.
Our approach can adaptively discard frames that are not relevant to conveying the information, without creating gaps in the final video.
We propose a novel network, called Visually-guided Document Attention Network (VDAN), able to generate a highly discriminative embedding space.
arXiv Detail & Related papers (2020-03-31T14:07:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.