Stitch-a-Recipe: Video Demonstration from Multistep Descriptions
- URL: http://arxiv.org/abs/2503.13821v1
- Date: Tue, 18 Mar 2025 01:57:48 GMT
- Title: Stitch-a-Recipe: Video Demonstration from Multistep Descriptions
- Authors: Chi Hsuan Wu, Kumar Ashutosh, Kristen Grauman
- Abstract summary: We propose Stitch-a-Recipe, a novel retrieval-based method to assemble a video demonstration from a multistep description. The resulting video contains clips that accurately reflect all the step descriptions, while being visually coherent. Stitch-a-Recipe achieves state-of-the-art performance, with quantitative gains up to 24% as well as dramatic wins in a human preference study.
- Score: 51.314912554605066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When obtaining visual illustrations from text descriptions, today's methods take a description with a single text context (a caption, or an action description) and retrieve or generate the matching visual context. However, prior work does not permit visual illustration of multistep descriptions, e.g. a cooking recipe composed of multiple steps. Furthermore, simply handling each step description in isolation would result in an incoherent demonstration. We propose Stitch-a-Recipe, a novel retrieval-based method to assemble a video demonstration from a multistep description. The resulting video contains clips, possibly from different sources, that accurately reflect all the step descriptions, while being visually coherent. We formulate a training pipeline that creates large-scale weakly supervised data containing diverse and novel recipes and injects hard negatives that promote both correctness and coherence. Validated on in-the-wild instructional videos, Stitch-a-Recipe achieves state-of-the-art performance, with quantitative gains up to 24% as well as dramatic wins in a human preference study.
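
The abstract frames a concrete selection problem: choose one clip per step so that each clip matches its step description while consecutive clips stay visually coherent. As a rough, hypothetical sketch of that retrieval-plus-coherence idea (not the paper's actual model; the embedding inputs, the `coherence_weight` parameter, and the Viterbi-style selection below are all assumptions made for illustration), one could combine per-step relevance with pairwise coherence like this:

```python
import numpy as np

def stitch_recipe(step_embs, clip_embs, clip_vis_embs, coherence_weight=0.5):
    """Pick one clip index per recipe step, trading off step-clip relevance
    against visual coherence between consecutive clips (Viterbi-style DP).

    step_embs:     (S, D) text embeddings, one per recipe step
    clip_embs:     (C, D) text-aligned clip embeddings (same space as steps)
    clip_vis_embs: (C, V) purely visual clip embeddings used for coherence
    """
    def cos(a, b):
        a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
        b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
        return a @ b.T

    relevance = cos(step_embs, clip_embs)          # (S, C) step-to-clip match
    coherence = cos(clip_vis_embs, clip_vis_embs)  # (C, C) clip-to-clip similarity

    S, C = relevance.shape
    score = relevance[0].copy()          # best score ending in clip c at step 0
    back = np.zeros((S, C), dtype=int)   # backpointers for sequence recovery

    for s in range(1, S):
        # transition: previous score plus weighted coherence with the previous clip
        trans = score[:, None] + coherence_weight * coherence  # (C_prev, C_curr)
        back[s] = trans.argmax(axis=0)
        score = trans.max(axis=0) + relevance[s]

    # backtrack the best clip sequence
    picks = [int(score.argmax())]
    for s in range(S - 1, 0, -1):
        picks.append(int(back[s][picks[-1]]))
    return picks[::-1]

# toy usage with random embeddings standing in for real encoder outputs
rng = np.random.default_rng(0)
steps = rng.normal(size=(4, 64))    # 4 recipe steps
clips = rng.normal(size=(50, 64))   # 50 candidate clips (text-aligned space)
visual = rng.normal(size=(50, 32))  # visual features of the same clips
print(stitch_recipe(steps, clips, visual))
```

In the paper, the correctness and coherence signals are learned with weakly supervised data and hard negatives; the dynamic program above only illustrates how per-step relevance and pairwise coherence might be combined at assembly time.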
Related papers
- Mobius: Text to Seamless Looping Video Generation via Latent Shift [50.04534295458244]
We present Mobius, a novel method to generate seamlessly looping videos from text descriptions directly without any user annotations. Our method repurposes the pre-trained video latent diffusion model for generating looping videos from text prompts without any training.
arXiv Detail & Related papers (2025-02-27T17:33:51Z) - Contrastive Sequential-Diffusion Learning: Non-linear and Multi-Scene Instructional Video Synthesis [9.687215124767063]
We propose a contrastive sequential video diffusion method that selects the most suitable previously generated scene to guide and condition the denoising process of the next scene. Experiments with action-centered data from the real world demonstrate the practicality and improved consistency of our model compared to previous work.
arXiv Detail & Related papers (2024-07-16T15:03:05Z) - Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos [58.53311308617818]
We present a new multi-shot video understanding benchmark Shot2Story with detailed shot-level captions, comprehensive video summaries and question-answering pairs. Preliminary experiments show some challenges in generating a long and comprehensive video summary for multi-shot videos. The generated imperfect summaries can already achieve competitive performance on existing video understanding tasks.
arXiv Detail & Related papers (2023-12-16T03:17:30Z) - Multi Sentence Description of Complex Manipulation Action Videos [3.7486111821201287]
Existing approaches for automatic video descriptions are mostly focused on single sentence generation at a fixed level of detail.
We propose one hybrid statistical and one end-to-end framework to address this problem.
arXiv Detail & Related papers (2023-11-13T12:27:06Z) - Learning to Ground Instructional Articles in Videos through Narrations [50.3463147014498]
We present an approach for localizing steps of procedural activities in narrated how-to videos.
We source the step descriptions from a language knowledge base (wikiHow) containing instructional articles.
Our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities.
arXiv Detail & Related papers (2023-06-06T15:45:53Z) - TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z) - Towards Diverse Paragraph Captioning for Untrimmed Videos [40.205433926432434]
Existing approaches mainly solve the problem in two steps: event detection and then event captioning.
We propose a paragraph captioning model which eschews the problematic event detection stage and directly generates paragraphs for untrimmed videos.
arXiv Detail & Related papers (2021-05-30T09:28:43Z) - A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos [126.66212285239624]
We propose a benchmark of structured procedural knowledge extracted from cooking videos.
Our manually annotated open-vocabulary resource includes 356 instructional cooking videos and 15,523 video clip/sentence-level annotations.
arXiv Detail & Related papers (2020-05-02T05:15:20Z)