Related papers: RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives

URL: http://arxiv.org/abs/2405.18406v3
Date: Thu, 31 Oct 2024 23:27:09 GMT
Title: RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives
Authors: Jaehong Yoon, Shoubin Yu, Mohit Bansal
Abstract summary: This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework. Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content. The proposed framework demonstrates impressive versatile capabilities in video-to-paragraph generation, video content editing, and can be incorporated into other SoTA video generative models for further enhancement.
Score: 58.15403987979496
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent video generative models primarily rely on carefully written text prompts for specific tasks, like inpainting or style editing. They require labor-intensive textual descriptions for input videos, hindering their flexibility to adapt personal/raw videos to user specifications. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework that supports multiple video editing capabilities such as removal, addition, and modification, through a unified pipeline. RACCooN consists of two principal stages: Video-to-Paragraph (V2P) and Paragraph-to-Video (P2V). In the V2P stage, we automatically describe video scenes in well-structured natural language, capturing both the holistic context and focused object details. Subsequently, in the P2V stage, users can optionally refine these descriptions to guide the video diffusion model, enabling various modifications to the input video, such as removing, changing subjects, and/or adding new objects. The proposed approach stands out from other methods through several significant contributions: (1) RACCooN suggests a multi-granular spatiotemporal pooling strategy to generate well-structured video descriptions, capturing both the broad context and object details without requiring complex human annotations, simplifying precise video content editing based on text for users. (2) Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content. (3) RACCooN also plans to imagine new objects in a given video, so users simply prompt the model to receive a detailed video editing plan for complex video editing. The proposed framework demonstrates impressive versatile capabilities in video-to-paragraph generation, video content editing, and can be incorporated into other SoTA video generative models for further enhancement.
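The abstract's P2V step, in which a user refines the auto-generated description to request removal, modification, or addition, can be illustrated with a minimal sketch. This is not the RACCooN implementation: `VideoDescription` and `plan_edits` are hypothetical names, and the sketch assumes a V2P output can be represented as a holistic context string plus a mapping from object names to detail text. It only shows how comparing the original and refined descriptions could yield a list of edit operations.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class VideoDescription:
    """Hypothetical V2P output: holistic context plus per-object details."""
    context: str
    objects: Dict[str, str] = field(default_factory=dict)  # object name -> detail text


def plan_edits(original: VideoDescription,
               refined: VideoDescription) -> List[Tuple[str, str]]:
    """Compare two descriptions and return (operation, object name) pairs."""
    edits = []
    # Objects in the original are either dropped, reworded, or left unchanged.
    for name, detail in original.objects.items():
        if name not in refined.objects:
            edits.append(("remove", name))
        elif refined.objects[name] != detail:
            edits.append(("modify", name))
    # Objects that appear only in the refined description are additions.
    for name in refined.objects:
        if name not in original.objects:
            edits.append(("add", name))
    return edits


# Illustrative data only; these descriptions are invented for the example.
original = VideoDescription(
    context="A dog runs across a park.",
    objects={"dog": "a brown dog running", "bench": "a wooden bench"},
)
refined = VideoDescription(
    context="A dog runs across a park.",
    objects={"dog": "a white dog running", "ball": "a red ball on the grass"},
)
print(plan_edits(original, refined))
```

In a real system each returned operation would condition a video diffusion model; here the plan is only printed.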

Related papers

Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing [21.525921468472685]
We present a scalable and computationally efficient model that connects pretrained multimodal large-language models (MLLMs) with video diffusion models for unified video generation and editing. Our key idea is to exploit the understanding and reasoning capabilities of MLLMs to produce explicit target captions to interpret user instructions. We evaluate the performance of Omni-Video 2 on the FiVE benchmark for fine-grained video editing and the VBench benchmark for text-to-video generation.
arXiv Detail & Related papers (2026-02-09T15:56:05Z)
Controllable Hybrid Captioner for Improved Long-form Video Understanding [0.24578723416255746]
Video data is extremely dense and high-dimensional. Text-based summaries of video content offer a way to represent content in a much more compact manner than raw. We introduce Vision Language Models (VLMs) to enrich the memory with static scene descriptions.
arXiv Detail & Related papers (2025-07-22T22:09:00Z)
VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation [67.31149310468801]
We introduce VEGGIE, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model.
arXiv Detail & Related papers (2025-03-18T15:31:12Z)
Get In Video: Add Anything You Want to the Video [48.06070610416688]
Video editing increasingly demands the ability to incorporate specific real-world instances into existing footage. Current approaches fail to capture the unique visual characteristics of particular subjects and ensure natural instance/scene interactions. We introduce "Get-In-Video Editing", where users provide reference images to precisely specify visual elements they wish to incorporate into videos.
arXiv Detail & Related papers (2025-03-08T16:27:53Z)
Text-Video Multi-Grained Integration for Video Moment Montage [13.794791614348084]
A new task called Video Moment Montage (VMM) aims to accurately locate the corresponding video segments based on a pre-provided narration text. We present a novel Text-Video Multi-Grained Integration method (TV-MGI) that efficiently fuses text features from the script with both shot-level and frame-level video features.
arXiv Detail & Related papers (2024-12-12T13:40:59Z)
VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement [63.4357918830628]
VideoRepair is a model-agnostic, training-free video refinement framework. It identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback. VideoRepair substantially outperforms recent baselines across various text-video alignment metrics.
arXiv Detail & Related papers (2024-11-22T18:31:47Z)
AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks [41.640692114423544]
We introduce AnyV2V, a novel tuning-free paradigm designed to simplify video editing. AnyV2V can leverage any existing image editing tools to support an extensive array of video editing tasks. Our evaluation shows that AnyV2V achieved CLIP-scores comparable to other baseline methods.
arXiv Detail & Related papers (2024-03-21T15:15:00Z)
Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment. Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules. It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z)
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprised of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis. For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure. For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z)
TaleCrafter: Interactive Story Visualization with Multiple Characters [49.14122401339003]
This paper proposes a system for generic interactive story visualization. It is capable of handling multiple novel characters and supporting the editing of layout and local structure. The system comprises four interconnected components: story-to-prompt generation (S2P), text-to-layout generation (T2L), controllable text-to-image generation (C-T2I), and image-to-video animation (I2V).
arXiv Detail & Related papers (2023-05-29T17:11:39Z)
InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions. Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
Video-P2P: Video Editing with Cross-attention Control [68.64804243427756]
Video-P2P is a novel framework for real-world video editing with cross-attention control. Video-P2P works well on real-world videos for generating new characters while optimally preserving their original poses and scenes.
arXiv Detail & Related papers (2023-03-08T17:53:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
