Automated Video Segmentation Machine Learning Pipeline
- URL: http://arxiv.org/abs/2507.07242v1
- Date: Wed, 09 Jul 2025 19:27:06 GMT
- Title: Automated Video Segmentation Machine Learning Pipeline
- Authors: Johannes Merz, Lucien Fostier
- Abstract summary: This paper presents an automated video segmentation pipeline that creates temporally consistent instance masks. It employs machine learning for: (1) flexible object detection via text prompts, (2) refined per-frame image segmentation, and (3) robust video tracking to ensure temporal stability.
- Score: 1.3198143828338367
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual effects (VFX) production often struggles with slow, resource-intensive mask generation. This paper presents an automated video segmentation pipeline that creates temporally consistent instance masks. It employs machine learning for: (1) flexible object detection via text prompts, (2) refined per-frame image segmentation and (3) robust video tracking to ensure temporal stability. Deployed using containerization and leveraging a structured output format, the pipeline was quickly adopted by our artists. It significantly reduces manual effort, speeds up the creation of preliminary composites, and provides comprehensive segmentation data, thereby enhancing overall VFX production efficiency.
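To make the three-stage design concrete, below is a minimal, self-contained sketch of how such a pipeline could be wired together. The stub functions are hypothetical stand-ins, since the abstract does not name the underlying detection, segmentation, or tracking models.

```python
# Hedged sketch of the three-stage pipeline described in the abstract:
# (1) text-prompted detection, (2) per-frame mask refinement, (3) temporal tracking.
import numpy as np

def detect_with_prompt(frame, prompt):
    """Stage 1 stub: text-prompted detection -> {object_id: bounding box}."""
    h, w = frame.shape[:2]
    return {0: (0, 0, w // 2, h // 2)}  # placeholder: one box for the prompt

def segment_frame(frame, box):
    """Stage 2 stub: refine a bounding box into a per-frame binary mask."""
    x0, y0, x1, y1 = box
    mask = np.zeros(frame.shape[:2], dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask

def segment_video(frames, prompt):
    """Stage 3: carry object identities through time so masks stay consistent."""
    boxes = detect_with_prompt(frames[0], prompt)
    masks = {obj_id: [] for obj_id in boxes}
    for frame in frames:
        # A real tracker would re-localize each object per frame; this stub
        # reuses the initial boxes purely to illustrate the data flow.
        for obj_id, box in boxes.items():
            masks[obj_id].append(segment_frame(frame, box))
    return masks

frames = [np.zeros((4, 4, 3), dtype=np.uint8)] * 3
print({k: len(v) for k, v in segment_video(frames, "the actor").items()})  # {0: 3}
```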
Related papers
- Fast SAM2 with Text-Driven Token Pruning [52.8350457627401]
Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt-driven video object segmentation. SAM2 pipelines propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object. We introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation.
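A minimal sketch of the pruning idea, assuming cosine similarity between visual tokens and a text embedding as the relevance score (the paper's actual scoring and keep ratios may differ):

```python
# Score each visual token against a text embedding and keep only the most
# relevant tokens before temporal propagation.
import numpy as np

def prune_tokens(tokens, text_emb, keep_ratio=0.25):
    """tokens: (N, D) visual tokens; text_emb: (D,) prompt embedding."""
    # Cosine similarity of each token to the text prompt.
    sims = tokens @ text_emb / (
        np.linalg.norm(tokens, axis=1) * np.linalg.norm(text_emb) + 1e-8
    )
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(sims)[-k:]  # indices of the top-k most relevant tokens
    return tokens[keep], keep

tokens = np.random.randn(64, 16).astype(np.float32)
text_emb = np.random.randn(16).astype(np.float32)
kept, idx = prune_tokens(tokens, text_emb)
print(kept.shape)  # (16, 16): 75% of tokens dropped before temporal reasoning
```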
arXiv Detail & Related papers (2025-12-24T18:59:05Z) - Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models [76.7535001311919]
State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they often fail to compose complex scenes or follow logical temporal instructions. We introduce Factorized Video Generation (FVG), a pipeline that decouples these tasks by decomposing text-to-video generation into three specialized stages. Our approach sets a new state of the art on the T2V CompBench benchmark and significantly improves all tested models on VBench2.
arXiv Detail & Related papers (2025-12-18T10:10:45Z) - DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion [62.589889759543446]
DriveGen3D is a novel framework for generating high-quality and highly controllable dynamic 3D driving scenes. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction.
arXiv Detail & Related papers (2025-10-17T03:00:08Z) - STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing [35.50656689789427]
STR-Match is a training-free video editing system that produces visually appealing and coherent videos. STR-Match consistently outperforms existing methods in both visual quality and temporal consistency.
arXiv Detail & Related papers (2025-06-28T12:36:19Z) - MaskFlow: Discrete Flows For Flexible and Efficient Long Video Generation [25.721829124345106]
We introduce MaskFlow, a unified video generation framework that combines discrete representations with flow matching. By leveraging a frame-level masking strategy during training, MaskFlow conditions on previously generated unmasked frames to generate videos ten times longer than its training sequences. We validate the quality of our method on the FaceForensics (FFS) and DeepMind Lab (DMLab) datasets and report Fréchet Video Distance (FVD) competitive with state-of-the-art approaches.
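The conditioning idea can be sketched as chunked generation that carries a few previously generated frames forward as unmasked context; `denoise` below is a placeholder for the flow-matching generator, and the window sizes are illustrative:

```python
import numpy as np

def denoise(context, n_new, shape=(8, 8, 3)):
    """Placeholder generator: emits n_new frames, conditioned (in a real
    model) on the unmasked `context` frames."""
    return [np.random.rand(*shape) for _ in range(n_new)]

def generate_long_video(total_frames, window=8, context_len=2):
    video = denoise([], window)             # first chunk: no context
    while len(video) < total_frames:
        context = video[-context_len:]      # previously generated frames
        video += denoise(context, window - context_len)
    return video[:total_frames]

print(len(generate_long_video(40)))  # 40 frames from a training window of 8
```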
arXiv Detail & Related papers (2025-02-16T18:59:11Z) - VFX Creator: Animated Visual Effect Generation with Controllable Diffusion Transformer [56.81599836980222]
We propose a novel paradigm for animated VFX generation as image animation, where dynamic effects are generated from user-friendly textual descriptions and static reference images. Our work makes two primary contributions: (i) Open-VFX, the first high-quality VFX video dataset spanning 15 diverse effect categories, annotated with textual descriptions and start-end timestamps for temporal control, and (ii) VFX Creator, a controllable VFX generation framework based on a Video Diffusion Transformer.
arXiv Detail & Related papers (2025-02-09T18:12:25Z) - Video Set Distillation: Information Diversification and Temporal Densification [68.85010825225528]
Video sets have two dimensions of redundancy: within-sample and inter-sample. We are the first to study Video Set Distillation, which synthesizes optimized video data by addressing within-sample and inter-sample redundancies.
arXiv Detail & Related papers (2024-11-28T05:37:54Z) - MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion [3.7270979204213446]
We present four key contributions to address the challenges of video processing. First, we introduce the 3D Inverted Vector-Quantization Variational Autoencoder. Second, we present MotionAura, a text-to-video generation framework. Third, we propose a spectral transformer-based denoising network. Fourth, we introduce a downstream task of Sketch Guided Video Inpainting.
arXiv Detail & Related papers (2024-10-10T07:07:56Z) - VidToMe: Video Token Merging for Zero-Shot Video Editing [100.79999871424931]
We propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames.
Our method improves temporal coherence and reduces memory consumption in self-attention computations.
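A toy version of cross-frame token merging, assuming cosine similarity and simple pair averaging (VidToMe's actual matching and merging schedule may differ):

```python
# Pair each token in frame t with its most similar token in frame t-1 and
# average near-duplicate pairs, so merged tokens are shared across frames
# and self-attention stays temporally coherent.
import numpy as np

def merge_tokens(prev, curr, threshold=0.9):
    """prev, curr: (N, D) token matrices from consecutive frames."""
    prev_n = prev / np.linalg.norm(prev, axis=1, keepdims=True)
    curr_n = curr / np.linalg.norm(curr, axis=1, keepdims=True)
    sims = curr_n @ prev_n.T             # (N, N) cosine similarities
    match = sims.argmax(axis=1)          # best match in the previous frame
    merged = curr.copy()
    for i, j in enumerate(match):
        if sims[i, j] > threshold:       # merge only near-duplicate tokens
            merged[i] = 0.5 * (curr[i] + prev[j])
    return merged

prev = np.random.randn(32, 8)
curr = prev + 0.01 * np.random.randn(32, 8)  # nearly identical next frame
print(np.allclose(merge_tokens(prev, curr), curr, atol=0.1))  # True
```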
arXiv Detail & Related papers (2023-12-17T09:05:56Z) - Edit Temporal-Consistent Videos with Image Diffusion Model [49.88186997567138]
Large-scale text-to-image (T2I) diffusion models have been extended for text-guided video editing.
The proposed method achieves state-of-the-art performance in both video temporal consistency and video editing capability.
arXiv Detail & Related papers (2023-08-17T16:40:55Z) - ReBotNet: Fast Real-time Video Enhancement [59.08038313427057]
Most restoration networks are slow, suffer from high computational bottlenecks, and cannot be used for real-time video enhancement.
In this work, we design an efficient and fast framework to perform real-time enhancement for practical use-cases like live video calls and video streams.
To evaluate our method, we curate two new datasets that emulate real-world video-call and streaming scenarios, and show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computation, reduced memory requirements, and faster inference time.
arXiv Detail & Related papers (2023-03-23T17:58:05Z) - Masked Contrastive Pre-Training for Efficient Video-Text Retrieval [37.05164804180039]
We present a simple yet effective end-to-end Video-language Pre-training (VidLP) framework, Masked Contrastive Video-language Pretraining (MAC).
Our MAC aims to reduce video representation's spatial and temporal redundancy in the VidLP model.
Coupling these designs enables efficient end-to-end pre-training: FLOPs are reduced by 60%, pre-training is accelerated by 3x, and performance improves.
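The masking idea can be illustrated by dropping patch and frame tokens before encoding, so compute scales with the tokens that survive; the keep ratios below are illustrative, not the paper's settings:

```python
import numpy as np

def mask_video_tokens(video_tokens, spatial_keep=0.4, temporal_keep=0.5, rng=None):
    """video_tokens: (T, N, D) per-frame patch tokens."""
    rng = rng or np.random.default_rng(0)
    T, N, _ = video_tokens.shape
    frames = rng.choice(T, size=max(1, int(T * temporal_keep)), replace=False)
    patches = rng.choice(N, size=max(1, int(N * spatial_keep)), replace=False)
    # Keep only the sampled frames and patches; the encoder never sees the rest.
    return video_tokens[np.ix_(np.sort(frames), np.sort(patches))]

tokens = np.random.randn(16, 196, 32)
print(mask_video_tokens(tokens).shape)  # (8, 78, 32): ~20% of the tokens remain
```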
arXiv Detail & Related papers (2022-12-02T05:44:23Z) - Video Mask Transfiner for High-Quality Video Instance Segmentation [102.50936366583106]
Video Mask Transfiner (VMT) is capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.
Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction.
We compare VMT with the most recent state-of-the-art methods on the HQ-YTVIS benchmark, as well as on YouTube-VIS, OVIS, and BDD100K MOTS.
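The train-and-self-correct cycle behind the annotation refinement can be sketched as a loop that retrains on the current annotations and overwrites them where the model is confident; `train` and `predict` below are toy stubs, not the authors' implementation:

```python
def train(dataset):
    # Stub: a "model" that just remembers the current annotations.
    return {id(s): s["annotation"] for s in dataset}

def predict(model, sample):
    # Stub: pretend the model proposes a refined mask with a confidence score.
    return sample["annotation"] + "_refined", 0.95

def refine_annotations(dataset, rounds=2, confidence=0.9):
    for _ in range(rounds):
        model = train(dataset)                    # retrain on current labels
        for sample in dataset:
            mask, score = predict(model, sample)  # propose a correction
            if score > confidence:                # accept confident refinements
                sample["annotation"] = mask
    return dataset

data = [{"annotation": "mask0"}]
print(refine_annotations(data))  # [{'annotation': 'mask0_refined_refined'}]
```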
arXiv Detail & Related papers (2022-07-28T11:13:37Z)