FuguReport

YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal

Authors: Chenyang Wu, Lina Lei, Fan Li, Chun-Le Guo, Dehong Kong, Xinran Qin, Zhixin Wang, Ming-Ming Cheng, Chongyi Li
Affiliations: NKIARI / Nankai University / Huawei
Categories: Method / Token Selection / Mask-based essential token selection; Method / Diffusion Models / Diffusion process simulation; Evaluation / Efficiency Evaluation / Speedup in video object removal
License: CC BY 4.0

Abstract Overview

This paper addresses the high inference cost of Diffusion Transformer (DiT)-based video object removal, where every spatiotemporal token is processed even when only a small masked region needs editing. The proposed YOSE framework introduces two components: Batch Variable-length Indexing (BVI), a differentiable operator that selects only masked-region tokens and supports variable-length batching, and a Diffusion Process Simulator (DiffSim) module that approximates the influence of unmasked regions during DiT self-attention via learnable combining, scaling, and bias parameters. A fusion strategy aligns mean and variance in overlapping boundary regions to reduce artifacts between restored and preserved content. Experiments on YouTube-VOS and DAVIS show that computation scales approximately linearly with mask size, yielding a speedup of 2.5× or more in roughly 70% of cases (those with mask ratios below 20%) while keeping reconstruction quality close to that of the full-token baseline models.
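The boundary fusion step can be illustrated with a small sketch. The summary does not give the exact formula, so this assumes a plain affine renormalization: restored values are shifted and scaled so that their mean and standard deviation inside the overlap band match those of the preserved content. The function name `align_stats` and the flat-list representation of pixels are illustrative assumptions, not names from the paper.

```python
from statistics import fmean, pstdev

def align_stats(restored, overlap_restored, overlap_preserved, eps=1e-6):
    """Affinely rescale `restored` so its statistics in the overlap band
    match the mean and standard deviation of the preserved content.
    Assumption: a simple per-region affine fit, not the paper's exact rule."""
    mu_r, sd_r = fmean(overlap_restored), pstdev(overlap_restored)
    mu_p, sd_p = fmean(overlap_preserved), pstdev(overlap_preserved)
    gain = sd_p / (sd_r + eps)  # eps guards against a flat overlap band
    return [(x - mu_r) * gain + mu_p for x in restored]

# Toy example: the last two restored values lie in the overlap band.
restored = [0.2, 0.4, 0.6, 0.8]
preserved_overlap = [0.5, 0.9]
aligned = align_stats(restored, restored[2:], preserved_overlap)
```

After the call, the overlap portion of `aligned` has approximately the same mean and spread as `preserved_overlap`, which is the property the fusion strategy relies on to hide the seam.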

Novelty

The main novelty is a lightweight, plug-in fine-tuning framework for DiT-based video object removal that makes token processing explicitly mask-aware without redesigning the backbone architecture. Its distinctive elements are the differentiable BVI operator for variable-length essential-token selection via continuous coordinate-mapping (grid_sample), and the DiffSim module that simulates unmasked-region diffusion context through learnable combining, scaling, and bias parameters applied per DiT block, so that only masked tokens require full DiT processing.
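The variable-length batching side of BVI can be sketched as follows. This is a minimal illustration of the bookkeeping only: it uses hard integer indexing on Python lists, whereas the summary describes a differentiable operator built on continuous coordinate mapping (grid_sample), which this sketch does not reproduce. The function names `select_masked_tokens` and `scatter_back` are hypothetical.

```python
def select_masked_tokens(tokens, masks, pad_value=0.0):
    """Gather masked tokens per sample and pad them to a common length.

    tokens: list (batch) of per-token values.
    masks:  same shape, truthy where a token lies inside the removal mask.
    Returns (padded_tokens, indices, valid)."""
    indices = [[i for i, m in enumerate(row) if m] for row in masks]
    max_len = max(len(idx) for idx in indices)
    padded, valid = [], []
    for row, idx in zip(tokens, indices):
        picked = [row[i] for i in idx]
        pad = max_len - len(picked)
        padded.append(picked + [pad_value] * pad)
        valid.append([True] * len(picked) + [False] * pad)
    return padded, indices, valid

def scatter_back(tokens, processed, indices):
    """Write processed masked tokens back into copies of the full sequences."""
    out = [list(row) for row in tokens]
    for row, proc, idx in zip(out, processed, indices):
        for value, i in zip(proc, idx):  # zip stops before the padding entries
            row[i] = value
    return out

tokens = [[10, 11, 12, 13], [20, 21, 22, 23]]
masks = [[0, 1, 1, 0], [0, 0, 0, 1]]
padded, indices, valid = select_masked_tokens(tokens, masks)
processed = [[-v for v in row] for row in padded]  # stand-in for DiT blocks
restored = scatter_back(tokens, processed, indices)
```

Only the padded masked tokens pass through the expensive stage, and the `valid` flags let later stages ignore padding when samples in a batch have masks of different sizes.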

Results

YOSE achieves 3.3× acceleration at a 5% mask ratio and 2.5× at a 20% mask ratio, with worst-case runtime converging to the baseline when most tokens are masked. Applied to MiniMax Remover, it closely preserves quality: background PSNR rises from 30.33 to 31.01 dB on YouTube-VOS, with negligible metric changes on DAVIS. Applied to VACE, YOSE improves background PSNR by 5.47 dB on YouTube-VOS and 3.19 dB on DAVIS, and raises the object removal success rate from 62.2% to 97.8%.
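The two reported speedups can be turned into a rough cost model. The sketch below assumes that runtime relative to the baseline is linear in the mask ratio r, t(r) = a + b·r, and fits a and b through the two reported points; the linear form and the names `fit_linear_cost` and `speedup` are assumptions made for illustration, and any value read off at other ratios is an extrapolation, not a reported result.

```python
def fit_linear_cost(r1, s1, r2, s2):
    """Fit relative runtime t(r) = a + b * r through two
    (mask ratio, speedup) points, using t = 1 / speedup."""
    t1, t2 = 1.0 / s1, 1.0 / s2
    b = (t2 - t1) / (r2 - r1)
    a = t1 - b * r1
    return a, b

# Reported points: 3.3x at a 5% mask ratio, 2.5x at a 20% mask ratio.
a, b = fit_linear_cost(0.05, 3.3, 0.20, 2.5)

def speedup(r):
    """Speedup predicted by the fitted linear cost model at mask ratio r."""
    return 1.0 / (a + b * r)
```

Under this model the intercept a is about 0.27, which can be read as a fixed overhead that does not shrink with the mask, and the predicted speedup at r = 1 is roughly 1.1×, in line with the statement that worst-case runtime converges to the baseline.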

Key Points

  1. YOSE reduces redundant DiT computation by selecting and processing only masked-region tokens through a differentiable variable-length indexing scheme (BVI) that uses continuous grid sampling to maintain gradient flow.
  2. DiffSim supplies simulated key-value context from unmasked regions via learnable combining, scaling, and bias parameters per DiT block, enabling masked-token restoration to remain semantically consistent without full-token attention over the entire video.
  3. Empirically, runtime scales approximately linearly with mask ratio (up to 3.3× speedup at a 5% mask ratio) with largely preserved visual quality, and the method generalizes to VACE, where it additionally suppresses mask-shaped hallucination artifacts and raises the removal success rate from 62.2% to 97.8%.
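The idea in the second key point can be made concrete with a toy example. The summary names the ingredients (learnable combining, scaling, and bias parameters per DiT block) but not how they are wired together, so everything below is one hypothetical reading: a stand-in for unmasked context is formed by mixing two sources and applying an affine transform, and masked-token queries then attend over masked keys plus that stand-in. Features are scalars for brevity, and the names `simulate_context` and `attend` are not from the paper.

```python
import math

def simulate_context(clean, noise, combine, scale, bias):
    """Hypothetical DiffSim-style stand-in for unmasked-region features:
    mix two sources, then apply a per-block scale and bias."""
    mixed = [combine * c + (1.0 - combine) * n for c, n in zip(clean, noise)]
    return [scale * m + bias for m in mixed]

def attend(queries, keys, values):
    """Scalar softmax attention: each query returns a weighted mean of values."""
    outputs = []
    for q in queries:
        logits = [q * k for k in keys]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(x - top) for x in logits]
        total = sum(weights)
        outputs.append(sum(w * v for w, v in zip(weights, values)) / total)
    return outputs

masked = [0.5, -1.0]                # tokens that receive full processing
unmasked_clean = [1.0, 2.0, 3.0]    # preserved-region features
unmasked_noise = [0.1, -0.2, 0.05]  # fixed noise sample for the toy example
context = simulate_context(unmasked_clean, unmasked_noise,
                           combine=0.8, scale=1.1, bias=0.02)
keys = masked + context
values = masked + context
restored = attend(masked, keys, values)
```

The point of the sketch is the cost structure: queries exist only for the two masked tokens, while the three context entries are produced by a cheap elementwise transform instead of being carried through the attention stack themselves.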

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.