YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal
Abstract Overview
This paper addresses the high inference cost of Diffusion Transformer (DiT)-based video object removal, where full spatiotemporal token processing is performed even when only small masked regions need editing. The proposed YOSE framework introduces two components: Batch Variable-length Indexing (BVI), a differentiable operator that selects only masked-region tokens and supports variable-length batching, and a Diffusion Process Simulator (DiffSim) module that approximates the influence of unmasked regions during DiT self-attention via learnable combining, scaling, and bias parameters. A fusion strategy aligns mean and variance in overlapping boundary regions to reduce artifacts between restored and preserved content. Experiments on YouTube-VOS and DAVIS demonstrate that computation scales approximately linearly with mask size, yielding up to 2.5× speedup in roughly 70% of cases (where mask ratios are below 20%) while maintaining reconstruction quality close to the full-token baseline models.
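The boundary fusion step can be illustrated with a small statistic-matching sketch. This is a minimal illustration, assuming that "aligning mean and variance" means fitting a per-channel gain and shift on the overlap region and applying them to the restored values. The function names `fit_affine` and `apply_affine` are hypothetical, and flat lists of floats stand in for feature tensors; the paper's actual fusion rule may differ.

```python
from statistics import fmean, pstdev


def fit_affine(restored_overlap, preserved_overlap, eps=1e-6):
    """Fit gain/shift so restored overlap statistics match the preserved ones.

    Assumption: one scalar channel; both inputs cover the same overlap pixels.
    """
    mu_r, sd_r = fmean(restored_overlap), pstdev(restored_overlap)
    mu_p, sd_p = fmean(preserved_overlap), pstdev(preserved_overlap)
    gain = sd_p / (sd_r + eps)      # match the standard deviation
    shift = mu_p - gain * mu_r      # then match the mean
    return gain, shift


def apply_affine(values, gain, shift):
    """Apply the fitted correction to restored values."""
    return [gain * v + shift for v in values]


# Toy overlap region: restored content is brighter-contrast and darker
# than the preserved background it should blend into.
restored = [2.0, 4.0, 6.0]
preserved = [10.0, 11.0, 12.0]
gain, shift = fit_affine(restored, preserved)
aligned = apply_affine(restored, gain, shift)
```

After the correction, the restored overlap values share the mean and (up to `eps`) the standard deviation of the preserved ones, which is the property the fusion strategy relies on to hide seams.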
Novelty
The main novelty is a lightweight, plug-in fine-tuning framework for DiT-based video object removal that makes token processing explicitly mask-aware without redesigning the backbone architecture. Its distinctive elements are the differentiable BVI operator for variable-length essential-token selection via continuous coordinate-mapping (grid_sample), and the DiffSim module that simulates unmasked-region diffusion context through learnable combining, scaling, and bias parameters applied per DiT block, so that only masked tokens require full DiT processing.
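The idea behind variable-length selection through continuous coordinates can be sketched in one dimension. This is a simplified stand-in, not the paper's operator: it assumes a 1-D token sequence, uses plain Python lists instead of tensors, and replaces `grid_sample` with hand-written linear interpolation. The names `sample_1d`, `select_masked`, and `scatter_back` are hypothetical.

```python
import math


def sample_1d(tokens, coord):
    """Linearly interpolate a token sequence at a continuous position."""
    coord = min(max(coord, 0.0), len(tokens) - 1.0)  # clamp to valid range
    lo = int(math.floor(coord))
    hi = min(lo + 1, len(tokens) - 1)
    frac = coord - lo
    return [(1.0 - frac) * a + frac * b for a, b in zip(tokens[lo], tokens[hi])]


def select_masked(batch_tokens, batch_masks, pad_value=0.0):
    """Gather masked tokens per sample and pad to the longest sample."""
    coords = [[float(i) for i, m in enumerate(mask) if m] for mask in batch_masks]
    lengths = [len(c) for c in coords]
    max_len = max(lengths)
    dim = len(batch_tokens[0][0])
    packed = []
    for tokens, c in zip(batch_tokens, coords):
        rows = [sample_1d(tokens, x) for x in c]
        rows += [[pad_value] * dim for _ in range(max_len - len(rows))]
        packed.append(rows)
    return packed, coords, lengths


def scatter_back(batch_tokens, packed, coords, lengths):
    """Write processed tokens back; unmasked tokens are left untouched."""
    out = [[list(t) for t in tokens] for tokens in batch_tokens]
    for b, (c, n) in enumerate(zip(coords, lengths)):
        for j in range(n):
            out[b][int(c[j])] = list(packed[b][j])
    return out


tokens = [[[0.0], [1.0], [2.0], [3.0]], [[10.0], [11.0], [12.0], [13.0]]]
masks = [[0, 1, 1, 0], [0, 0, 0, 1]]  # 2 masked tokens vs. 1 masked token
packed, coords, lengths = select_masked(tokens, masks)
restored = scatter_back(tokens, packed, coords, lengths)
```

At integer coordinates the interpolation reduces to ordinary indexing, while fractional coordinates blend neighbouring tokens. That continuity is what lets gradients flow through a sampling-based selector, in contrast to hard integer indexing.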
Results
YOSE achieves 3.3× acceleration at a 5% mask ratio and 2.5× at a 20% mask ratio, with worst-case runtime converging to the baseline when most tokens are masked. Applied to MiniMax Remover, it closely preserves quality, improving background PSNR from 30.33 to 31.01 dB on YouTube-VOS with negligible metric changes on DAVIS. When applied to VACE, YOSE improves background PSNR by 5.47 dB on YouTube-VOS and 3.19 dB on DAVIS, and raises the object removal success rate from 62.2% to 97.8%.
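The two reported speedups can be turned into a rough cost model. This is a back-of-the-envelope sketch, assuming relative runtime is exactly linear in mask ratio, t(r) = a + b·r, and fitting the line through the two reported points (3.3× at 5%, 2.5× at 20%). The function names are hypothetical, and values at other mask ratios are extrapolations, not measurements from the paper.

```python
def fit_linear_cost(r1, speedup1, r2, speedup2):
    """Fit relative runtime t(r) = a + b * r through two (ratio, speedup) points."""
    t1, t2 = 1.0 / speedup1, 1.0 / speedup2  # runtime relative to baseline
    slope = (t2 - t1) / (r2 - r1)
    intercept = t1 - slope * r1
    return intercept, slope


def predicted_speedup(mask_ratio, intercept, slope):
    """Speedup implied by the fitted linear runtime model."""
    return 1.0 / (intercept + slope * mask_ratio)


intercept, slope = fit_linear_cost(0.05, 3.3, 0.20, 2.5)
for ratio in (0.05, 0.20, 0.50, 1.00):
    estimate = predicted_speedup(ratio, intercept, slope)
    print(f"mask ratio {ratio:.2f}: ~{estimate:.2f}x")
```

Under this assumption the fitted line extrapolates to roughly baseline runtime at a 100% mask ratio, which is consistent with the stated worst case. The nonzero intercept reflects fixed overhead that does not shrink with the mask.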
Key Points
- YOSE reduces redundant DiT computation by selecting and processing only masked-region tokens through a differentiable variable-length indexing scheme (BVI) that uses continuous grid sampling to maintain gradient flow.
- DiffSim supplies simulated key-value context from unmasked regions via learnable combining, scaling, and bias parameters per DiT block, enabling masked-token restoration to remain semantically consistent without full-token attention over the entire video.
- Empirically, acceleration depends on the mask ratio, with computation scaling approximately linearly in mask size (up to 3.3× at a 5% mask ratio), while visual quality is largely preserved. The method also generalizes to VACE, where it additionally suppresses mask-shaped hallucination artifacts and raises the removal success rate from 62.2% to 97.8%.
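The role of simulated key-value context can be sketched with a toy attention step. This is a loose illustration rather than the paper's DiffSim module: it assumes the simulated context is computed as scale × (combine · unmasked features) + bias, that this context is appended to the keys and values seen by masked-token queries, and that query/key/value projections are identities. All names (`simulate_context`, `attend`, `combine`, `scale`, `bias`) are hypothetical, and the parameters are fixed constants here rather than learned.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def simulate_context(unmasked_feats, combine, scale, bias):
    """Mix unmasked features into a few context tokens (assumed form)."""
    dim = len(unmasked_feats[0])
    context = []
    for weights in combine:  # one weight row per simulated context token
        mixed = [sum(w * f[d] for w, f in zip(weights, unmasked_feats))
                 for d in range(dim)]
        context.append([scale * v + bias[d] for d, v in enumerate(mixed)])
    return context


def attend(queries, keys, values):
    """Scaled dot-product attention with identity projections."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


masked = [[1.0, 0.0]]                  # the only tokens that get full processing
unmasked = [[0.0, 2.0], [0.0, 4.0]]    # stand-in background features
context = simulate_context(unmasked, combine=[[0.5, 0.5]], scale=1.0,
                           bias=[0.0, 0.0])
# Masked queries see their own tokens plus the simulated background context.
output = attend(masked, masked + context, masked + context)
```

The attention cost here grows with the number of masked tokens plus a small, fixed number of context tokens, rather than with the full token count, which is the saving the key points describe.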