Abstract: This paper studies video inpainting detection, which localizes an inpainted
region in a video both spatially and temporally. In particular, we introduce
VIDNet, Video Inpainting Detection Network, which contains a two-stream
encoder-decoder architecture with an attention module. To reveal artifacts encoded
in compression, VIDNet additionally takes in Error Level Analysis frames to
augment RGB frames, producing multimodal features at different levels with an
encoder. Exploring spatial and temporal relationships, these features are
further decoded by a Convolutional LSTM to predict masks of inpainted regions.
In addition, to decide whether a pixel is inpainted, we present a
quad-directional local attention module that borrows information from the
pixels surrounding it in four directions. Extensive experiments are conducted to
validate our approach. We demonstrate, among other things, that VIDNet not only
outperforms alternative inpainting detection methods by clear margins but also
generalizes well to novel videos that are unseen during training.
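
As an illustration of the Error Level Analysis input mentioned above, the following is a minimal sketch of how an ELA frame is commonly computed: the frame is re-saved as JPEG at a known quality and the absolute per-pixel difference from the original is taken. The function name `error_level_analysis` and the quality value of 90 are assumptions made for this example; the abstract does not state the settings used by VIDNet. The sketch relies on the Pillow library.

```python
import io

from PIL import Image, ImageChops


def error_level_analysis(frame: Image.Image, quality: int = 90) -> Image.Image:
    """Return the absolute difference between a frame and its JPEG re-save.

    `quality` is an assumed value, not one taken from the paper.
    """
    rgb = frame.convert("RGB")

    # Re-compress the frame in memory at the chosen JPEG quality.
    buffer = io.BytesIO()
    rgb.save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    recompressed = Image.open(buffer).convert("RGB")

    # Regions with a different compression history tend to show
    # a different error level in this difference image.
    return ImageChops.difference(rgb, recompressed)


# Usage on a synthetic frame with a pasted patch (illustrative data only).
frame = Image.new("RGB", (64, 64), (120, 60, 200))
frame.paste((10, 220, 30), (16, 16, 32, 32))
ela_frame = error_level_analysis(frame)
```

The resulting ELA frame has the same size as the RGB frame, so the two can be fed to parallel encoder streams in the manner the abstract describes.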