DinoLizer: Learning from the Best for Generative Inpainting Localization
- URL: http://arxiv.org/abs/2511.20722v1
- Date: Tue, 25 Nov 2025 08:37:24 GMT
- Title: DinoLizer: Learning from the Best for Generative Inpainting Localization
- Authors: Minh Thong Doi, Jan Butora, Vincent Itier, Jérémie Boulanger, Patrick Bas
- Abstract summary: We introduce DinoLizer, a DINOv2-based model for localizing manipulated regions in generative inpainting. Our method builds on a DINOv2 model pretrained to detect synthetic images on the B-Free dataset. DinoLizer achieves a 12% higher Intersection-over-Union (IoU) than the next best model, with even greater gains after post-processing.
- Score: 11.535245730074285
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce DinoLizer, a DINOv2-based model for localizing manipulated regions in generative inpainting. Our method builds on a DINOv2 model pretrained to detect synthetic images on the B-Free dataset. We add a linear classification head on top of the Vision Transformer's patch embeddings to predict manipulations at a $14\times 14$ patch resolution. The head is trained to focus on semantically altered regions, treating non-semantic edits as part of the original content. Because the ViT accepts only fixed-size inputs, we use a sliding-window strategy to aggregate predictions over larger images; the resulting heatmaps are post-processed to refine the estimated binary manipulation masks. Empirical results show that DinoLizer surpasses state-of-the-art local manipulation detectors on a range of inpainting datasets derived from different generative models. It remains robust to common post-processing operations such as resizing, noise addition, and JPEG (double) compression. On average, DinoLizer achieves a 12\% higher Intersection-over-Union (IoU) than the next best model, with even greater gains after post-processing. Our experiments with off-the-shelf DINOv2 demonstrate the strong representational power of Vision Transformers for this task. Finally, extensive ablation studies comparing DINOv2 and its successor, DINOv3, in deepfake localization confirm DinoLizer's superiority. The code will be publicly available upon acceptance of the paper.
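Since the code is not yet released, the following is a minimal PyTorch sketch of the inference pipeline the abstract describes: a frozen DINOv2 backbone, a linear head over the ViT's patch tokens, and sliding-window aggregation of per-patch logits into a full-image heatmap. The window size (518), stride (256), the `LinearPatchHead` class, and the final 0.5 threshold are illustrative assumptions, not details from the paper.

```python
# Minimal sketch (PyTorch), assuming a frozen DINOv2 ViT-L/14 backbone.
# LinearPatchHead, WINDOW/STRIDE values, and the 0.5 threshold are
# illustrative assumptions; the trained head weights would come from the
# training procedure described in the paper, which is not released yet.
import torch
import torch.nn as nn
import torch.nn.functional as F

WINDOW = 518   # assumed ViT input size (518 = 37 * 14)
PATCH = 14     # ViT patch size: predictions live on a 37x37 grid
STRIDE = 256   # assumed sliding-window stride

# Real torch.hub entry point for DINOv2 (embedding dim 1024 for ViT-L/14).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval()

class LinearPatchHead(nn.Module):
    """One logit per 14x14 patch: manipulated vs. original content."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.fc = nn.Linear(dim, 1)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, dim) -> (B, N) logits
        return self.fc(patch_tokens).squeeze(-1)

head = LinearPatchHead().eval()

@torch.no_grad()
def predict_heatmap(img: torch.Tensor) -> torch.Tensor:
    """Average per-patch probabilities over sliding windows.

    `img` is a normalized (1, 3, H, W) tensor with H, W >= WINDOW.
    """
    _, _, H, W = img.shape
    heat = torch.zeros(H, W)
    count = torch.zeros(H, W)
    # Snap the last window to the border so the whole image is covered.
    ys = sorted(set(range(0, H - WINDOW + 1, STRIDE)) | {H - WINDOW})
    xs = sorted(set(range(0, W - WINDOW + 1, STRIDE)) | {W - WINDOW})
    g = WINDOW // PATCH
    for y in ys:
        for x in xs:
            crop = img[:, :, y:y + WINDOW, x:x + WINDOW]
            tokens = backbone.forward_features(crop)["x_norm_patchtokens"]
            logits = head(tokens).reshape(1, 1, g, g)
            # Upsample the patch-level logits back to pixel resolution.
            up = F.interpolate(logits, size=(WINDOW, WINDOW), mode="bilinear")
            heat[y:y + WINDOW, x:x + WINDOW] += torch.sigmoid(up)[0, 0]
            count[y:y + WINDOW, x:x + WINDOW] += 1
    return heat / count

# Illustrative usage: threshold the averaged heatmap into a binary mask
# before any morphological cleanup (the paper's post-processing step).
mask = predict_heatmap(torch.randn(1, 3, 768, 1024)) > 0.5
```

Averaging sigmoid probabilities over overlapping windows is one plausible aggregation rule; the abstract does not specify how overlapping predictions are merged.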
Related papers
- From Editor to Dense Geometry Estimator [77.21804448599009]
We introduce FE2E, a framework that adapts an advanced editing model based on the Diffusion Transformer (DiT) architecture for dense geometry prediction. FE2E achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100× the data.
arXiv Detail & Related papers (2025-09-04T15:58:50Z)
- Unified Human Localization and Trajectory Prediction with Monocular Vision [64.19384064365431]
MonoTransmotion is a Transformer-based framework that uses only a monocular camera to jointly solve localization and prediction tasks. We show that by jointly training both tasks with our unified framework, our method is more robust in real-world scenarios with noisy inputs.
arXiv Detail & Related papers (2025-03-05T14:18:39Z)
- SciceVPR: Stable Cross-Image Correlation Enhanced Model for Visual Place Recognition [4.540127373592404]
Visual Place Recognition (VPR) is a major challenge for robotics and autonomous systems. This paper proposes a stable cross-image correlation enhanced model for VPR, called SciceVPR.
arXiv Detail & Related papers (2025-02-28T03:05:30Z)
- Denoising Vision Transformers [43.03068202384091]
We propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT).
In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis.
In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision.
arXiv Detail & Related papers (2024-01-05T18:59:52Z)
- IterInv: Iterative Inversion for Pixel-Level T2I Models [16.230193725587807]
DDIM inversion is a prevalent practice rooted in Latent Diffusion Models (LDMs).
Large pretrained T2I models operating in the latent space lose detail in the initial compression stage of their autoencoder.
We develop an iterative inversion (IterInv) technique for this category of T2I models and verify IterInv with the open-source DeepFloyd-IF model.
arXiv Detail & Related papers (2023-10-30T13:47:46Z)
- RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving [80.14669385741202]
Vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks.
ViTs are notoriously hard to train and require a lot of training data to learn powerful representations.
We show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and Semantic KITTI.
arXiv Detail & Related papers (2023-01-24T18:50:48Z)
- N2V2 -- Fixing Noise2Void Checkerboard Artifacts with Modified Sampling Strategies and a Tweaked Network Architecture [66.03918859810022]
We present two modifications to the vanilla N2V setup that both help to reduce the unwanted artifacts considerably.
We validate our modifications on a range of microscopy and natural image data.
arXiv Detail & Related papers (2022-11-15T21:12:09Z)
- GradViT: Gradient Inversion of Vision Transformers [83.54779732309653]
We demonstrate the vulnerability of vision transformers (ViTs) to gradient-based inversion attacks.
We introduce a method, named GradViT, that optimizes random noise into natural-looking images.
We observe unprecedentedly high fidelity and closeness to the original (hidden) data.
arXiv Detail & Related papers (2022-03-22T17:06:07Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.