Depth-Copy-Paste: Multimodal and Depth-Aware Compositing for Robust Face Detection
- URL: http://arxiv.org/abs/2512.11683v1
- Date: Fri, 12 Dec 2025 16:02:42 GMT
- Title: Depth-Copy-Paste: Multimodal and Depth-Aware Compositing for Robust Face Detection
- Authors: Qiushi Guo
- Abstract summary: Depth Copy Paste is a multimodal and depth-aware augmentation framework for face detection training. It generates diverse and physically consistent face detection training samples by copying full-body person instances and pasting them into semantically compatible scenes. Our approach first employs BLIP and CLIP to jointly assess semantic and visual coherence, enabling automatic retrieval of the most suitable background images. For geometric realism, we introduce a depth-guided sliding window placement mechanism that searches over the background depth map to identify paste locations with optimal depth continuity and scale alignment.
- Score: 2.0813318162800702
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data augmentation is crucial for improving the robustness of face detection systems, especially under challenging conditions such as occlusion, illumination variation, and complex environments. Traditional copy-paste augmentation often produces unrealistic composites due to inaccurate foreground extraction, inconsistent scene geometry, and mismatched background semantics. To address these limitations, we propose Depth Copy Paste, a multimodal and depth-aware augmentation framework that generates diverse and physically consistent face detection training samples by copying full-body person instances and pasting them into semantically compatible scenes. Our approach first employs BLIP and CLIP to jointly assess semantic and visual coherence, enabling automatic retrieval of the most suitable background images for the given foreground person. To ensure high-quality foreground masks that preserve facial details, we integrate SAM3 for precise segmentation and Depth-Anything to extract only the non-occluded visible person regions, preventing corrupted facial textures from being used in augmentation. For geometric realism, we introduce a depth-guided sliding window placement mechanism that searches over the background depth map to identify paste locations with optimal depth continuity and scale alignment. The resulting composites exhibit natural depth relationships and improved visual plausibility. Extensive experiments show that Depth Copy Paste provides more diverse and realistic training data, leading to significant performance improvements in downstream face detection tasks compared with traditional copy-paste and depth-free augmentation methods.
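The abstract does not give the placement algorithm in detail, so the following is only a minimal sketch of what a depth-guided sliding window search could look like, not the paper's implementation. The function names (`find_paste_location`, `scale_for_depth`), the scoring rule (median depth mismatch plus depth standard deviation), and the convention that larger depth values mean farther from the camera are all assumptions made for illustration.

```python
import numpy as np


def find_paste_location(bg_depth, fg_depth, fg_h, fg_w, stride=8):
    """Slide an fg_h x fg_w window over bg_depth and return (top, left, score)
    for the window with the lowest score, or None if the window does not fit.

    Hypothetical scoring (an assumption, not taken from the paper):
      score = |median(window depth) - fg_depth| + std(window depth)
    The first term favors depth continuity with the pasted person, the second
    penalizes windows that straddle large depth changes.
    """
    height, width = bg_depth.shape
    best = None
    for top in range(0, height - fg_h + 1, stride):
        for left in range(0, width - fg_w + 1, stride):
            window = bg_depth[top:top + fg_h, left:left + fg_w]
            mismatch = abs(float(np.median(window)) - fg_depth)
            roughness = float(window.std())
            score = mismatch + roughness
            if best is None or score < best[2]:
                best = (top, left, score)
    return best


def scale_for_depth(src_depth, dst_depth):
    """Resize factor for moving an object from src_depth to dst_depth.

    Assumes a pinhole camera and metric-like depth, where apparent size is
    inversely proportional to distance.
    """
    return src_depth / dst_depth
```

In this sketch, a caller would estimate the foreground person's depth, pick the window with the lowest score, and resize the person by `scale_for_depth` before pasting. A relative or inverse depth map (as some monocular estimators produce) would need to be converted or the scoring adapted first.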
Related papers
- Training Self-Supervised Depth Completion Using Sparse Measurements and a Single Image [2.3874115898130865]
We propose a novel self-supervised depth completion paradigm that requires only sparse depth measurements and their corresponding image for training. By leveraging the characteristics of depth distribution, we design novel loss functions that effectively propagate depth information from observed points to unobserved regions.
arXiv Detail & Related papers (2025-07-20T07:24:09Z) - Exploring Depth Information for Detecting Manipulated Face Videos [36.36293334402051]
The face depth map has been shown to be effective in other areas such as face recognition or face detection. We propose a Face Depth Map Transformer (FDMT) to estimate the face depth map patch by patch from an RGB face image. The estimated face depth map is then considered as auxiliary information to be integrated with the backbone features.
arXiv Detail & Related papers (2024-11-27T18:16:11Z) - Unveiling the Depths: A Multi-Modal Fusion Framework for Challenging Scenarios [103.72094710263656]
This paper presents a novel approach that identifies and integrates dominant cross-modality depth features with a learning-based framework.
We propose a novel confidence loss steering a confidence predictor network to yield a confidence map specifying latent potential depth areas.
With the resulting confidence map, we propose a multi-modal fusion network that fuses the final depth in an end-to-end manner.
arXiv Detail & Related papers (2024-02-19T04:39:16Z) - DeepFidelity: Perceptual Forgery Fidelity Assessment for Deepfake Detection [67.3143177137102]
Deepfake detection refers to detecting artificially generated or edited faces in images or videos.
We propose a novel Deepfake detection framework named DeepFidelity to adaptively distinguish real and fake faces.
arXiv Detail & Related papers (2023-12-07T07:19:45Z) - COMICS: End-to-end Bi-grained Contrastive Learning for Multi-face Forgery Detection [56.7599217711363]
Most face forgery recognition methods can only process one face at a time.
We propose COMICS, an end-to-end framework for multi-face forgery detection.
arXiv Detail & Related papers (2023-08-03T03:37:13Z) - Exploring Depth Information for Face Manipulation Detection [25.01910127502075]
We propose a Face Depth Map Transformer (FDMT) to estimate the face depth map patch by patch from an RGB face image.
The estimated face depth map is then considered as auxiliary information to be integrated with the backbone features.
arXiv Detail & Related papers (2022-12-29T09:00:22Z) - 3D Dense Geometry-Guided Facial Expression Synthesis by Adversarial Learning [54.24887282693925]
We propose a novel framework to exploit 3D dense (depth and surface normals) information for expression manipulation.
We use an off-the-shelf state-of-the-art 3D reconstruction model to estimate the depth and create a large-scale RGB-Depth dataset.
Our experiments demonstrate that the proposed method outperforms the competitive baseline and existing arts by a large margin.
arXiv Detail & Related papers (2020-09-30T17:12:35Z) - Deep Learning-based Single Image Face Depth Data Enhancement [15.41435352543715]
This work proposes a deep learning face depth enhancement method in this context.
Deep learning enhancers yield noticeably better results than the tested preexisting enhancers.
arXiv Detail & Related papers (2020-06-19T11:52:38Z) - Deep Spatial Gradient and Temporal Depth Learning for Face Anti-spoofing [61.82466976737915]
Depth supervised learning has been proven as one of the most effective methods for face anti-spoofing.
We propose a new approach to detect presentation attacks from multiple frames based on two insights.
The proposed approach achieves state-of-the-art results on five benchmark datasets.
arXiv Detail & Related papers (2020-03-18T06:11:20Z) - DiverseDepth: Affine-invariant Depth Prediction Using Diverse Data [110.29043712400912]
We present a method for depth estimation with monocular images, which can predict high-quality depth on diverse scenes up to an affine transformation.
Experiments show that our method outperforms previous methods on 8 datasets by a large margin with the zero-shot test setting.
arXiv Detail & Related papers (2020-02-03T05:38:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.