High-Quality Proposal Encoding and Cascade Denoising for Imaginary Supervised Object Detection
- URL: http://arxiv.org/abs/2511.08018v1
- Date: Wed, 12 Nov 2025 01:34:16 GMT
- Title: High-Quality Proposal Encoding and Cascade Denoising for Imaginary Supervised Object Detection
- Authors: Zhiyuan Chen, Yuelin Guo, Zitong Huang, Haoyu He, Renhao Lu, Weizhe Zhang
- Abstract summary: Existing synthetic datasets for object detection suffer from simplistic prompts, poor image quality, and weak supervision. We propose Cascade HQP-DETR to address these limitations. First, we introduce a high-quality data pipeline using LLaMA-3, Flux, and Grounding DINO to generate the FluxVOC and FluxCOCO datasets. Second, our High-Quality Proposal guided query encoding initializes object queries with image-specific priors from SAM-generated proposals. Third, our cascade denoising algorithm dynamically adjusts training weights through progressively increasing IoU thresholds across decoder layers.
- Score: 20.075203668387136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Object detection models demand large-scale annotated datasets, which are costly and labor-intensive to create. This motivated Imaginary Supervised Object Detection (ISOD), where models train on synthetic images and test on real images. However, existing methods face three limitations: (1) synthetic datasets suffer from simplistic prompts, poor image quality, and weak supervision; (2) DETR-based detectors, due to their random query initialization, struggle with slow convergence and overfitting to synthetic patterns, hindering real-world generalization; (3) uniform denoising pressure promotes model overfitting to pseudo-label noise. We propose Cascade HQP-DETR to address these limitations. First, we introduce a high-quality data pipeline using LLaMA-3, Flux, and Grounding DINO to generate the FluxVOC and FluxCOCO datasets, advancing ISOD from weak to full supervision. Second, our High-Quality Proposal guided query encoding initializes object queries with image-specific priors from SAM-generated proposals and RoI-pooled features, accelerating convergence while steering the model to learn transferable features instead of overfitting to synthetic patterns. Third, our cascade denoising algorithm dynamically adjusts training weights through progressively increasing IoU thresholds across decoder layers, guiding the model to learn robust boundaries from reliable visual cues rather than overfitting to noisy labels. Trained for just 12 epochs solely on FluxVOC, Cascade HQP-DETR achieves a SOTA 61.04% mAP@0.5 on PASCAL VOC 2007, outperforming strong baselines, with its competitive real-data performance confirming the architecture's universal applicability.
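The abstract describes cascade denoising only at a high level: IoU thresholds rise with decoder depth and modulate training weights. The sketch below is one possible reading, not the authors' implementation. It assumes a linear threshold schedule from 0.5 to 0.75 and a hard 0/1 weight per layer; the function names `layer_thresholds` and `cascade_weights`, the schedule endpoints, and the gating rule are all hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def layer_thresholds(num_layers: int, start: float = 0.5,
                     stop: float = 0.75) -> List[float]:
    """Assumed linear schedule: the IoU threshold rises with decoder depth."""
    if num_layers == 1:
        return [start]
    step = (stop - start) / (num_layers - 1)
    return [start + i * step for i in range(num_layers)]


def cascade_weights(boxes_per_layer: Sequence[Box], pseudo_label: Box,
                    thresholds: Sequence[float]) -> List[float]:
    """Assumed hard gate: a layer's denoising loss counts (weight 1.0) only
    if its box overlaps the pseudo-label at or above that layer's threshold."""
    return [1.0 if iou(box, pseudo_label) >= t else 0.0
            for box, t in zip(boxes_per_layer, thresholds)]


if __name__ == "__main__":
    thresholds = layer_thresholds(3)
    label = (0.0, 0.0, 10.0, 10.0)
    # The same moderately accurate box at every layer (IoU 0.6 with the label).
    boxes = [(0.0, 0.0, 10.0, 6.0)] * 3
    print(thresholds, cascade_weights(boxes, label, thresholds))
```

Under these assumptions, a box with IoU 0.6 contributes to the loss at the first layer only, so deeper layers are trained only on matches that agree closely with the pseudo-label. A soft weight (for example, proportional to IoU above the threshold) would be an equally plausible reading of "dynamically adjusts training weights".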
Related papers
- ERGO: Excess-Risk-Guided Optimization for High-Fidelity Monocular 3D Gaussian Splatting [63.138778159026934]
We propose an adaptive optimization framework guided by excess risk decomposition, termed ERGO. ERGO dynamically estimates the view-specific excess risk and adaptively adjusts loss weights during optimization. Experiments on the Google Scanned Objects dataset and the OmniObject3D dataset demonstrate the superiority of ERGO over existing state-of-the-art methods.
arXiv Detail & Related papers (2026-02-10T20:44:43Z) - EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model [56.53617289548353]
EchoGen is a pioneering framework that empowers Visual Auto-Regressive (VAR) models with subject-driven generation capabilities. We employ a semantic encoder to extract the subject's abstract identity, which is injected through decoupled cross-attention to guide the overall composition. To the best of our knowledge, EchoGen is the first feed-forward subject-driven framework built upon VAR models.
arXiv Detail & Related papers (2025-09-30T11:45:48Z) - Quality-Aware Language-Conditioned Local Auto-Regressive Anomaly Synthesis and Detection [30.77558600436759]
ARAS is a language-conditioned, auto-regressive anomaly synthesis approach. It injects local, text-specified defects into normal images via token-anchored latent editing. It significantly enhances defect realism, preserves fine-grained material textures, and provides continuous semantic control over synthesized anomalies.
arXiv Detail & Related papers (2025-08-05T15:07:32Z) - MonoVQD: Monocular 3D Object Detection with Variational Query Denoising and Self-Distillation [0.6144680854063939]
We introduce MonoVQD, a novel framework designed to advance DETR-based monocular 3D detection. A Mask Separated Self-Attention mechanism enables the integration of the denoising process into a DETR architecture. We present the Variational Query Denoising technique to address the vanishing problem of conventional denoising methods. Rigorous experimentation demonstrates that MonoVQD achieves superior performance on the challenging KITTI monocular benchmark.
arXiv Detail & Related papers (2025-06-14T14:49:12Z) - MathPhys-Guided Coarse-to-Fine Anomaly Synthesis with SQE-Driven Bi-Level Optimization for Anomaly Detection [30.77558600436759]
We introduce a novel and lightweight pipeline that generates synthetic anomalies through Math-Phys model guidance. Our method produces realistic defect masks, which are subsequently enhanced in two phases. To validate our method, we conduct experiments on three anomaly detection benchmarks: MVTec AD, VisA, and BTAD.
arXiv Detail & Related papers (2025-04-17T14:22:27Z) - FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [92.4205087439928]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability. We propose the Self-supervised Transfer (PST) and the Frequency-Decoupled Fusion module (FreDF). PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models, effectively mitigating data scarcity. FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches. This combined approach enables FUSE to construct a universal image-event that only requires lightweight decoder adaptation for target datasets.
arXiv Detail & Related papers (2025-03-25T15:04:53Z) - Scalable Benchmarking and Robust Learning for Noise-Free Ego-Motion and 3D Reconstruction from Noisy Video [30.89206445146674]
We aim to redefine robust ego-motion estimation and photorealistic 3D reconstruction by addressing a critical limitation: reliance on noise-free data. We tackle three core challenges: scalable data generation, comprehensive robustness, and model enhancement. We create Robust-Ego3D, a benchmark rigorously designed to expose noise-induced performance degradation.
arXiv Detail & Related papers (2025-01-24T08:25:48Z) - Hardness-Aware Scene Synthesis for Semi-Supervised 3D Object Detection [59.33188668341604]
3D object detection serves as the fundamental task of autonomous driving perception.
It is costly to obtain high-quality annotations for point cloud data.
We propose a hardness-aware scene synthesis (HASS) method to generate adaptive synthetic scenes.
arXiv Detail & Related papers (2024-05-27T17:59:23Z) - Synthetic Data Supervised Salient Object Detection [40.991558165686136]
We propose a novel yet effective method for SOD, coined SODGAN, which can generate infinite high-quality image-mask pairs.
For the first time, our SODGAN tackles SOD with synthetic data directly generated from the generative model.
Our approach achieves a new SOTA performance in semi/weakly-supervised methods, and even outperforms several fully-supervised SOTA methods.
arXiv Detail & Related papers (2022-10-25T08:36:29Z) - Secrets of 3D Implicit Object Shape Reconstruction in the Wild [92.5554695397653]
Reconstructing high-fidelity 3D objects from sparse, partial observation is crucial for various applications in computer vision, robotics, and graphics.
Recent neural implicit modeling methods show promising results on synthetic or dense datasets.
However, they perform poorly on real-world data that is sparse and noisy.
This paper analyzes the root cause of such deficient performance of a popular neural implicit model.
arXiv Detail & Related papers (2021-01-18T03:24:48Z) - Attention Based Real Image Restoration [48.933507352496726]
Deep convolutional neural networks perform better on images containing synthetic degradations.
This paper proposes a novel single-stage blind real image restoration network (R$^2$Net).
arXiv Detail & Related papers (2020-04-26T04:21:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.