CLIP-Joint-Detect: End-to-End Joint Training of Object Detectors with Contrastive Vision-Language Supervision
- URL: http://arxiv.org/abs/2512.22969v1
- Date: Sun, 28 Dec 2025 15:21:20 GMT
- Title: CLIP-Joint-Detect: End-to-End Joint Training of Object Detectors with Contrastive Vision-Language Supervision
- Authors: Behnam Raoufi, Hossein Sharify, Mohamad Mahdee Ramezanee, Khosrow Hajsadeghi, Saeed Bagheri Shouraki
- Abstract summary: We propose CLIP-Joint-Detect, a framework that integrates CLIP-style contrastive vision-language supervision through end-to-end joint training. A lightweight parallel head projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings via an InfoNCE contrastive loss and an auxiliary cross-entropy term. We validate it on Pascal VOC 2007+2012 using Faster R-CNN and on the large-scale MS COCO 2017 benchmark using modern YOLO detectors (YOLOv11).
- Score: 0.08699280339422537
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conventional object detectors rely on cross-entropy classification, which can be vulnerable to class imbalance and label noise. We propose CLIP-Joint-Detect, a simple and detector-agnostic framework that integrates CLIP-style contrastive vision-language supervision through end-to-end joint training. A lightweight parallel head projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings via InfoNCE contrastive loss and an auxiliary cross-entropy term, while all standard detection losses are optimized simultaneously. The approach applies seamlessly to both two-stage and one-stage architectures. We validate it on Pascal VOC 2007+2012 using Faster R-CNN and on the large-scale MS COCO 2017 benchmark using modern YOLO detectors (YOLOv11), achieving consistent and substantial improvements while preserving real-time inference speed. Extensive experiments and ablations demonstrate that joint optimization with learnable text embeddings markedly enhances closed-set detection performance across diverse architectures and datasets.
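To make the recipe concrete, here is a minimal PyTorch-style sketch of the parallel alignment head and joint objective described in the abstract. It is an illustrative reconstruction, not the authors' released code: the module names, dimensions, temperature initialization, and the loss weights `lam`/`mu` are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPAlignmentHead(nn.Module):
    """Hypothetical parallel alignment head (a sketch, not the authors'
    implementation): projects detector region/grid features into the CLIP
    embedding space and scores them against learnable class-specific
    text embeddings."""

    def __init__(self, feat_dim: int, clip_dim: int = 512, num_classes: int = 80):
        super().__init__()
        # Lightweight projection into the CLIP embedding space.
        self.proj = nn.Linear(feat_dim, clip_dim)
        # Learnable per-class text embeddings; in practice these could be
        # initialized from CLIP text-encoder outputs for the class names.
        self.text_emb = nn.Parameter(torch.randn(num_classes, clip_dim) * 0.02)
        # Learnable temperature, as in CLIP (initial value 1/0.07 is assumed).
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1.0 / 0.07)))
        # Plain linear classifier for the auxiliary cross-entropy term.
        self.aux_cls = nn.Linear(clip_dim, num_classes)

    def forward(self, feats: torch.Tensor):
        # feats: (N, feat_dim) pooled region features (two-stage detector)
        # or grid-cell features (one-stage detector).
        z = self.proj(feats)                                # (N, clip_dim)
        v = F.normalize(z, dim=-1)
        t = F.normalize(self.text_emb, dim=-1)              # (C, clip_dim)
        align_logits = self.logit_scale.exp() * v @ t.t()   # (N, C) cosine logits
        aux_logits = self.aux_cls(z)                        # (N, C)
        return align_logits, aux_logits

def joint_loss(align_logits, aux_logits, labels, det_loss,
               lam: float = 1.0, mu: float = 0.5):
    """Total objective: the standard detection losses plus the contrastive
    alignment term and the auxiliary cross-entropy term; lam and mu are
    illustrative weights."""
    # InfoNCE over classes: the matched text embedding is the positive and
    # all other class embeddings act as negatives, which reduces to
    # cross-entropy over the temperature-scaled cosine similarities.
    info_nce = F.cross_entropy(align_logits, labels)
    aux_ce = F.cross_entropy(aux_logits, labels)
    return det_loss + lam * info_nce + mu * aux_ce
```

Because the head runs in parallel and only adds a linear projection plus a fixed set of class embeddings, the extra per-image cost is small, which is consistent with the paper's claim that real-time inference speed is preserved.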
Related papers
- Hierarchical Self-Supervised Representation Learning for Depression Detection from Speech [51.14752758616364]
Speech-based depression detection (SDD) is a promising, non-invasive alternative to traditional clinical assessments. We propose HAREN-CTC, a novel architecture that integrates multi-layer SSL features using cross-attention within a multitask learning framework. The model achieves state-of-the-art macro F1-scores of 0.81 on DAIC-WOZ and 0.82 on MODMA, outperforming prior methods across both evaluation scenarios.
arXiv Detail & Related papers (2025-10-05T09:32:12Z) - CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment [28.2773807732662]
Large-scale natural image-text datasets often suffer from loose semantic alignment due to weak supervision. We propose CLIPin, a unified non-contrastive plug-in that can be seamlessly integrated into CLIP-style architectures. Two shared robustness pre-projectors are designed for the image and text modalities respectively to facilitate the integration of contrastive and non-contrastive learning.
arXiv Detail & Related papers (2025-08-08T16:23:05Z) - Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection [25.349261412750586]
This study introduces FiSeCLIP for zero-shot anomaly detection (ZSAD) with training-free CLIP, combining feature matching with cross-modal alignment. Our approach exhibits superior performance for both anomaly classification and segmentation on anomaly detection benchmarks.
arXiv Detail & Related papers (2025-07-15T05:42:17Z) - Refining CLIP's Spatial Awareness: A Visual-Centric Perspective [10.936397225984107]
Contrastive Language-Image Pre-training excels in global alignment with language but exhibits limited sensitivity to spatial information. Recent approaches have introduced Region-Language Alignment to enhance CLIP's performance in dense multimodal tasks. We propose the Spatial Correlation Distillation (SCD) framework, which preserves CLIP's inherent spatial structure and mitigates the above degradation.
arXiv Detail & Related papers (2025-04-03T07:04:56Z) - Hybrid Multi-Stage Learning Framework for Edge Detection: A Survey [0.0]
This paper introduces a Hybrid Multi-Stage Learning Framework that integrates Convolutional Neural Network (CNN) feature extraction with a Support Vector Machine (SVM) classifier. Our approach decouples the feature representation and classification stages, enhancing robustness and interpretability.
arXiv Detail & Related papers (2025-03-26T13:06:31Z) - C2P-CLIP: Injecting Category Common Prompt in CLIP to Enhance Generalization in Deepfake Detection [98.34703790782254]
We introduce Category Common Prompt CLIP (C2P-CLIP), which integrates a category common prompt into the text encoder to inject category-related concepts into the image encoder. Our method achieves a 12.41% improvement in detection accuracy compared to the original CLIP, without introducing additional parameters during testing.
arXiv Detail & Related papers (2024-08-19T02:14:25Z) - GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning [55.77244064907146]
One-stage detector GridCLIP learns grid-level representations to adapt to the intrinsic principle of one-stage detection learning.
Experiments show that the learned CLIP-based grid-level representations boost the performance of undersampled (infrequent and novel) categories.
arXiv Detail & Related papers (2023-03-16T12:06:02Z) - The CLEAR Benchmark: Continual LEArning on Real-World Imagery [77.98377088698984]
Continual learning (CL) is widely regarded as a crucial challenge for lifelong AI.
We introduce CLEAR, the first continual image classification benchmark dataset with a natural temporal evolution of visual concepts.
We find that a simple unsupervised pre-training step can already boost state-of-the-art CL algorithms.
arXiv Detail & Related papers (2022-01-17T09:09:09Z) - Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z) - One-Shot Object Detection without Fine-Tuning [62.39210447209698]
We introduce a two-stage model consisting of a first-stage Matching-FCOS network and a second-stage Structure-Aware Relation Module.
We also propose novel training strategies that effectively improve detection performance.
Our method exceeds the state-of-the-art one-shot performance consistently on multiple datasets.
arXiv Detail & Related papers (2020-05-08T01:59:23Z)