DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection
- URL: http://arxiv.org/abs/2503.09271v1
- Date: Wed, 12 Mar 2025 11:15:34 GMT
- Title: DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection
- Authors: Chiara Cappellino, Gianluca Mancusi, Matteo Mosconi, Angelo Porrello, Simone Calderara, Rita Cucchiara
- Abstract summary: DitHub is a framework designed to create and manage a library of efficient adaptation modules. Inspired by Version Control Systems, DitHub organizes expert modules like branches that can be fetched and merged as needed. Our approach achieves state-of-the-art performance on the ODinW-13 benchmark and ODinW-O, a newly introduced benchmark.
- Score: 32.77455136447568
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-Vocabulary object detectors can recognize a wide range of categories using simple textual prompts. However, improving their ability to detect rare classes or specialize in certain domains remains a challenge. While most recent methods rely on a single set of model weights for adaptation, we take a different approach by using modular deep learning. We introduce DitHub, a framework designed to create and manage a library of efficient adaptation modules. Inspired by Version Control Systems, DitHub organizes expert modules like branches that can be fetched and merged as needed. This modular approach enables a detailed study of how adaptation modules combine, making it the first method to explore this aspect in Object Detection. Our approach achieves state-of-the-art performance on the ODinW-13 benchmark and ODinW-O, a newly introduced benchmark designed to evaluate how well models adapt when previously seen classes reappear. For more details, visit our project page: https://aimagelab.github.io/DitHub/
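The branch/fetch/merge mechanics are easiest to see in code. The sketch below is a minimal, hypothetical illustration of the idea rather than DitHub's actual API: each "branch" stores a LoRA-style adapter state for one class or domain, and merging simply averages parameters across branches (the names `AdapterLibrary`, `commit`, `fetch`, and `merge` are assumptions for this example).

```python
import torch

class AdapterLibrary:
    """A library of per-class adaptation modules managed like VCS branches.
    Hypothetical sketch only; each "branch" is a dict of LoRA-style tensors."""

    def __init__(self):
        self.branches = {}  # branch name -> {param_name: tensor}

    def commit(self, name, adapter_state):
        """Store (or update) an expert module under a branch name."""
        self.branches[name] = {k: v.clone() for k, v in adapter_state.items()}

    def fetch(self, name):
        """Retrieve the expert module for one class or domain."""
        return self.branches[name]

    def merge(self, names):
        """Combine several branches by averaging their parameters."""
        states = [self.branches[n] for n in names]
        return {k: torch.stack([s[k] for s in states]).mean(dim=0)
                for k in states[0]}

# Usage: commit two single-domain experts, then merge them for joint inference.
lib = AdapterLibrary()
lib.commit("aquarium", {"lora_A": torch.randn(8, 256), "lora_B": torch.randn(256, 8)})
lib.commit("thermal", {"lora_A": torch.randn(8, 256), "lora_B": torch.randn(256, 8)})
joint = lib.merge(["aquarium", "thermal"])
```

Parameter averaging is only one possible merge operator; the point of the pattern is that per-class experts stay isolated until they are explicitly combined.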
Related papers
- OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection [86.30994231610651]
Temporal action detection (TAD) is a fundamental video understanding task that aims to identify human actions and localize their temporal boundaries in videos.
We propose OpenTAD, a unified and modular framework that consolidates 16 different TAD methods and 9 standard datasets.
Minimal effort is required to replace one module with a different design, train a feature-based TAD model in end-to-end mode, or switch between the two.
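A config-driven registry is the usual pattern behind this kind of plug-and-play design; the sketch below shows the pattern in isolation and is not OpenTAD's actual API (`REGISTRY`, `register`, `build`, and `SimpleHead` are assumed names).

```python
# Minimal module registry: swapping a component means changing one config string.
REGISTRY = {}

def register(name):
    def wrap(cls):
        REGISTRY[name] = cls
        return cls
    return wrap

@register("simple_head")
class SimpleHead:
    def __init__(self, num_classes):
        self.num_classes = num_classes

def build(cfg):
    # Instantiate whatever class the config names, passing the rest as kwargs.
    kwargs = {k: v for k, v in cfg.items() if k != "type"}
    return REGISTRY[cfg["type"]](**kwargs)

head = build({"type": "simple_head", "num_classes": 20})
```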
arXiv Detail & Related papers (2025-02-27T18:32:27Z)
- Towards Compatible Fine-tuning for Vision-Language Model Updates [114.25776195225494]
Class-conditioned Context Optimization (ContCoOp) integrates learnable prompts with class embeddings using an attention layer before inputting them into the text encoder. Experiments over 15 datasets show that ContCoOp achieves the highest compatibility among the baseline methods and exhibits robust out-of-distribution generalization.
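A rough PyTorch sketch of that mechanism, with assumed shapes and names (`ClassConditionedPrompts` is illustrative, not the paper's implementation): learnable context tokens attend over a class embedding, and the conditioned tokens are what the text encoder would consume.

```python
import torch
import torch.nn as nn

class ClassConditionedPrompts(nn.Module):
    """Sketch: learnable prompt tokens attend to class embeddings
    before being passed to the text encoder. Dimensions are assumed."""

    def __init__(self, n_ctx=4, dim=512, n_heads=8):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # learnable prompts
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, class_emb):  # class_emb: (n_classes, dim)
        n_classes = class_emb.size(0)
        q = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)  # one prompt set per class
        kv = class_emb.unsqueeze(1)                          # (n_classes, 1, dim)
        cond_ctx, _ = self.attn(q, kv, kv)                   # prompts conditioned on class
        # Concatenate conditioned prompts with the class token for the text encoder.
        return torch.cat([cond_ctx, kv], dim=1)              # (n_classes, n_ctx + 1, dim)

prompts = ClassConditionedPrompts()
tokens = prompts(torch.randn(10, 512))  # 10 classes -> shape (10, 5, 512)
```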
arXiv Detail & Related papers (2024-12-30T12:06:27Z)
- GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs [64.49176353858792]
We propose generative neuro-symbolic visual reasoning by growing and reusing modules.
The proposed model performs competitively on standard tasks like visual question answering and referring expression comprehension.
It is able to adapt to new visual reasoning tasks by observing a few training examples and reusing modules.
arXiv Detail & Related papers (2023-11-08T18:59:05Z)
- FILM: How can Few-Shot Image Classification Benefit from Pre-Trained Language Models? [14.582209994281374]
Few-shot learning aims to train models that can be generalized to novel classes with only a few samples.
We propose a novel few-shot learning framework that uses pre-trained language models based on contrastive learning.
arXiv Detail & Related papers (2023-07-09T08:07:43Z)
- ModuleFormer: Modularity Emerges from Mixture-of-Experts [60.6148988099284]
This paper proposes a new neural network architecture, ModuleFormer, to improve the efficiency and flexibility of large language models.
Unlike the previous SMoE-based modular language model, ModuleFormer can induce modularity from uncurated data.
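For reference, the sketch below is a generic sparse mixture-of-experts layer: a learned router picks the top-k expert MLPs per token and mixes their outputs by the routing weights. It illustrates the SMoE mechanism only, not ModuleFormer's specific architecture or its load-balancing objectives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Generic sparse mixture-of-experts layer (illustrative sketch)."""

    def __init__(self, dim=256, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):  # x: (tokens, dim)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):  # only k experts run per token
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

moe = SparseMoE()
y = moe(torch.randn(16, 256))  # (16, 256)
```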
arXiv Detail & Related papers (2023-06-07T17:59:57Z)
- Modular Deep Learning [120.36599591042908]
Transfer learning has recently become the dominant paradigm of machine learning.
It remains unclear how to develop models that specialise towards multiple tasks without incurring negative interference.
Modular deep learning has emerged as a promising solution to these challenges.
arXiv Detail & Related papers (2023-02-22T18:11:25Z)
- FrOoDo: Framework for Out-of-Distribution Detection [1.3270838622986498]
FrOoDo is an easy-to-use framework for Out-of-Distribution detection tasks in digital pathology.
It can be used with PyTorch classification and segmentation models, and its modular design allows for easy extension.
arXiv Detail & Related papers (2022-08-01T16:11:21Z)
- MM-FSOD: Meta and metric integrated few-shot object detection [14.631208179789583]
We present an effective object detection framework (MM-FSOD) that integrates metric learning and meta-learning.
Our model is a class-agnostic detection model that can accurately recognize new categories that do not appear in the training samples.
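The metric-learning side of such a detector can be sketched with class prototypes: mean support embeddings serve as class centers, and query region features are matched to the nearest center. This is a generic prototypical-classification sketch under assumed shapes, not MM-FSOD's actual head.

```python
import torch
import torch.nn.functional as F

def prototype_classify(support_feats, support_labels, query_feats):
    """Assign each query feature to the class whose prototype
    (mean support embedding) is most cosine-similar."""
    classes = support_labels.unique()
    protos = torch.stack([support_feats[support_labels == c].mean(0) for c in classes])
    sims = F.normalize(query_feats, dim=-1) @ F.normalize(protos, dim=-1).T
    return classes[sims.argmax(dim=-1)]

# 2 classes x 5 shots of support features, 3 query RoI features (assumed sizes).
support = torch.randn(10, 128)
labels = torch.tensor([0] * 5 + [1] * 5)
preds = prototype_classify(support, labels, torch.randn(3, 128))
```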
arXiv Detail & Related papers (2020-12-30T14:02:52Z)
- UniT: Unified Knowledge Transfer for Any-shot Object Detection and Segmentation [52.487469544343305]
Methods for object detection and segmentation rely on large-scale instance-level annotations for training.
We propose an intuitive and unified semi-supervised model that is applicable to a range of supervision levels.
arXiv Detail & Related papers (2020-06-12T22:45:47Z)