A Self-Supervised Approach for Enhanced Feature Representations in Object Detection Tasks
- URL: http://arxiv.org/abs/2602.16322v1
- Date: Wed, 18 Feb 2026 10:02:30 GMT
- Title: A Self-Supervised Approach for Enhanced Feature Representations in Object Detection Tasks
- Authors: Santiago C. Vilabella, Pablo Pérez-Núñez, Beatriz Remeseiro
- Abstract summary: This research work aims to demonstrate that enhancing feature extractors can substantially alleviate this challenge. We present a model trained on unlabeled data that outperforms state-of-the-art feature extractors pre-trained on ImageNet.
- Score: 1.433758865948252
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the fast-evolving field of artificial intelligence, where models are increasingly growing in complexity and size, the availability of labeled data for training deep learning models has become a significant challenge. Addressing complex problems like object detection demands considerable time and resources for data labeling to achieve meaningful results. For companies developing such applications, this entails extensive investment in highly skilled personnel or costly outsourcing. This research work aims to demonstrate that enhancing feature extractors can substantially alleviate this challenge, enabling models to learn more effective representations with less labeled data. Utilizing a self-supervised learning strategy, we present a model trained on unlabeled data that outperforms state-of-the-art feature extractors pre-trained on ImageNet and particularly designed for object detection tasks. Moreover, the results demonstrate that our approach encourages the model to focus on the most relevant aspects of an object, thus achieving better feature representations and, therefore, reinforcing its reliability and robustness.
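The abstract does not specify which self-supervised objective is used, so the following is only an illustrative sketch of one common strategy: a contrastive (SimCLR-style) NT-Xent loss that trains a feature extractor on unlabeled images by pulling embeddings of two augmented views of the same image together while pushing other images apart. The function name and NumPy formulation are ours, not from the paper.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) contrastive loss.

    z1, z2: (N, D) arrays of embeddings for two augmented views of the
    same N images; row i of z1 and row i of z2 form a positive pair.
    """
    z = np.concatenate([z1, z2], axis=0)                # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # unit-normalize -> cosine sim
    sim = z @ z.T / temperature                         # (2N, 2N) similarity matrix
    np.fill_diagonal(sim, -np.inf)                      # never treat self as a candidate
    n = z1.shape[0]
    # the positive for row i is its other augmented view, n rows away
    pos_idx = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos_idx].mean()
```

Because the loss needs only two augmentations of each image and no labels, the extractor can be pretrained on unlabeled data and later fine-tuned on a small labeled detection set, which is the general setting the abstract describes.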
Related papers
- A Survey on Efficient Vision-Language-Action Models [153.11669266922993]
Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction.
Motivated by the urgent need to address these challenges, this survey presents the first comprehensive review of Efficient Vision-Language-Action models.
arXiv Detail & Related papers (2025-10-27T17:57:33Z) - ToolACE-DEV: Self-Improving Tool Learning via Decomposition and EVolution [77.86222359025011]
We propose ToolACE-DEV, a self-improving framework for tool learning.
First, we decompose the tool-learning objective into sub-tasks that enhance basic tool-making and tool-using abilities.
We then introduce a self-evolving paradigm that allows lightweight models to self-improve, reducing reliance on advanced LLMs.
arXiv Detail & Related papers (2025-05-12T12:48:30Z) - A Survey on Remote Sensing Foundation Models: From Vision to Multimodality [35.532200523631765]
Vision and multimodal foundation models for remote sensing have significantly improved the capabilities of intelligent geospatial data interpretation.
The diversity in data types, the need for large-scale annotated datasets, and the complexity of multimodal fusion techniques pose significant obstacles to the effective deployment of these models.
This paper provides a review of the state-of-the-art in vision and multimodal foundation models for remote sensing, focusing on their architecture, training methods, datasets and application scenarios.
arXiv Detail & Related papers (2025-03-28T01:57:35Z) - Vision Foundation Models in Remote Sensing: A Survey [6.036426846159163]
Foundation models are large-scale, pre-trained AI models capable of performing a wide array of tasks with unprecedented accuracy and efficiency.
This survey aims to serve as a resource for researchers and practitioners by providing a panorama of advances and promising pathways for continued development and application of foundation models in remote sensing.
arXiv Detail & Related papers (2024-08-06T22:39:34Z) - A Simple Background Augmentation Method for Object Detection with Diffusion Model [53.32935683257045]
In computer vision, it is well-known that a lack of data diversity will impair model performance.
We propose a simple yet effective data augmentation approach by leveraging advancements in generative models.
Background augmentation, in particular, significantly improves the models' robustness and generalization capabilities.
arXiv Detail & Related papers (2024-08-01T07:40:00Z) - On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning [33.89483627891117]
Recent advancements in language and vision assistants have showcased impressive capabilities but suffer from a lack of transparency.
Open-source models handle general image tasks effectively, but face challenges with the high computational demands of complex visually-situated text understanding.
This study aims to redefine the design of vision-language models by identifying key components and creating efficient models with constrained inference costs.
arXiv Detail & Related papers (2024-06-17T17:57:30Z) - Towards In-Vehicle Multi-Task Facial Attribute Recognition: Investigating Synthetic Data and Vision Foundation Models [8.54530542456452]
We investigate the utility of synthetic datasets for training complex multi-task models that recognize facial attributes of passengers of a vehicle.
Our study unveils counter-intuitive findings, notably the superior performance of ResNet over ViTs in our specific multi-task context.
arXiv Detail & Related papers (2024-03-10T04:17:54Z) - Information-Theoretic Odometry Learning [83.36195426897768]
We propose a unified information theoretic framework for learning-motivated methods aimed at odometry estimation.
The proposed framework provides an elegant tool for performance evaluation and understanding in information-theoretic language.
arXiv Detail & Related papers (2022-03-11T02:37:35Z) - Reinforcement Learning for Sparse-Reward Object-Interaction Tasks in a First-person Simulated 3D Environment [73.9469267445146]
First-person object-interaction tasks in high-fidelity, simulated 3D environments such as AI2Thor pose significant sample-efficiency challenges for reinforcement learning agents.
We show that one can learn object-interaction tasks from scratch without supervision by learning an attentive object-model as an auxiliary task.
arXiv Detail & Related papers (2020-10-28T19:27:26Z) - Goal-Aware Prediction: Learning to Model What Matters [105.43098326577434]
One of the fundamental challenges in using a learned forward dynamics model is the mismatch between the objective of the learned model and that of the downstream planner or policy.
We propose to direct prediction towards task relevant information, enabling the model to be aware of the current task and encouraging it to only model relevant quantities of the state space.
We find that our method more effectively models the relevant parts of the scene conditioned on the goal, and as a result outperforms standard task-agnostic dynamics models and model-free reinforcement learning.
arXiv Detail & Related papers (2020-07-14T16:42:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.