Fine-Tuning Florence2 for Enhanced Object Detection in Un-constructed Environments: Vision-Language Model Approach
- URL: http://arxiv.org/abs/2503.04918v4
- Date: Tue, 15 Apr 2025 09:19:42 GMT
- Title: Fine-Tuning Florence2 for Enhanced Object Detection in Un-constructed Environments: Vision-Language Model Approach
- Authors: Aysegul Ucar, Soumyadeep Ro, Sanapala Satwika, Pamarthi Yasoda Gayathri, Mohmmad Ghaith Balsha,
- Abstract summary: We fine-tuned the Florence2 model for object detection tasks in non-constructed, complex environments.<n> optimized Florence2 models exhibited significant improvements in object detection accuracy.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision-Language Models (VLMs) have emerged as powerful tools in artificial intelli-gence, capable of integrating textual and visual data for a unified understanding of complex scenes. While models such as Florence2, built on transformer architectures, have shown promise across general tasks, their performance in object detection within unstructured or cluttered environments remains underexplored. In this study, we fi-ne-tuned the Florence2 model for object detection tasks in non-constructed, complex environments. A comprehensive experimental framework was established involving multiple hardware configurations (NVIDIA T4, L4, and A100 GPUs), optimizers (AdamW, SGD), and varied hyperparameters including learning rates and LoRA (Low-Rank Adaptation) setups. Model training and evaluation were conducted on challenging datasets representative of real-world, disordered settings. The optimized Florence2 models exhibited significant improvements in object detection accuracy, with Mean Average Precision (mAP) metrics approaching or matching those of estab-lished models such as YOLOv8, YOLOv9, and YOLOv10. The integration of LoRA and careful fine-tuning of transformer layers contributed notably to these gains. Our find-ings highlight the adaptability of transformer-based VLMs like Florence2 for do-main-specific tasks, particularly in visually complex environments. The study under-scores the potential of fine-tuned VLMs to rival traditional convolution-based detec-tors, offering a flexible and scalable approach for advanced vision applications in re-al-world, unstructured settings.
Related papers
- Adaptive Dual-Weighting Framework for Federated Learning via Out-of-Distribution Detection [53.45696787935487]
Federated Learning (FL) enables collaborative model training across large-scale distributed service nodes.<n>In real-world service-oriented deployments, data generated by heterogeneous users, devices, and application scenarios are inherently non-IID.<n>We propose FLood, a novel FL framework inspired by out-of-distribution (OOD) detection.
arXiv Detail & Related papers (2026-02-01T05:54:59Z) - Semantic-Guided Natural Language and Visual Fusion for Cross-Modal Interaction Based on Tiny Object Detection [6.895355763564631]
This paper introduces a cutting-edge approach to cross-modal interaction for tiny object detection by combining semantic-guided natural language processing with advanced visual recognition backbones.<n>The proposed method integrates the BERT language model with the CNN-based Parallel Residual Bi-Fusion Feature Pyramid Network.<n>By employing lemmatization and fine-tuning techniques, the system aligns semantic cues from textual inputs with visual features, enhancing detection precision for small and complex objects.
arXiv Detail & Related papers (2025-11-07T18:38:00Z) - RT-DETRv4: Painlessly Furthering Real-Time Object Detection with Vision Foundation Models [48.91205564876609]
We propose a cost-effective and highly adaptable distillation framework to enhance lightweight object detectors.<n>Our approach painlessly delivers striking and consistent performance gains across diverse DETR-based models.<n>Our new model family, RT-DETRv4, achieves state-of-the-art results on COCO, attaining AP scores of 49.7/53.5/55.4/57.0 at corresponding speeds of 273/169/124/78 FPS.
arXiv Detail & Related papers (2025-10-29T08:13:17Z) - Performance Optimization of YOLO-FEDER FusionNet for Robust Drone Detection in Visually Complex Environments [0.981328290471248]
Drone detection in visually complex environments is challenging due to background clutter, small object scale, and camouflage effects.<n>This work presents an enhanced iteration of YOLO-FEDER FusionNet -- a detection framework that integrates generic object detection with camouflage object detection techniques.
arXiv Detail & Related papers (2025-09-17T14:21:00Z) - Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding [31.57375084036447]
Vision Language Models (VLMs) have recently achieved significant progress in bridging visual perception and linguistic reasoning.<n>We propose LASER, a self-evolving framework that progressively endows VLMs with multi-step perception capabilities.<n>Our approach integrates Monte Carlo quality estimation with Intersection-over-Union (IoU)-based region quality evaluation to jointly encourage both accuracy and diversity in constructing high-quality preference data.
arXiv Detail & Related papers (2025-09-04T14:17:01Z) - Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments [70.42705564227548]
We propose an automated environment construction pipeline for large language models (LLMs)<n>This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools.<n>We also introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution.
arXiv Detail & Related papers (2025-08-12T09:45:19Z) - Robust Federated Learning on Edge Devices with Domain Heterogeneity [13.362209980631876]
Federated Learning (FL) allows collaborative training while ensuring data privacy across distributed edge devices.<n>We introduce a new framework to address this challenge by improving the generalization ability of the FL global model.<n>We introduce FedAPC, a prototype-based FL framework designed to enhance feature diversity and model robustness.
arXiv Detail & Related papers (2025-05-15T09:53:14Z) - Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models [0.6715525121432597]
This research presents a novel vision language model (VLM) framework to enhance feature extraction, scalability, and efficiency.
We evaluate the proposed model on caption generation and Visual Question Answering (VQA) tasks using benchmark datasets with varying levels of Gaussian noise.
Our model provides more detailed and contextually relevant responses, particularly for real-world images captured by a RealSense camera mounted on an Unmanned Ground Vehicle (UGV)
arXiv Detail & Related papers (2025-03-08T01:22:10Z) - Underlying Semantic Diffusion for Effective and Efficient In-Context Learning [113.4003355229632]
Underlying Semantic Diffusion (US-Diffusion) is an enhanced diffusion model that boosts underlying semantics learning, computational efficiency, and in-context learning capabilities.<n>We present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details.<n>We also propose a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at time steps with high-noise levels.
arXiv Detail & Related papers (2025-03-06T03:06:22Z) - Efficient Feature Fusion for UAV Object Detection [9.632727117779178]
Small objects, in particular, occupy small portions of images, making their accurate detection difficult.<n>Existing multi-scale feature fusion methods address these challenges by aggregating features across different resolutions.<n>We propose a novel feature fusion framework specifically designed for UAV object detection tasks.
arXiv Detail & Related papers (2025-01-29T20:39:16Z) - Efficient Oriented Object Detection with Enhanced Small Object Recognition in Aerial Images [2.9138705529771123]
We present a novel enhancement to the YOLOv8 model, tailored for oriented object detection tasks.
Our model features a wavelet transform-based C2f module for capturing associative features and an Adaptive Scale Feature Pyramid (ASFP) module that leverages P2 layer details.
Our approach provides a more efficient architectural design than DecoupleNet, which has 23.3M parameters, all while maintaining detection accuracy.
arXiv Detail & Related papers (2024-12-17T05:45:48Z) - Synthetic Vision: Training Vision-Language Models to Understand Physics [9.474337395173388]
We propose two methods to enhance Vision-Language Models' physical reasoning capabilities using simulated data.<n>First, we fine-tune a pre-trained VLM using question-answer pairs generated from simulations relevant to physical reasoning tasks.<n>Second, we introduce Physics Context Builders (PCBs) to create scene descriptions enriched with physical properties and processes.
arXiv Detail & Related papers (2024-12-11T18:40:16Z) - LaVin-DiT: Large Vision Diffusion Transformer [99.98106406059333]
LaVin-DiT is a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework.<n>We introduce key innovations to optimize generative performance for vision tasks.<n>The model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks.
arXiv Detail & Related papers (2024-11-18T12:05:27Z) - Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z) - YOLOv11: An Overview of the Key Architectural Enhancements [0.5639904484784127]
The paper explores YOLOv11's expanded capabilities across various computer vision tasks, including object detection, instance segmentation, pose estimation, and oriented object detection (OBB)
We review the model's performance improvements in terms of mean Average Precision (mAP) and computational efficiency compared to its predecessors, with a focus on the trade-off between parameter count and accuracy.
Our research provides insights into YOLOv11's position within the broader landscape of object detection and its potential impact on real-time computer vision applications.
arXiv Detail & Related papers (2024-10-23T09:55:22Z) - Transformers and Slot Encoding for Sample Efficient Physical World Modelling [1.5498250598583487]
We propose an architecture combining Transformers for world modelling with the slot-attention paradigm, an approach for learning representations of objects appearing in a scene.
We describe the resulting neural architecture and report experimental results showing an improvement over the existing solutions in terms of sample efficiency and a reduction of the variation of the performance over the training examples.
arXiv Detail & Related papers (2024-05-30T15:48:04Z) - BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation [57.40024206484446]
We introduce the BEHAVIOR Vision Suite (BVS), a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models.
BVS supports a large number of adjustable parameters at the scene level.
We showcase three example application scenarios.
arXiv Detail & Related papers (2024-05-15T17:57:56Z) - Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z) - VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for PVM finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x less training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z) - Florence-2: Advancing a Unified Representation for a Variety of Vision
Tasks [94.49801814314435]
We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks.
We co-developed FLD-5B that consists of 5.4 billion comprehensive visual annotations on 126 million images.
We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks.
arXiv Detail & Related papers (2023-11-10T18:59:08Z) - Handling Data Heterogeneity via Architectural Design for Federated
Visual Recognition [16.50490537786593]
We study 19 visual recognition models from five different architectural families on four challenging FL datasets.
Our findings emphasize the importance of architectural design for computer vision tasks in practical scenarios.
arXiv Detail & Related papers (2023-10-23T17:59:16Z) - Bilevel Fast Scene Adaptation for Low-Light Image Enhancement [50.639332885989255]
Enhancing images in low-light scenes is a challenging but widely concerned task in the computer vision.
Main obstacle lies in the modeling conundrum from distribution discrepancy across different scenes.
We introduce the bilevel paradigm to model the above latent correspondence.
A bilevel learning framework is constructed to endow the scene-irrelevant generality of the encoder towards diverse scenes.
arXiv Detail & Related papers (2023-06-02T08:16:21Z) - SIM-Trans: Structure Information Modeling Transformer for Fine-grained
Visual Categorization [59.732036564862796]
We propose the Structure Information Modeling Transformer (SIM-Trans) to incorporate object structure information into transformer for enhancing discriminative representation learning.
The proposed two modules are light-weighted and can be plugged into any transformer network and trained end-to-end easily.
Experiments and analyses demonstrate that the proposed SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks.
arXiv Detail & Related papers (2022-08-31T03:00:07Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for
Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for
Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.