Neural Sentinel: Unified Vision Language Model (VLM) for License Plate Recognition with Human-in-the-Loop Continual Learning
- URL: http://arxiv.org/abs/2602.07051v1
- Date: Wed, 04 Feb 2026 16:04:15 GMT
- Title: Neural Sentinel: Unified Vision Language Model (VLM) for License Plate Recognition with Human-in-the-Loop Continual Learning
- Authors: Karthik Sivakoti
- Abstract summary: This research presents Neural Sentinel, a novel unified approach that performs license plate recognition, state classification, and vehicle attribute extraction through a single forward pass. Our primary contribution lies in demonstrating that a fine-tuned PaliGemma 3B model, adapted via Low-Rank Adaptation (LoRA), can simultaneously answer multiple visual questions about vehicle images. The system achieves a mean inference latency of 152 ms with an Expected Calibration Error (ECE) of 0.048, indicating well-calibrated confidence estimates.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Traditional Automatic License Plate Recognition (ALPR) systems employ multi-stage pipelines consisting of object detection networks followed by separate Optical Character Recognition (OCR) modules, introducing compounding errors, increased latency, and architectural complexity. This research presents Neural Sentinel, a novel unified approach that leverages Vision Language Models (VLMs) to perform license plate recognition, state classification, and vehicle attribute extraction through a single forward pass. Our primary contribution lies in demonstrating that a fine-tuned PaliGemma 3B model, adapted via Low-Rank Adaptation (LoRA), can simultaneously answer multiple visual questions about vehicle images, achieving 92.3% plate recognition accuracy, which is a 14.1% improvement over EasyOCR and a 9.9% improvement over PaddleOCR baselines. We introduce a Human-in-the-Loop (HITL) continual learning framework that incorporates user corrections while preventing catastrophic forgetting through experience replay, maintaining a 70:30 ratio of original training data to correction samples. The system achieves a mean inference latency of 152 ms with an Expected Calibration Error (ECE) of 0.048, indicating well-calibrated confidence estimates. Additionally, the VLM-first architecture enables zero-shot generalization to auxiliary tasks including vehicle color detection (89%), seatbelt detection (82%), and occupancy counting (78%) without task-specific training. Through extensive experimentation on real-world toll plaza imagery, we demonstrate that unified vision language approaches represent a paradigm shift in ALPR systems, offering superior accuracy, reduced architectural complexity, and emergent multi-task capabilities that traditional pipeline approaches cannot achieve.
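The abstract reports an Expected Calibration Error (ECE) of 0.048 as evidence of well-calibrated confidence estimates. As a point of reference, ECE bins predictions by confidence and takes the sample-weighted average of the gap between each bin's mean confidence and its accuracy. The sketch below illustrates that standard computation only; the equal-width 10-bin scheme is an assumption, since the paper excerpt does not state its binning choice, and this is not the authors' implementation.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted mean |accuracy - confidence| over
    equal-width confidence bins (a common, but not the only, scheme)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins (lo, hi]; a confidence of exactly 0.0 would be
        # dropped, which is harmless for softmax-style scores.
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        bin_acc = correct[mask].mean()
        bin_conf = confidences[mask].mean()
        ece += (mask.sum() / n) * abs(bin_acc - bin_conf)
    return ece
```

A perfectly calibrated batch (e.g. confidence 0.8 with 80% of predictions correct) yields an ECE of 0, while systematic overconfidence drives it toward 1; a value of 0.048 therefore indicates a small average confidence-accuracy gap.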
Related papers
- Next-Generation License Plate Detection and Recognition System using YOLOv8 [0.0]
This study examines the performance of YOLOv8 variants on License Plate Recognition (LPR) and Character Recognition tasks. The YOLOv8 Nano variant demonstrated a precision of 0.964 and mAP50 of 0.918 on the LPR task, while the YOLOv8 Small variant exhibited a precision of 0.92 and mAP50 of 0.91 on the Character Recognition task.
arXiv Detail & Related papers (2025-12-18T18:06:29Z) - Contrastive Learning-Driven Traffic Sign Perception: Multi-Modal Fusion of Text and Vision [2.0720154517628417]
We propose a novel framework combining open-vocabulary detection and cross-modal learning. For traffic sign detection, our NanoVerse YOLO model integrates a vision-language path aggregation network (RepVL-PAN) and an SPD-Conv module. For traffic sign classification, we designed a Traffic Sign Recognition Multimodal Contrastive Learning model (TSR-MCL). On the TT100K dataset, our method achieves a state-of-the-art 78.4% mAP in the long-tail detection task for all-class recognition.
arXiv Detail & Related papers (2025-07-31T08:23:30Z) - Connecting Vision and Emissions: A Behavioural AI Approach to Carbon Estimation in Road Design [0.0]
We present an enhanced YOLOv8 real-time vehicle detection and classification framework for estimating carbon emissions in urban environments. The framework incorporates a hybrid pipeline where each detected vehicle is tracked and its bounding box is cropped and passed to a deep Optical Character Recognition (OCR) module. This OCR system, composed of multiple convolutional neural network (CNN) layers, is trained specifically for character-level detection and license plate decoding.
arXiv Detail & Related papers (2025-06-18T11:50:24Z) - Exploring FMCW Radars and Feature Maps for Activity Recognition: A Benchmark Study [2.251010251400407]
This study introduces a Frequency-Modulated Continuous Wave (FMCW) radar-based framework for human activity recognition. Unlike conventional approaches that process feature maps as images, this study feeds multi-dimensional feature maps as data vectors. The ConvLSTM model outperformed conventional machine learning and deep learning models, achieving an accuracy of 90.51%.
arXiv Detail & Related papers (2025-03-07T17:53:29Z) - Bridging the Gap Between End-to-End and Two-Step Text Spotting [88.14552991115207]
Bridging Text Spotting is a novel approach that resolves the error accumulation and suboptimal performance issues in two-step methods.
We demonstrate the effectiveness of the proposed method through extensive experiments.
arXiv Detail & Related papers (2024-04-06T13:14:04Z) - MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z) - Self-Supervised Representation Learning from Temporal Ordering of Automated Driving Sequences [49.91741677556553]
We propose TempO, a temporal ordering pretext task for pre-training region-level feature representations for perception tasks.
We embed each frame by an unordered set of proposal feature vectors, a representation that is natural for object detection or tracking systems.
Extensive evaluations on the BDD100K, nuImages, and MOT17 datasets show that our TempO pre-training approach outperforms single-frame self-supervised learning methods.
arXiv Detail & Related papers (2023-02-17T18:18:27Z) - Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers [105.89564687747134]
We propose a self-regularized AutoAugment method to learn views for self-supervised vision transformers.
First, we reduce the search cost of AutoView to nearly zero by learning views and network parameters simultaneously.
We also present a curated augmentation policy search space for self-supervised learning.
arXiv Detail & Related papers (2022-10-16T06:20:44Z) - The AI Mechanic: Acoustic Vehicle Characterization Neural Networks [1.8275108630751837]
We introduce the AI mechanic, an acoustic vehicle characterization deep learning system, using sound captured from mobile devices.
We build a convolutional neural network that predicts and cascades vehicle attributes to enhance fault detection.
Our cascading architecture additionally achieved 93.6% validation and 86.8% test set accuracy on misfire fault prediction, demonstrating margins of 16.4% / 7.8% and 4.2% / 1.5% improvement over naïve and parallel baselines.
arXiv Detail & Related papers (2022-05-19T16:29:26Z) - Pluggable Weakly-Supervised Cross-View Learning for Accurate Vehicle Re-Identification [53.6218051770131]
Cross-view consistent feature representation is key for accurate vehicle ReID.
Existing approaches resort to supervised cross-view learning using extensive extra viewpoints annotations.
We present a pluggable Weakly-supervised Cross-View Learning (WCVL) module for vehicle ReID.
arXiv Detail & Related papers (2021-03-09T11:51:09Z) - Automatic Counting and Identification of Train Wagons Based on Computer Vision and Deep Learning [70.84106972725917]
The proposed solution is cost-effective and can easily replace solutions based on radiofrequency identification (RFID).
The system is able to automatically reject some of the train wagons it successfully counted, as they have damaged identification codes.
arXiv Detail & Related papers (2020-10-30T14:56:54Z)