Layout-Independent License Plate Recognition via Integrated Vision and Language Models
- URL: http://arxiv.org/abs/2510.10533v1
- Date: Sun, 12 Oct 2025 10:25:21 GMT
- Title: Layout-Independent License Plate Recognition via Integrated Vision and Language Models
- Authors: Elham Shabaninia, Fatemeh Asadi-zeydabadi, Hossein Nezamabadi-pour,
- Abstract summary: This work presents a pattern-aware framework for automatic license plate recognition (ALPR). It is designed to operate reliably across diverse plate layouts and challenging real-world conditions. Experimental results demonstrate superior accuracy and robustness compared to recent segmentation-free approaches.
- Score: 6.302166748545872
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work presents a pattern-aware framework for automatic license plate recognition (ALPR), designed to operate reliably across diverse plate layouts and challenging real-world conditions. The proposed system consists of a modern, high-precision detection network followed by a recognition stage that integrates a transformer-based vision model with an iterative language modelling mechanism. This unified recognition stage performs character identification and post-OCR refinement in a seamless process, learning the structural patterns and formatting rules specific to license plates without relying on explicit heuristic corrections or manual layout classification. Through this design, the system jointly optimizes visual and linguistic cues, enables iterative refinement to improve OCR accuracy under noise, distortion, and unconventional fonts, and achieves layout-independent recognition across multiple international datasets (IR-LPR, UFPR-ALPR, AOLP). Experimental results demonstrate superior accuracy and robustness compared to recent segmentation-free approaches, highlighting how embedding pattern analysis within the recognition stage bridges computer vision and language modelling for enhanced adaptability in intelligent transportation and surveillance applications.
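The abstract describes a two-stage design: a plate detector followed by a unified recognition stage in which a transformer-based vision model and an iterative language-modelling step jointly decode and refine the character string. As a rough illustration of that control flow only, the PyTorch sketch below wires a toy visual encoder to a small transformer that re-decodes low-confidence characters over a few passes; every module name, size, threshold, and the mask-and-re-decode rule here are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): a recogniser that decodes plate characters
# from a detected crop and iteratively re-decodes low-confidence slots using a
# transformer pass over the current character hypothesis. All names, sizes, and the
# confidence rule are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ-"  # assumed character set
MAX_LEN = 9                                       # assumed maximum plate length


class VisionEncoder(nn.Module):
    """Encodes a plate crop into a sequence of visual tokens (stand-in for a ViT)."""
    def __init__(self, d_model: int = 128, n_tokens: int = 32):
        super().__init__()
        self.proj = nn.Linear(3 * 32 * 128, n_tokens * d_model)
        self.n_tokens, self.d_model = n_tokens, d_model

    def forward(self, crop: torch.Tensor) -> torch.Tensor:
        b = crop.shape[0]
        return self.proj(crop.flatten(1)).view(b, self.n_tokens, self.d_model)


class IterativeRecognizer(nn.Module):
    """Decodes characters, then refines low-confidence predictions by re-running a
    small transformer over visual tokens plus the current character hypothesis."""
    def __init__(self, d_model: int = 128):
        super().__init__()
        self.encoder = VisionEncoder(d_model)
        self.char_emb = nn.Embedding(len(VOCAB) + 1, d_model)  # +1 for a [MASK] slot
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, len(VOCAB))
        self.queries = nn.Parameter(torch.randn(MAX_LEN, d_model))

    def forward(self, crop: torch.Tensor, n_iters: int = 2, thresh: float = 0.9):
        vis = self.encoder(crop)                        # (B, T, D) visual tokens
        hyp = self.queries.expand(crop.shape[0], -1, -1)
        for _ in range(n_iters):
            ctx = self.lm(torch.cat([vis, hyp], dim=1))[:, -MAX_LEN:]
            probs = self.head(ctx).softmax(-1)
            conf, ids = probs.max(-1)
            # Re-embed confident characters as linguistic context for the next pass;
            # low-confidence slots fall back to the [MASK] embedding.
            masked = torch.where(conf > thresh, ids, torch.full_like(ids, len(VOCAB)))
            hyp = self.char_emb(masked)
        return ids, conf


if __name__ == "__main__":
    model = IterativeRecognizer()
    crops = torch.rand(2, 3, 32, 128)   # plate crops from an upstream detector
    with torch.no_grad():
        ids, conf = model(crops)
    plates = ["".join(VOCAB[i] for i in row) for row in ids.tolist()]
    print(plates, conf.min().item())    # untrained model: arbitrary characters
```

The point mirrored here is only that character identification and linguistic refinement sit in one loop, rather than in a separate post-OCR correction stage as in heuristic pipelines.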
Related papers
- LP-LLM: End-to-End Real-World Degraded License Plate Text Recognition via Large Multimodal Models [4.497411606350301]
Real-world License Plate Recognition (LPR) faces significant challenges from severe degradations such as motion blur, low resolution, and complex illumination. The prevailing "restoration-then-recognition" two-stage paradigm suffers from a fundamental flaw: the pixel-level optimization objectives of image restoration models are misaligned with the semantic goals of character recognition. We propose an end-to-end structure-aware multimodal reasoning framework based on Qwen3-VL.
arXiv Detail & Related papers (2026-01-14T03:32:55Z) - Automated Invoice Data Extraction: Using LLM and OCR [0.0]
This work introduces a holistic Artificial Intelligence (AI) platform combining OCR, deep learning, Large Language Models (LLMs) and graph analytics to achieve unprecedented extraction quality and consistency.
arXiv Detail & Related papers (2025-11-01T19:05:09Z) - Look before Transcription: End-to-End SlideASR with Visually-Anchored Policy Optimization [28.984638316524464]
We propose Visually-Anchored Policy Optimization (VAPO) to control the model's reasoning process. VAPO enforces a structured "Look before Transcription" procedure using a think-then-answer output format. This reasoning process is optimized via reinforcement learning with four distinct rewards targeting format compliance, OCR accuracy, ASR quality, and visual anchoring consistency.
arXiv Detail & Related papers (2025-10-08T08:18:47Z) - MGCR-Net:Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection [55.702662643521265]
We propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to explore the semantic interaction capabilities of multimodal data. Experimental results on four public datasets demonstrate that MGCR achieves superior performance compared to mainstream CD methods.
arXiv Detail & Related papers (2025-08-03T02:50:08Z) - AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z) - Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent [8.212818176634116]
We extend the Llama-Adapter architecture by incorporating a YOLOS-based detection network alongside the CLIP perception network.
Our approach introduces camera ID-separators to improve multi-view processing, crucial for comprehensive environmental awareness.
arXiv Detail & Related papers (2024-11-08T15:50:30Z) - Dual-Image Enhanced CLIP for Zero-Shot Anomaly Detection [58.228940066769596]
We introduce a Dual-Image Enhanced CLIP approach, leveraging a joint vision-language scoring system.
Our methods process pairs of images, utilizing each as a visual reference for the other, thereby enriching the inference process with visual context.
Our approach significantly exploits the potential of vision-language joint anomaly detection and achieves performance comparable to current SOTA methods across various datasets.
arXiv Detail & Related papers (2024-05-08T03:13:20Z) - A Transformer Model for Boundary Detection in Continuous Sign Language [55.05986614979846]
The Transformer model is employed for both Isolated Sign Language Recognition and Continuous Sign Language Recognition.
The training process involves using isolated sign videos, where hand keypoint features extracted from the input video are enriched.
The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos.
arXiv Detail & Related papers (2024-02-22T17:25:01Z) - Image Translation as Diffusion Visual Programmers [52.09889190442439]
Diffusion Visual Programmer (DVP) is a neuro-symbolic image translation framework.
Our framework seamlessly embeds a condition-flexible diffusion model within the GPT architecture.
Extensive experiments demonstrate DVP's remarkable performance, surpassing concurrent arts.
arXiv Detail & Related papers (2024-01-18T05:50:09Z) - Towards Auto-Modeling of Formal Verification for NextG Protocols: A Multimodal cross- and self-attention Large Language Model Approach [3.9155346446573502]
This paper introduces Auto-modeling of Formal Verification with Real-world Prompting for 5G and NextG protocols (AVRE).
AVRE is a novel system designed for the formal verification of Next Generation (NextG) communication protocols.
arXiv Detail & Related papers (2023-12-28T20:41:24Z) - Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution combines Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) with deep neural network (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z) - Bootstrap Fine-Grained Vision-Language Alignment for Unified Zero-Shot Anomaly Localization [63.61093388441298]
Contrastive Language-Image Pre-training models have shown promising performance on zero-shot visual recognition tasks.
In this work, we propose AnoCLIP for zero-shot anomaly localization.
arXiv Detail & Related papers (2023-08-30T10:35:36Z) - UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
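The UniDiff summary above names three training objectives (ITC, IS, and RSC). Purely as an illustration of how such multi-objective training is commonly wired, the snippet below combines per-objective losses into one weighted total; the loss values and weights are placeholders, not the paper's actual formulation.

```python
# Illustrative only: combining several training objectives into a single weighted
# loss, in the spirit of UniDiff's ITC + IS + RSC setup. The individual terms and
# weights below are placeholders, not the paper's definitions.
import torch


def combined_loss(loss_itc: torch.Tensor,
                  loss_is: torch.Tensor,
                  loss_rsc: torch.Tensor,
                  w_itc: float = 1.0,
                  w_is: float = 1.0,
                  w_rsc: float = 0.5) -> torch.Tensor:
    """Weighted sum of contrastive (ITC), synthesis (IS), and consistency (RSC) losses."""
    return w_itc * loss_itc + w_is * loss_is + w_rsc * loss_rsc


if __name__ == "__main__":
    # Dummy scalar losses standing in for the real objectives.
    total = combined_loss(torch.tensor(0.8), torch.tensor(1.2), torch.tensor(0.3))
    print(total.item())  # 0.8*1.0 + 1.2*1.0 + 0.3*0.5 = 2.15
```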
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.