A System for Name and Address Parsing with Large Language Models
- URL: http://arxiv.org/abs/2601.18014v1
- Date: Sun, 25 Jan 2026 22:19:47 GMT
- Title: A System for Name and Address Parsing with Large Language Models
- Authors: Adeeba Tarannum, Muzakkiruddin Ahmed Mohammed, Mert Can Cakmak, Shames Al Mandalawi, John Talburt
- Abstract summary: This paper introduces a prompt-driven, validation-centered framework that converts free-text records into a consistent 17-field schema without fine-tuning. Evaluations on heterogeneous real-world address data show high field-level accuracy, strong schema adherence, and stable confidence calibration.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reliable transformation of unstructured person and address text into structured data remains a key challenge in large-scale information systems. Traditional rule-based and probabilistic approaches perform well on clean inputs but fail under noisy or multilingual conditions, while neural and large language models (LLMs) often lack deterministic control and reproducibility. This paper introduces a prompt-driven, validation-centered framework that converts free-text records into a consistent 17-field schema without fine-tuning. The method integrates input normalisation, structured prompting, constrained decoding, and strict rule-based validation under fixed experimental settings to ensure reproducibility. Evaluations on heterogeneous real-world address data show high field-level accuracy, strong schema adherence, and stable confidence calibration. The results demonstrate that combining deterministic validation with generative prompting provides a robust, interpretable, and scalable solution for structured information extraction, offering a practical alternative to training-heavy or domain-specific models.
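The abstract describes a pipeline of input normalisation, structured prompting, constrained decoding, and rule-based validation against a 17-field schema. Below is a minimal sketch of how such a pipeline could be wired together; it is not the authors' implementation. The field names are an illustrative subset (the listing does not enumerate the 17 fields), the `llm` callable stands in for whatever model API is used, and constrained decoding is approximated here by parsing and strictly validating the model's JSON output.

```python
import json
import re
from typing import Callable

# Hypothetical subset of the paper's 17-field schema; names are illustrative only.
SCHEMA_FIELDS = [
    "first_name", "middle_name", "last_name", "organization",
    "street_number", "street_name", "unit", "city",
    "state", "postal_code", "country",
]

PROMPT_TEMPLATE = (
    "Parse the record into JSON with exactly these keys: {keys}.\n"
    "Use an empty string for any field not present. Output JSON only.\n"
    "Record: {record}"
)

def normalize(text: str) -> str:
    """Input normalisation: collapse whitespace and trim the record."""
    return re.sub(r"\s+", " ", text).strip()

def build_prompt(record: str) -> str:
    """Structured prompting: fixed instruction plus the normalised record."""
    return PROMPT_TEMPLATE.format(keys=", ".join(SCHEMA_FIELDS), record=record)

def validate(parsed: dict) -> dict:
    """Rule-based validation: enforce the exact key set and simple field rules."""
    if set(parsed) != set(SCHEMA_FIELDS):
        raise ValueError("schema violation: key set does not match the schema")
    if parsed["postal_code"] and not re.fullmatch(r"[A-Za-z0-9 -]{3,10}", parsed["postal_code"]):
        raise ValueError("invalid postal_code")
    return parsed

def parse_record(record: str, llm: Callable[[str], str]) -> dict:
    """End-to-end: normalise -> prompt -> decode JSON -> validate."""
    raw = llm(build_prompt(normalize(record)))
    return validate(json.loads(raw))
```

In this sketch, a record that fails JSON decoding or validation would simply raise; a production system could instead route such records to retry or manual review, but that policy is not specified in the listing.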
Related papers
- Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval [60.25608870901428]
Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source robustness.
arXiv Detail & Related papers (2026-03-05T18:42:51Z)
- A high-capacity linguistic steganography based on entropy-driven rank-token mapping [81.29800498695899]
Linguistic steganography enables covert communication by embedding secret messages into innocuous texts. Traditional modification-based methods introduce detectable anomalies, while retrieval-based strategies suffer from low embedding capacity. We propose an entropy-driven framework called RTMStega that integrates rank-based adaptive coding and context-aware decompression with normalized entropy.
arXiv Detail & Related papers (2025-10-27T06:02:47Z)
- Autoformalizer with Tool Feedback [52.334957386319864]
Autoformalization addresses the scarcity of data for Automated Theorem Proving (ATP) by translating mathematical problems from natural language into formal statements. Existing formalizers still struggle to consistently generate valid statements that meet syntactic validity and semantic consistency. We propose the Autoformalizer with Tool Feedback (ATF), a novel approach that incorporates syntactic and consistency information as tools into the formalization process.
arXiv Detail & Related papers (2025-10-08T10:25:12Z) - Revisiting Multivariate Time Series Forecasting with Missing Values [65.30332997607141]
Missing values are common in real-world time series. Current approaches have developed an imputation-then-prediction framework that uses imputation modules to fill in missing values, followed by forecasting on the imputed data. This framework overlooks a critical issue: there is no ground truth for the missing values, making the imputation process susceptible to errors that can degrade prediction accuracy. We introduce Consistency-Regularized Information Bottleneck (CRIB), a novel framework built on the Information Bottleneck principle.
arXiv Detail & Related papers (2025-09-27T20:57:48Z) - RationAnomaly: Log Anomaly Detection with Rationality via Chain-of-Thought and Reinforcement Learning [27.235259453535537]
RationAnomaly is a novel framework that enhances log anomaly detection by synergizing Chain-of-Thought fine-tuning with reinforcement learning. We have released the corresponding resources, including code and datasets.
arXiv Detail & Related papers (2025-09-18T07:35:58Z) - Generating Highly Structured Test Inputs Leveraging Constraint-Guided Graph Refinement [4.121384394709256]
This study investigates whether test inputs for structured domains can be unified through a graph-based representation. We will evaluate the effectiveness of this approach in enhancing input validity and semantic preservation across eight AI systems.
arXiv Detail & Related papers (2025-07-28T18:54:04Z) - ADALog: Adaptive Unsupervised Anomaly detection in Logs with Self-attention Masked Language Model [2.55347686868565]
ADALog is an adaptive, unsupervised anomaly detection framework. It operates on individual unstructured logs, extracts intra-log contextual relationships, and performs adaptive thresholding on normal data. We evaluate ADALog on the benchmark datasets BGL, Thunderbird, and Spirit.
arXiv Detail & Related papers (2025-05-15T17:31:40Z) - SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation On Diverse Modalities [50.6382396309597]
Unsupervised Domain Adaptation (DA) consists of adapting a model trained on a labeled source domain to perform well on an unlabeled target domain with some data distribution shift. We present a complete and fair evaluation of existing shallow algorithms, including reweighting, mapping, and subspace alignment. Our benchmark highlights the importance of realistic validation and provides practical guidance for real-life applications.
arXiv Detail & Related papers (2024-07-16T12:52:29Z) - Bridging Textual and Tabular Worlds for Fact Verification: A Lightweight, Attention-Based Model [34.1224836768324]
FEVEROUS is a benchmark and research initiative focused on fact extraction and verification tasks.
This paper introduces a simple yet powerful model that nullifies the need for modality conversion.
Our approach efficiently exploits latent connections between different data types, thereby yielding comprehensive and reliable verdict predictions.
arXiv Detail & Related papers (2024-03-26T03:54:25Z) - You Can Generate It Again: Data-to-Text Generation with Verification and Correction Prompting [24.738004421537926]
Small language models like T5 excel at generating high-quality text for data-to-text tasks. However, they frequently miss keywords, which is considered one of the most severe and common errors in this task. We explore the potential of using feedback systems to enhance semantic fidelity in smaller language models for data-to-text generation tasks.
arXiv Detail & Related papers (2023-06-28T05:34:25Z) - Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [49.15931834209624]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique. By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate across the overall cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z) - SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)