A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media
- URL: http://arxiv.org/abs/2512.16183v1
- Date: Thu, 18 Dec 2025 05:08:26 GMT
- Title: A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media
- Authors: Mengfan Shen, Kangqi Song, Xindi Wang, Wei Jia, Tao Wang, Ziqiang Han,
- Abstract summary: We develop a domain-adapted extraction pipeline for structured information extraction from police incident announcements. We use a high-quality, manually annotated dataset of 4,933 instances derived from 27,822 police briefing posts on Chinese Weibo. We show that LoRA-based fine-tuning significantly improved performance over both the base and instruction-tuned models.
- Score: 11.463924147467297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Structured information extraction from police incident announcements is crucial for timely and accurate data processing, yet presents considerable challenges due to the variability and informal nature of textual sources such as social media posts. To address these challenges, we developed a domain-adapted extraction pipeline that leverages targeted prompt engineering with parameter-efficient fine-tuning of the Qwen2.5-7B model using Low-Rank Adaptation (LoRA). This approach enables the model to handle noisy, heterogeneous text while reliably extracting 15 key fields, including location, event characteristics, and impact assessment, from a high-quality, manually annotated dataset of 4,933 instances derived from 27,822 police briefing posts on Chinese Weibo (2019-2020). Experimental results demonstrated that LoRA-based fine-tuning significantly improved performance over both the base and instruction-tuned models, achieving an accuracy exceeding 98.36% for mortality detection and Exact Match Rates of 95.31% for fatality counts and 95.54% for province-level location extraction. The proposed pipeline thus provides a validated and efficient solution for multi-task structured information extraction in specialized domains, offering a practical framework for transforming unstructured text into reliable structured data in social science research.
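The Exact Match Rates quoted in the abstract are per-field scores. As a rough illustration only, the sketch below assumes the usual definition (the share of records whose predicted field value equals the annotated value exactly). The function name `exact_match_rate`, the whitespace stripping, and the sample province values are hypothetical and are not taken from the paper, which may normalize or score fields differently.

```python
from typing import Optional, Sequence


def exact_match_rate(
    predictions: Sequence[Optional[str]],
    references: Sequence[Optional[str]],
) -> float:
    """Return the fraction of records whose predicted value equals the annotation."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute a rate over an empty set")

    def normalize(value: Optional[str]) -> Optional[str]:
        # Assumption: surrounding whitespace is ignored; None means "field absent".
        return None if value is None else value.strip()

    matches = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return matches / len(references)


# Hypothetical province-level predictions and annotations (not from the dataset).
predicted = ["Sichuan", "Hunan", "Beijing ", None]
annotated = ["Sichuan", "Hubei", "Beijing", None]

rate = exact_match_rate(predicted, annotated)
print(f"{rate:.2%}")  # prints 75.00% (3 of 4 records match)
```

Treating an absent field as `None` lets a correctly predicted "no value" count as a match, which is one plausible convention for sparse fields such as fatality counts.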
Related papers
- A Hybrid Deterministic Framework for Named Entity Extraction in Broadcast News Video [0.2864713389096699]
This work presents a comprehensive framework for automatically detecting and extracting personal names from news videos.
It introduces a curated and balanced corpus of annotated frames capturing the diversity of contemporary news graphics.
The pipeline is evaluated against a contrasting class of generative multimodal methods, revealing a clear trade-off between deterministic auditability and inference.
arXiv Detail & Related papers (2026-02-09T19:58:50Z)
- A Lightweight LLM Framework for Disaster Humanitarian Information Classification [0.0]
This paper develops a lightweight, cost-effective framework for disaster tweet classification using parameter-efficient fine-tuning.
We construct a unified experimental corpus by integrating and normalizing the HumAID dataset.
We demonstrate that LoRA achieves 79.62% humanitarian classification accuracy (+37.79% over zero-shot) while training only 2% of parameters.
arXiv Detail & Related papers (2026-01-21T02:05:25Z)
- Structured Uncertainty guided Clarification for LLM Agents [126.26213027785813]
LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures.
We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy.
Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7$\times$ compared to strong prompting and uncertainty-based baselines.
arXiv Detail & Related papers (2025-11-11T21:50:44Z)
- Divide-Then-Align: Honest Alignment based on the Knowledge Boundary of RAG [51.120170062795566]
We propose Divide-Then-Align (DTA) to endow RAG systems with the ability to respond with "I don't know" when the query is out of the knowledge boundary.
DTA balances accuracy with appropriate abstention, enhancing the reliability and trustworthiness of retrieval-augmented systems.
arXiv Detail & Related papers (2025-05-27T08:21:21Z)
- AW-GATCN: Adaptive Weighted Graph Attention Convolutional Network for Event Camera Data Joint Denoising and Object Recognition [5.656347355002156]
Event cameras generate a significant amount of redundant and noisy data beyond essential object structures.
We propose an Adaptive Graph-based Noisy Data Removal framework for Event-based Object Recognition.
Our approach integrates adaptive event segmentation based on normalized density analysis, a multifactorial edge-weighting mechanism, and adaptive graph-based denoising strategies.
arXiv Detail & Related papers (2025-05-16T13:26:00Z)
- Enhancing Disinformation Detection with Explainable AI and Named Entity Replacement [0.1374949083138427]
We show that non-informative elements (e.g., URLs and emoticons) should be pseudo-anonymized before training to avoid model bias.
We evaluate this methodology on internal and external datasets, before and after applying extended data preprocessing and named entity replacement.
The results show that our proposal improves the performance of a disinformation classification method on external test data by 65.78% on average, without a significant decrease in internal test performance.
arXiv Detail & Related papers (2025-02-07T12:01:26Z)
- TextSleuth: Towards Explainable Tampered Text Detection [49.88698441048043]
We propose to explain the basis of tampered text detection with natural language via large multimodal models.
To fill the data gap for this task, we propose a large-scale, comprehensive dataset, ETTD.
Elaborate queries are introduced to generate high-quality anomaly descriptions with GPT4o.
To automatically filter out low-quality annotations, we also propose to prompt GPT4o to recognize tampered texts.
arXiv Detail & Related papers (2024-12-19T13:10:03Z)
- Improving Embedding Accuracy for Document Retrieval Using Entity Relationship Maps and Model-Aware Contrastive Sampling [0.0]
APEX-Embedding-7B is a 7-billion parameter decoder-only text Feature Extraction Model.
Our approach employs two training techniques that yield an emergent improvement in factual focus.
Based on our evaluations, our model establishes a new state-of-the-art standard in text feature extraction for longer context document retrieval tasks.
arXiv Detail & Related papers (2024-10-08T17:36:48Z)
- Boosting Event Extraction with Denoised Structure-to-Text Augmentation [52.21703002404442]
Event extraction aims to recognize pre-defined event triggers and arguments from texts.
Recent data augmentation methods often neglect the problem of grammatical incorrectness.
We propose a denoised structure-to-text augmentation framework for event extraction (DAEE).
arXiv Detail & Related papers (2023-05-16T16:52:07Z)
- Dual flow fusion model for concrete surface crack segmentation [0.0]
Cracks and other damages pose a significant threat to the safe operation of transportation infrastructure.
Deep learning models have been widely applied to practical visual segmentation tasks.
This paper proposes a crack segmentation model based on the fusion of dual streams.
arXiv Detail & Related papers (2023-05-09T02:35:58Z)
- SAIS: Supervising and Augmenting Intermediate Steps for Document-Level Relation Extraction [51.27558374091491]
We propose to explicitly teach the model to capture relevant contexts and entity types by supervising and augmenting intermediate steps (SAIS) for relation extraction.
Based on a broad spectrum of carefully designed tasks, our proposed SAIS method not only extracts relations of better quality due to more effective supervision, but also retrieves the corresponding supporting evidence more accurately.
arXiv Detail & Related papers (2021-09-24T17:37:35Z)
- Domain Adaptative Causality Encoder [52.779274858332656]
We leverage the characteristics of dependency trees and adversarial learning to address the tasks of adaptive causality identification and localisation.
We present a new causality dataset, namely MedCaus, which integrates all types of causality in the text.
arXiv Detail & Related papers (2020-11-27T04:14:55Z)