Extracting Entities and Topics from News and Connecting Criminal Records
- URL: http://arxiv.org/abs/2005.00950v1
- Date: Sun, 3 May 2020 00:06:01 GMT
- Title: Extracting Entities and Topics from News and Connecting Criminal Records
- Authors: Quang Pham, Marija Stanojevic, Zoran Obradovic
- Abstract summary: This paper summarizes methodologies used in extracting entities and topics from a database of criminal records and from a database of newspapers.
Statistical models have been used successfully to study the topics of roughly 300,000 New York Times articles.
Analytical approaches, especially hotspot mapping, have been used in several studies to predict the locations and circumstances of future crimes.
- Score: 6.685013315842082
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of this paper is to summarize methodologies used in extracting
entities and topics from a database of criminal records and from a database of
newspapers. Statistical models have been used successfully to study the topics
of roughly 300,000 New York Times articles. These models have also been used to
analyze entities related to people, organizations, and places (D Newman, 2006).
In addition, analytical approaches, especially hotspot mapping, have been used
in several studies to predict the locations and circumstances of future crimes,
and those approaches have been tested quite successfully (S Chainey, 2008).
Building on these two notions, this research was performed with the intention
of applying data science techniques to analyze a large amount of data, select
valuable intelligence, cluster violations by crime type, and create a crime
graph that changes through time. In this research, the task was to download
criminal datasets from Kaggle and a collection of news articles from Kaggle and
the EAGER project databases, and then to merge these datasets into one general
dataset. The most important goal of this project was to apply statistical and
natural language processing methods to extract entities and topics and to group
similar data points into correct clusters, in order to better understand public
data about U.S.-related crimes.
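The pipeline described in the abstract (merge records, extract entities, group by crime type) can be pictured with a minimal sketch. Everything below is illustrative: the record texts, field names, and the capitalized-token entity heuristic are assumptions for demonstration, not the paper's actual data or method (which used statistical topic models).

```python
import re
from collections import Counter, defaultdict

# Hypothetical mini-corpus standing in for the merged Kaggle / EAGER dataset.
RECORDS = [
    {"text": "Robbery reported near Brooklyn Bridge by NYPD", "type": "robbery"},
    {"text": "NYPD investigates robbery at a Manhattan bank", "type": "robbery"},
    {"text": "Arson suspected in Queens warehouse fire", "type": "arson"},
    {"text": "Warehouse fire in Queens ruled arson by FDNY", "type": "arson"},
]

def extract_entities(text):
    """Naive entity heuristic: capitalized tokens, skipping a sentence-initial one."""
    tokens = re.findall(r"\b[A-Z][A-Za-z]+\b", text)
    return tokens[1:] if tokens and text.startswith(tokens[0]) else tokens

def cluster_by_type(records):
    """Group records by crime type and count entity mentions per cluster."""
    clusters = defaultdict(Counter)
    for rec in records:
        clusters[rec["type"]].update(extract_entities(rec["text"]))
    return clusters

clusters = cluster_by_type(RECORDS)
print(clusters["arson"].most_common(1))  # most-mentioned entity in the arson cluster
```

A real implementation would replace the heuristic with a trained NER model and the type labels with learned topic clusters; the grouping structure, however, stays the same.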
Related papers
- Advancing Crime Linkage Analysis with Machine Learning: A Comprehensive Review and Framework for Data-Driven Approaches [0.0]
Crime linkage is the process of analyzing criminal behavior data to determine whether a pair or group of crime cases are connected or belong to a series of offenses.
This study aims to understand the challenges faced by machine learning approaches in crime linkage and to support foundational knowledge for future data-driven methods.
arXiv Detail & Related papers (2024-10-30T18:22:45Z)
- Entity Extraction from High-Level Corruption Schemes via Large Language Models [4.820586736502356]
This article proposes a new micro-benchmark dataset for algorithms and models that identify individuals and organizations in news articles.
Experimental efforts are also reported, using this dataset, to identify individuals and organizations in financial-crime-related articles.
arXiv Detail & Related papers (2024-09-05T10:27:32Z) - Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs)
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - DataAgent: Evaluating Large Language Models' Ability to Answer Zero-Shot, Natural Language Queries [0.0]
We evaluate OpenAI's GPT-3.5 as a "Language Data Scientist" (LDS)
The model was tested on a diverse set of benchmark datasets to evaluate its performance across multiple standards.
arXiv Detail & Related papers (2024-03-29T22:59:34Z) - A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- Assessing Privacy Risks in Language Models: A Case Study on Summarization Tasks [65.21536453075275]
We focus on the summarization task and investigate the membership inference (MI) attack.
We exploit text similarity and the model's resistance to document modifications as potential MI signals.
We discuss several safeguards for training summarization models to protect against MI attacks and discuss the inherent trade-off between privacy and utility.
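The text-similarity MI signal mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's actual attack: the function name, the use of `difflib`, and the threshold value are all assumptions. The idea is that if a model's summary of a candidate document is unusually close to a known (e.g., training-set) summary, the document is flagged as a likely training member.

```python
from difflib import SequenceMatcher

def similarity_mi_signal(reference_summary, candidate_summary, threshold=0.8):
    """Illustrative membership-inference signal: flag a document as a likely
    training member when the model's summary is very similar to a reference
    summary. The threshold is a hypothetical tuning parameter."""
    score = SequenceMatcher(None, reference_summary, candidate_summary).ratio()
    return score >= threshold, score

# A verbatim match yields the maximum similarity and a positive flag.
member, score = similarity_mi_signal(
    "The court convicted the defendant of robbery.",
    "The court convicted the defendant of robbery.",
)
```

A practical attack would also probe the model's resistance to document modifications, as the paper's summary notes; similarity alone is only one of the two signals.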
arXiv Detail & Related papers (2023-10-20T05:44:39Z)
- Classifying Crime Types using Judgment Documents from Social Media [11.16381622758947]
The task of determining crime types based on criminal behavior facts has become a very important and meaningful task in social science.
The data samples themselves are unevenly distributed, due to the nature of the crime itself.
This article proposes a new training model that addresses this imbalance through natural language processing methods.
arXiv Detail & Related papers (2023-06-29T15:12:24Z)
- Detection and Evaluation of Clusters within Sequential Data [58.720142291102135]
Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees.
In particular, our sequential data is derived from human DNA, written text, animal movement data and financial markets.
It is found that the Block Markov Chain model assumption can indeed produce meaningful insights in exploratory data analyses.
arXiv Detail & Related papers (2022-10-04T15:22:39Z)
- Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature.
We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
arXiv Detail & Related papers (2022-07-18T11:38:32Z)
- The Problem of Zombie Datasets: A Framework For Deprecating Datasets [55.878249096379804]
We examine the public afterlives of several prominent datasets, including ImageNet, 80 Million Tiny Images, MS-Celeb-1M, Duke MTMC, Brainwash, and HRT Transgender.
We propose a dataset deprecation framework that includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocol, and publication checks.
arXiv Detail & Related papers (2021-10-18T20:13:51Z)
- Prediction of Homicides in Urban Centers: A Machine Learning Approach [0.8312466807725921]
This research presents a machine learning model that predicts homicide crimes using a dataset built from generic data.
Analyses were performed with simple and robust algorithms on the created dataset.
Results are considered as a baseline for the proposed problem.
arXiv Detail & Related papers (2020-08-16T19:13:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.