Explaining Categorical Feature Interactions Using Graph Covariance and LLMs
- URL: http://arxiv.org/abs/2501.14932v1
- Date: Fri, 24 Jan 2025 21:41:26 GMT
- Title: Explaining Categorical Feature Interactions Using Graph Covariance and LLMs
- Authors: Cencheng Shen, Darren Edge, Jonathan Larson, Carey E. Priebe
- Abstract summary: This paper focuses on the global synthetic dataset from the Counter Trafficking Data Collaborative.
It contains over 200,000 anonymized records spanning from 2002 to 2022 with numerous categorical features for each record.
We propose a fast and scalable method for analyzing and extracting significant categorical feature interactions.
- Score: 18.44675735926458
- Abstract: Modern datasets often consist of numerous samples with abundant features and associated timestamps. Analyzing such datasets to uncover underlying events typically requires complex statistical methods and substantial domain expertise. A notable example, and the primary data focus of this paper, is the global synthetic dataset from the Counter Trafficking Data Collaborative (CTDC) -- a global hub of human trafficking data containing over 200,000 anonymized records spanning from 2002 to 2022, with numerous categorical features for each record. In this paper, we propose a fast and scalable method for analyzing and extracting significant categorical feature interactions, and querying large language models (LLMs) to generate data-driven insights that explain these interactions. Our approach begins with a binarization step for categorical features using one-hot encoding, followed by the computation of graph covariance at each time. This graph covariance quantifies temporal changes in dependence structures within categorical data and is established as a consistent dependence measure under the Bernoulli distribution. We use this measure to identify significant feature pairs, such as those with the most frequent trends over time or those exhibiting sudden spikes in dependence at specific moments. These extracted feature pairs, along with their timestamps, are subsequently passed to an LLM tasked with generating potential explanations of the underlying events driving these dependence changes. The effectiveness of our method is demonstrated through extensive simulations, and its application to the CTDC dataset reveals meaningful feature pairs and potential data stories underlying the observed feature interactions.
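The pipeline described in the abstract (one-hot encoding, a per-time dependence score, then ranking feature pairs) can be illustrated with a minimal sketch. It is not the authors' implementation: plain sample covariance between binary indicator columns stands in for the paper's graph covariance, whose exact definition is not given in this listing. For two Bernoulli variables this covariance equals P(X=1, Y=1) - P(X=1)P(Y=1), which is zero exactly when they are independent. The field names (`gender`, `labor`, `year`), the toy records, and the helper functions are all hypothetical.

```python
import numpy as np
from itertools import combinations


def one_hot(records, features):
    """Binarize categorical features: one 0/1 column per (feature, value)."""
    names = sorted({(f, r[f]) for r in records for f in features})
    index = {name: j for j, name in enumerate(names)}
    X = np.zeros((len(records), len(names)))
    for i, r in enumerate(records):
        for f in features:
            X[i, index[(f, r[f])]] = 1.0
    return X, names


def pairwise_covariance_by_time(records, features, time_key="year"):
    """Sample covariance between indicator columns, computed per time value.

    ASSUMPTION: ordinary covariance is a stand-in for graph covariance.
    """
    X, names = one_hot(records, features)
    times = np.array([r[time_key] for r in records])
    cov_by_time = {}
    for t in sorted(set(times.tolist())):
        Xt = X[times == t]
        mean = Xt.mean(axis=0)
        # E[XY] - E[X]E[Y] for every pair of columns at once
        cov_by_time[t] = Xt.T @ Xt / len(Xt) - np.outer(mean, mean)
    return cov_by_time, names


def top_pairs(cov, names, k=3):
    """Rank cross-feature column pairs by absolute covariance."""
    pairs = [
        (abs(cov[i, j]), names[i], names[j])
        for i, j in combinations(range(len(names)), 2)
        if names[i][0] != names[j][0]  # skip pairs from the same feature
    ]
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs[:k]


# Hypothetical toy records: the two features are fully dependent in 2020
# and independent in 2021.
records = [
    {"year": 2020, "gender": "F", "labor": "yes"},
    {"year": 2020, "gender": "F", "labor": "yes"},
    {"year": 2020, "gender": "M", "labor": "no"},
    {"year": 2020, "gender": "M", "labor": "no"},
    {"year": 2021, "gender": "F", "labor": "yes"},
    {"year": 2021, "gender": "F", "labor": "no"},
    {"year": 2021, "gender": "M", "labor": "yes"},
    {"year": 2021, "gender": "M", "labor": "no"},
]

cov_by_time, names = pairwise_covariance_by_time(records, ["gender", "labor"])
strongest_2020 = top_pairs(cov_by_time[2020], names, k=1)
```

In a workflow like the one the abstract outlines, pairs whose score spikes at a particular time would be the ones forwarded, with their timestamps, to an LLM for candidate explanations.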
Related papers
- Tackling Data Heterogeneity in Federated Time Series Forecasting [61.021413959988216]
Time series forecasting plays a critical role in various real-world applications, including energy consumption prediction, disease transmission monitoring, and weather forecasting.
Most existing methods rely on a centralized training paradigm, where large amounts of data are collected from distributed devices to a central cloud server.
We propose a novel framework, Fed-TREND, to address data heterogeneity by generating informative synthetic data as auxiliary knowledge carriers.
arXiv Detail & Related papers (2024-11-24T04:56:45Z)
- TimeGraphs: Graph-based Temporal Reasoning [64.18083371645956]
TimeGraphs is a novel approach that characterizes dynamic interactions as a hierarchical temporal graph.
Our approach models the interactions using a compact graph-based representation, enabling adaptive reasoning across diverse time scales.
We evaluate TimeGraphs on multiple datasets with complex, dynamic agent interactions, including a football simulator, the Resistance game, and the MOMA human activity dataset.
arXiv Detail & Related papers (2024-01-06T06:26:49Z)
- Coupled Attention Networks for Multivariate Time Series Anomaly Detection [10.620044922371177]
We propose a coupled attention-based neural network framework (CAN) for anomaly detection in multivariate time series data.
To capture inter-sensor relationships and temporal dependencies, a convolutional neural network based on the global-local graph is integrated with a temporal self-attention module.
arXiv Detail & Related papers (2023-06-12T13:42:56Z)
- Dynamic Relation Discovery and Utilization in Multi-Entity Time Series Forecasting [92.32415130188046]
In many real-world scenarios, crucial yet implicit relations may exist between entities.
We propose an attentional multi-graph neural network with automatic graph learning (A2GNN) in this work.
arXiv Detail & Related papers (2022-02-18T11:37:04Z)
- PIETS: Parallelised Irregularity Encoders for Forecasting with Heterogeneous Time-Series [5.911865723926626]
Heterogeneity and irregularity of multi-source data sets present a significant challenge to time-series analysis.
In this work, we design a novel architecture, PIETS, to model heterogeneous time-series.
We show that PIETS is able to effectively model heterogeneous temporal data and outperforms other state-of-the-art approaches in the prediction task.
arXiv Detail & Related papers (2021-09-30T20:01:19Z)
- OR-Net: Pointwise Relational Inference for Data Completion under Partial Observation [51.083573770706636]
This work uses relational inference to fill in the incomplete data.
We propose Omni-Relational Network (OR-Net) to model the pointwise relativity in two aspects.
arXiv Detail & Related papers (2021-05-02T06:05:54Z)
- Mining Feature Relationships in Data [0.0]
Feature relationship mining (FRM) uses a genetic programming approach to automatically discover symbolic relationships between continuous or categorical features in data.
Our proposed approach is the first such symbolic approach with the goal of explicitly discovering relationships between features.
Empirical testing on a variety of real-world datasets shows the proposed method is able to find high-quality, simple feature relationships.
arXiv Detail & Related papers (2021-02-02T07:06:16Z)
- Connecting the Dots: Multivariate Time Series Forecasting with Graph Neural Networks [91.65637773358347]
We propose a general graph neural network framework designed specifically for multivariate time series data.
Our approach automatically extracts the uni-directed relations among variables through a graph learning module.
Our proposed model outperforms the state-of-the-art baseline methods on 3 of 4 benchmark datasets.
arXiv Detail & Related papers (2020-05-24T04:02:18Z)
- Modeling Rare Interactions in Time Series Data Through Qualitative Change: Application to Outcome Prediction in Intensive Care Units [1.0349800230036503]
We present a model for uncovering the interactions most likely to have generated the outcomes observed in high-dimensional time series data.
Using the assumption that similar templates of small interactions are responsible for the outcomes, we reformulate the discovery task to retrieve the most-likely templates from the data.
arXiv Detail & Related papers (2020-04-03T08:49:40Z)
- Transformer Hawkes Process [79.16290557505211]
We propose a Transformer Hawkes Process (THP) model, which leverages the self-attention mechanism to capture long-term dependencies.
THP outperforms existing models in terms of both likelihood and event prediction accuracy by a notable margin.
We provide a concrete example, where THP achieves improved prediction performance for learning multiple point processes when incorporating their relational information.
arXiv Detail & Related papers (2020-02-21T13:48:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.