Large language model as user daily behavior data generator: balancing population diversity and individual personality
- URL: http://arxiv.org/abs/2505.17615v1
- Date: Fri, 23 May 2025 08:22:09 GMT
- Title: Large language model as user daily behavior data generator: balancing population diversity and individual personality
- Authors: Haoxin Li, Jingtao Ding, Jiahui Gong, Yong Li
- Abstract summary: We introduce BehaviorGen, a framework that uses large language models to generate high-quality synthetic behavior data. By simulating user behavior based on profiles and real events, BehaviorGen supports data augmentation and replacement in behavior prediction models. We evaluate its performance in scenarios such as pretraining augmentation, fine-tuning replacement, and fine-tuning augmentation, achieving significant improvements in human mobility and smartphone usage predictions.
- Score: 12.464365435176099
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Predicting human daily behavior is challenging due to the complexity of routine patterns and short-term fluctuations. While data-driven models have improved behavior prediction by leveraging empirical data from various platforms and devices, the reliance on sensitive, large-scale user data raises privacy concerns and limits data availability. Synthetic data generation has emerged as a promising solution, though existing methods are often limited to specific applications. In this work, we introduce BehaviorGen, a framework that uses large language models (LLMs) to generate high-quality synthetic behavior data. By simulating user behavior based on profiles and real events, BehaviorGen supports data augmentation and replacement in behavior prediction models. We evaluate its performance in scenarios such as pretraining augmentation, fine-tuning replacement, and fine-tuning augmentation, achieving significant improvements in human mobility and smartphone usage predictions, with gains of up to 18.9%. Our results demonstrate the potential of BehaviorGen to enhance user behavior modeling through flexible and privacy-preserving synthetic data generation.
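The abstract describes conditioning an LLM on a user profile and a few real events to generate synthetic behavior data. The paper does not specify its prompt format, so the field names and event format below are purely illustrative assumptions; only the overall profile-plus-seed-events structure follows the abstract.

```python
# Illustrative sketch of a profile-conditioned generation prompt.
# All field names (age_group, occupation) and the event format are
# assumptions, not BehaviorGen's actual prompt template.

def build_generation_prompt(profile: dict, seed_events: list) -> str:
    """Compose an LLM prompt asking for synthetic daily behavior events
    conditioned on a user profile and a few real seed events."""
    profile_desc = ", ".join(f"{k}: {v}" for k, v in profile.items())
    events_desc = "\n".join(f"- {e}" for e in seed_events)
    return (
        "You are simulating the daily behavior of a smartphone user.\n"
        f"User profile: {profile_desc}\n"
        "Recent real events:\n"
        f"{events_desc}\n"
        "Generate 5 plausible next events as '<time> <location> <activity>' lines."
    )

prompt = build_generation_prompt(
    {"age_group": "25-34", "occupation": "student"},
    ["08:10 dormitory wake_up", "08:45 cafeteria breakfast"],
)
```

The resulting string would be sent to an LLM, and the returned synthetic events used for augmentation or replacement in a downstream behavior prediction model.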
Related papers
- Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models [68.57424628540907]
Large language models (LLMs) often develop learned mechanisms specialized to specific datasets. We introduce a fine-tuning approach designed to enhance generalization by identifying and pruning neurons associated with dataset-specific mechanisms. Our method employs Integrated Gradients to quantify each neuron's influence on high-confidence predictions, pinpointing those that disproportionately contribute to dataset-specific performance.
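The summary above relies on Integrated Gradients, a standard attribution method. The paper applies it to neurons inside an LLM; the sketch below only illustrates the attribution rule itself on a toy two-variable function, using a numerical midpoint approximation of the path integral.

```python
# Numerical Integrated Gradients on a toy function f(x1, x2) = x1 * x2.
# attr_i = (x_i - b_i) * integral over alpha of df/dx_i(b + alpha*(x - b)).

def f(x):
    return x[0] * x[1]

def grad(fn, x, eps=1e-6):
    """Central-difference gradient of fn at x."""
    g = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        g.append((fn(hi) - fn(lo)) / (2 * eps))
    return g

def integrated_gradients(fn, x, baseline, steps=200):
    """Midpoint-rule approximation of the path integral of gradients."""
    avg = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(fn, point)
        for i in range(len(x)):
            avg[i] += g[i] / steps
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg)]

attrs = integrated_gradients(f, [3.0, 2.0], [0.0, 0.0])
```

A useful sanity check is the completeness property: the attributions should sum to f(x) - f(baseline), here 6.0.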
arXiv Detail & Related papers (2025-07-12T08:10:10Z)
- Self-Supervised Learning-Based Multimodal Prediction on Prosocial Behavior Intentions [6.782784535456252]
There are no large, labeled datasets available for prosocial behavior. Small-scale datasets make it difficult to train deep-learning models effectively. We propose a self-supervised learning approach that harnesses multi-modal data.
arXiv Detail & Related papers (2025-07-11T00:49:46Z)
- BehaveGPT: A Foundation Model for Large-scale User Behavior Modeling [14.342911841456663]
We propose BehaveGPT, a foundational model designed specifically for large-scale user behavior prediction. BehaveGPT is trained on vast user behavior datasets, allowing it to learn complex behavior patterns. Our approach introduces the DRO-based pretraining paradigm tailored for user behavior data, which improves model generalization and transferability.
arXiv Detail & Related papers (2025-05-23T08:43:46Z)
- Optimizing Sequential Recommendation Models with Scaling Laws and Approximate Entropy [104.48511402784763]
The Performance Law for SR models aims to theoretically investigate and model the relationship between model performance and data quality. We propose Approximate Entropy (ApEn) to assess data quality, presenting a more nuanced approach compared to traditional data-quantity metrics.
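Approximate Entropy is a well-defined regularity statistic (lower values indicate more regular, predictable sequences). The sketch below implements the standard definition; how the paper turns ApEn into a data-quality score for sequential recommendation is not shown.

```python
import math

# Approximate Entropy: ApEn(m, r) = Phi_m - Phi_{m+1}, where Phi_m averages
# the log-frequency of templates of length m matching within tolerance r
# (Chebyshev distance, self-matches included).

def approximate_entropy(series, m=2, r=0.2):
    def phi(m):
        n = len(series) - m + 1
        templates = [series[i:i + m] for i in range(n)]
        total = 0.0
        for a in templates:
            matches = sum(
                1 for b in templates
                if max(abs(x - y) for x, y in zip(a, b)) <= r
            )
            total += math.log(matches / n)
        return total / n
    return phi(m) - phi(m + 1)
```

A perfectly constant series yields ApEn of exactly 0, and a strictly alternating series stays close to 0, while irregular data scores higher; the tolerance r is conventionally set relative to the series' standard deviation.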
arXiv Detail & Related papers (2024-11-30T10:56:30Z)
- Uncertainty-aware Human Mobility Modeling and Anomaly Detection [24.22648449430148]
We formulate anomaly detection in human behavior modeling by treating raw GPS data as a sequence of stay-point events. We equip our proposed model, USTAD, with aleatoric uncertainty estimation. Experiments show that USTAD improves anomaly-detection AUCROC by 3%-15% over baselines on industry-scale data.
arXiv Detail & Related papers (2024-10-02T06:57:08Z)
- DataGen: Unified Synthetic Dataset Generation via Large Language Models [88.16197692794707]
DataGen is a comprehensive framework designed to produce diverse, accurate, and highly controllable datasets. To augment data diversity, DataGen incorporates an attribute-guided generation module and a group checking feature. Extensive experiments demonstrate the superior quality of data generated by DataGen.
arXiv Detail & Related papers (2024-06-27T07:56:44Z) - Synthesizing Multimodal Electronic Health Records via Predictive Diffusion Models [69.06149482021071]
We propose a novel EHR data generation model called EHRPD.
It is a diffusion-based model designed to predict the next visit based on the current one while also incorporating time interval estimation.
We conduct experiments on two public datasets and evaluate EHRPD from fidelity, privacy, and utility perspectives.
arXiv Detail & Related papers (2024-06-20T02:20:23Z)
- Scaling Laws Do Not Scale [54.72120385955072]
Recent work has argued that as the size of a dataset increases, the performance of a model trained on that dataset will increase.
We argue that this scaling law relationship depends on metrics used to measure performance that may not correspond with how different groups of people perceive the quality of models' output.
Different communities may also have values in tension with each other, leading to difficult, potentially irreconcilable choices about metrics used for model evaluations.
arXiv Detail & Related papers (2023-07-05T15:32:21Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
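The Deep Generative Ensemble approximates a posterior over generative-model parameters by training multiple generators. The toy sketch below keeps only the ensemble logic, swapping the deep generative model for a deliberately simple stand-in: each member fits a Gaussian to a bootstrap resample of the data, so ensemble spread reflects parameter uncertainty. This is an illustration of the idea, not the authors' implementation.

```python
import random
import statistics

def fit_generator(data, rng):
    """Toy 'generator': a Gaussian fitted to a bootstrap resample."""
    boot = [rng.choice(data) for _ in data]
    return statistics.mean(boot), statistics.stdev(boot)

def dge_synthetic_estimate(data, k=20, n_synth=500, seed=0):
    """Average a downstream statistic over k synthetic datasets, one per
    ensemble member; the spread of the estimates signals uncertainty."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(k):
        mu, sigma = fit_generator(data, rng)
        synth = [rng.gauss(mu, sigma) for _ in range(n_synth)]
        estimates.append(statistics.mean(synth))
    return statistics.mean(estimates), statistics.stdev(estimates)

data_rng = random.Random(1)
real = [data_rng.gauss(5.0, 1.0) for _ in range(200)]
est_mean, est_spread = dge_synthetic_estimate(real)
```

Averaging the downstream estimate over the ensemble reduces the risk of a single unlucky generator dominating the result, which is the core argument of the paper.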
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- A prediction and behavioural analysis of machine learning methods for modelling travel mode choice [0.26249027950824505]
We conduct a systematic comparison of different modelling approaches, across multiple modelling problems, in terms of the key factors likely to affect model choice.
Results indicate that the models with the highest disaggregate predictive performance provide poorer estimates of behavioural indicators and aggregate mode shares.
It is also observed that the MNL model performs robustly in a variety of situations, though ML techniques can improve the estimates of behavioural indices such as Willingness to Pay.
arXiv Detail & Related papers (2023-01-11T11:10:32Z)
- Incorporating Heterogeneous User Behaviors and Social Influences for Predictive Analysis [32.31161268928372]
We aim to incorporate heterogeneous user behaviors and social influences for behavior predictions.
This paper proposes a variant of Long Short-Term Memory (LSTM) that can consider context while modeling a behavior sequence.
A residual learning-based decoder is designed to automatically construct multiple high-order cross features based on social behavior representation.
arXiv Detail & Related papers (2022-07-24T17:05:37Z)
- Generating synthetic mobility data for a realistic population with RNNs to improve utility and privacy [3.3918638314432936]
We present a system for generating synthetic mobility data using a deep recurrent neural network (RNN).
The system takes a population distribution as input and generates mobility traces for a corresponding synthetic population.
We show the generated mobility data retain the characteristics of the real data, while varying from the real data at the individual level.
arXiv Detail & Related papers (2022-01-04T13:58:22Z)
- Learning Transferrable Parameters for Long-tailed Sequential User Behavior Modeling [70.64257515361972]
We argue that focusing on tail users could bring more benefits and address the long-tail issue.
Specifically, we propose a gradient alignment and adopt an adversarial training scheme to facilitate knowledge transfer from the head to the tail.
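The gradient alignment above encourages updates computed on head users to also benefit tail users. A minimal proxy for this idea, not the authors' method, is to measure the cosine similarity between the two groups' gradient vectors: a positive value suggests a shared update direction, a negative value suggests conflicting updates.

```python
import math

def cosine_alignment(grad_head, grad_tail):
    """Cosine similarity between head-user and tail-user gradients;
    +1 means fully aligned updates, -1 means directly conflicting ones."""
    dot = sum(h * t for h, t in zip(grad_head, grad_tail))
    norm = (math.sqrt(sum(h * h for h in grad_head))
            * math.sqrt(sum(t * t for t in grad_tail)))
    return dot / norm if norm else 0.0
```

In an alignment-based training scheme, such a score could gate or reweight updates so that head-user gradients are only applied when they do not harm tail users.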
arXiv Detail & Related papers (2020-10-22T03:12:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.