An Interpretability-Guided Framework for Responsible Synthetic Data Generation in Emotional Text
- URL: http://arxiv.org/abs/2511.16132v1
- Date: Thu, 20 Nov 2025 08:07:05 GMT
- Title: An Interpretability-Guided Framework for Responsible Synthetic Data Generation in Emotional Text
- Authors: Paula Joy B. Martinez, Jose Marie Antonio Miñoza, Sebastian C. Ibañez
- Abstract summary: Emotion recognition from social media is critical for understanding public sentiment. Accessing training data has become prohibitively expensive due to escalating API costs and platform restrictions. We introduce an interpretability-guided framework for synthetic data generation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Emotion recognition from social media is critical for understanding public sentiment, but accessing training data has become prohibitively expensive due to escalating API costs and platform restrictions. We introduce an interpretability-guided framework where Shapley Additive Explanations (SHAP) provide principled guidance for LLM-based synthetic data generation. With sufficient seed data, the SHAP-guided approach matches real-data performance, significantly outperforms naïve generation, and substantially improves classification for underrepresented emotion classes. However, our linguistic analysis reveals that synthetic text exhibits reduced vocabulary richness and fewer personal or temporally complex expressions than authentic posts. This work provides both a practical framework for responsible synthetic data generation and a critical perspective on its limitations, underscoring that the future of trustworthy AI depends on navigating the trade-offs between synthetic utility and real-world authenticity.
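The abstract reports that synthetic text shows reduced vocabulary richness compared with authentic posts, but does not say here which metric was used. As a minimal sketch, assuming a type-token ratio (unique tokens divided by total tokens) as the richness measure, the comparison could look like the following. The function name `type_token_ratio`, the tokenizer regex, and both toy corpora are hypothetical and are not taken from the paper.

```python
import re


def type_token_ratio(texts):
    """Return unique-token count divided by total-token count over all texts."""
    tokens = []
    for text in texts:
        # Simple assumed tokenizer: lowercase runs of letters and apostrophes.
        tokens.extend(re.findall(r"[a-z']+", text.lower()))
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical toy corpora, invented for illustration only.
authentic_posts = [
    "I can't believe my flight got cancelled again last night",
    "so proud of my sister, she finally graduated today",
]
synthetic_posts = [
    "I am so happy today",
    "I am so sad today",
]

authentic_ttr = type_token_ratio(authentic_posts)
synthetic_ttr = type_token_ratio(synthetic_posts)
print(f"authentic TTR: {authentic_ttr:.2f}")
print(f"synthetic TTR: {synthetic_ttr:.2f}")
```

Type-token ratio falls as a corpus grows, so a fair comparison would use samples of similar size; the toy corpora here only show the mechanics.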
Related papers
- Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content [1.215922138351105]
Using Large Language Models (LLMs) as proxies for human participants in social science research presents a promising, yet methodologically risky, paradigm shift.
This paper addresses these limitations by introducing a novel, history-conditioned reply prediction task on authentic X (formerly Twitter) data.
We analyze these discrepancies using stylistic and content-based metrics, providing a quantitative framework for researchers to assess the quality and authenticity of synthetic data.
arXiv Detail & Related papers (2026-02-22T13:14:27Z)
- Understanding the Influence of Synthetic Data for Text Embedders [52.04771455432998]
We first reproduce and publicly release the synthetic data proposed by Wang et al.
We critically examine where exactly synthetic data improves model generalization.
Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders.
arXiv Detail & Related papers (2025-09-07T19:28:52Z)
- The Synthetic Mirror -- Synthetic Data at the Age of Agentic AI [0.0]
Synthetic data is artificially generated data that intelligently mimics or supplements real-world data.
This paper explores the implications for privacy and policymaking stemming from synthetic data generation.
arXiv Detail & Related papers (2025-06-15T02:10:02Z)
- Synthetic Data Generation Using Large Language Models: Advances in Text and Code [0.0]
Large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains.
We highlight key techniques such as prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement.
We discuss the accompanying challenges, including factual inaccuracies in generated text, insufficient stylistic or distributional realism, and risks of bias amplification.
arXiv Detail & Related papers (2025-03-18T08:34:03Z)
- Understanding Synthetic Context Extension via Retrieval Heads [51.8869530817334]
We investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning.
We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be interpreted.
Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
arXiv Detail & Related papers (2024-10-29T17:55:00Z)
- Scalable Frame-based Construction of Sociocultural NormBases for Socially-Aware Dialogues [66.69453609603875]
Sociocultural norms serve as guiding principles for personal conduct in social interactions.
We propose a scalable approach for constructing a Sociocultural Norm (SCN) Base using Large Language Models (LLMs).
We construct a comprehensive and publicly accessible Chinese Sociocultural NormBase.
arXiv Detail & Related papers (2024-10-04T00:08:46Z)
- Towards Realistic Synthetic User-Generated Content: A Scaffolding Approach to Generating Online Discussions [17.96479268328824]
We investigate the feasibility of creating realistic, large-scale synthetic datasets of user-generated content.
We propose a multi-step generation process, predicated on the idea of creating compact representations of discussion threads.
arXiv Detail & Related papers (2024-08-15T18:43:50Z)
- Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
- Counterfactual Explanations as Interventions in Latent Space [62.997667081978825]
Counterfactual explanations aim to provide end users with a set of features that need to be changed in order to achieve a desired outcome.
Current approaches rarely take into account the feasibility of actions needed to achieve the proposed explanations.
We present Counterfactual Explanations as Interventions in Latent Space (CEILS), a methodology to generate counterfactual explanations.
arXiv Detail & Related papers (2021-06-14T20:48:48Z)
- Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.