Related papers: Towards High-Fidelity Synthetic Multi-platform Social Media Datasets via Large Language Models

Towards High-Fidelity Synthetic Multi-platform Social Media Datasets via Large Language Models

URL: http://arxiv.org/abs/2505.02858v1
Date: Fri, 02 May 2025 18:56:01 GMT
Title: Towards High-Fidelity Synthetic Multi-platform Social Media Datasets via Large Language Models
Authors: Henry Tari, Nojus Sereiva, Rishabh Kaushal, Thales Bertaglia, Adriana Iamnitchi,
Abstract summary: Social media datasets are essential for research on a variety of topics, such as disinformation, influence operations, hate speech detection, or influencer marketing practices.<n>Access to social media datasets is often constrained due to costs and platform restrictions.<n>This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Social media datasets are essential for research on a variety of topics, such as disinformation, influence operations, hate speech detection, or influencer marketing practices. However, access to social media datasets is often constrained due to costs and platform restrictions. Acquiring datasets that span multiple platforms, which is crucial for understanding the digital ecosystem, is particularly challenging. This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms, aiming to match the quality of real data. We propose multi-platform topic-based prompting and employ various language models to generate synthetic data from two real datasets, each consisting of posts from three different social media platforms. We assess the lexical and semantic properties of the synthetic data and compare them with those of the real data. Our empirical findings show that using large language models to generate synthetic multi-platform social media data is promising, different language models perform differently in terms of fidelity, and a post-processing approach might be needed for generating high-fidelity synthetic datasets for research. In addition to the empirical evaluation of three state of the art large language models, our contributions include new fidelity metrics specific to multi-platform social media datasets.

Related papers

Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision [50.23246260804145]
We introduce textbfNexus-O, an industry-level textbfomni-perceptive and -interactive model capable of efficiently processing Audio, Image, Video, and Text data.<n>We address three key research questions: First, how can models be efficiently designed and trained to achieve tri-modal alignment, understanding and reasoning capabilities across multiple modalities?<n>Second, what approaches can be implemented to evaluate tri-modal model robustness, ensuring reliable performance and applicability in real-world scenarios?<n>Third, what strategies can be employed to curate and obtain high-quality, real-life scenario
arXiv Detail & Related papers (2025-02-26T17:26:36Z)
mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data [71.352883755806]
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space.<n>However, the limited labeled multimodal data often hinders embedding performance.<n>Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck.
arXiv Detail & Related papers (2025-02-12T15:03:33Z)
RedPajama: an Open Dataset for Training Large Language Models [80.74772646989423]
We identify three core data-related challenges that must be addressed to advance open-source language models. These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis. We release RedPajama-V1, an open reproduction of the LLaMA training dataset, and RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata.
arXiv Detail & Related papers (2024-11-19T09:35:28Z)
Enhancing Data Quality through Simple De-duplication: Navigating Responsible Computational Social Science Research [31.993279516471283]
We conduct an in-depth examination of 20 datasets extensively used in NLP for Computational Social Science. Our analysis reveals that social media datasets exhibit varying levels of data duplication. Our findings suggest that data duplication has an impact on the current claims of state-of-the-art performance.
arXiv Detail & Related papers (2024-10-04T15:58:15Z)
Towards Realistic Synthetic User-Generated Content: A Scaffolding Approach to Generating Online Discussions [17.96479268328824]
We investigate the feasibility of creating realistic, large-scale synthetic datasets of user-generated content. We propose a multi-step generation process, predicated on the idea of creating compact representations of discussion threads.
arXiv Detail & Related papers (2024-08-15T18:43:50Z)
Leveraging GPT for the Generation of Multi-Platform Social Media Datasets for Research [0.0]
Social media datasets are essential for research on disinformation, influence operations, social sensing, hate speech detection, cyberbullying, and other significant topics. Access to these datasets is often restricted due to costs and platform regulations. This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms.
arXiv Detail & Related papers (2024-07-11T09:12:39Z)
Language and Multimodal Models in Sports: A Survey of Datasets and Applications [20.99857526324661]
Recent integration of Natural Language Processing (NLP) and multimodal models has advanced the field of sports analytics. This survey presents a comprehensive review of the datasets and applications driving these innovations post-2020.
arXiv Detail & Related papers (2024-06-18T03:59:26Z)
Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets. Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets. We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
Panning for gold: Lessons learned from the platform-agnostic automated detection of political content in textual data [48.7576911714538]
We discuss how these techniques can be used to detect political content across different platforms. We compare the performance of three groups of detection techniques relying on dictionaries, supervised machine learning, or neural networks. Our results show the limited impact of preprocessing on model performance, with the best results for less noisy data being achieved by neural network- and machine-learning-based models.
arXiv Detail & Related papers (2022-07-01T15:23:23Z)
I Know Where You Are Coming From: On the Impact of Social Media Sources on AI Model Performance [79.05613148641018]
We will study the performance of different machine learning models when being learned on multi-modal data from different social networks. Our initial experimental results reveal that social network choice impacts the performance.
arXiv Detail & Related papers (2020-02-05T11:10:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.