Enhancing Data Quality through Simple De-duplication: Navigating Responsible Computational Social Science Research
- URL: http://arxiv.org/abs/2410.03545v1
- Date: Fri, 4 Oct 2024 15:58:15 GMT
- Title: Enhancing Data Quality through Simple De-duplication: Navigating Responsible Computational Social Science Research
- Authors: Yida Mu, Mali Jin, Xingyi Song, Nikolaos Aletras
- Abstract summary: We conduct an in-depth examination of 20 datasets extensively used in NLP for Computational Social Science.
Our analysis reveals that social media datasets exhibit varying levels of data duplication.
Our findings suggest that data duplication affects current claims of state-of-the-art performance, potentially overestimating model effectiveness.
- Score: 31.993279516471283
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Research in natural language processing (NLP) for Computational Social Science (CSS) heavily relies on data from social media platforms. This data plays a crucial role in the development of models for analysing socio-linguistic phenomena within online communities. In this work, we conduct an in-depth examination of 20 datasets extensively used in NLP for CSS to assess their data quality comprehensively. Our analysis reveals that social media datasets exhibit varying levels of data duplication. Consequently, this gives rise to challenges like label inconsistencies and data leakage, compromising the reliability of models. Our findings also suggest that data duplication affects current claims of state-of-the-art performance, potentially leading to an overestimation of model effectiveness in real-world scenarios. Finally, we propose new protocols and best practices for developing and using datasets built from social media data.
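The simple de-duplication protocol the paper advocates is easy to prototype. Below is a minimal sketch (the `text` and `label` column names, the normalization rules, and the hash choice are illustrative assumptions, not the authors' released code) that normalizes posts, flags label inconsistencies among duplicates, drops exact duplicates, and measures train/test leakage:
```python
import hashlib
import re
import pandas as pd

def normalize(text: str) -> str:
    """Lowercase, strip URLs/mentions, and collapse whitespace so that
    trivially different copies of the same post hash identically."""
    text = re.sub(r"https?://\S+|@\w+", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def dedup_key(text: str) -> str:
    return hashlib.md5(normalize(text).encode("utf-8")).hexdigest()

def deduplicate(df: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:
    """Keep one copy per normalized text; warn about duplicate groups
    whose copies carry inconsistent labels (assumes a 'label' column)."""
    df = df.assign(_key=df[text_col].map(dedup_key))
    conflicts = df.groupby("_key")["label"].nunique()
    n_conflicts = int((conflicts > 1).sum())
    if n_conflicts:
        print(f"{n_conflicts} duplicate groups carry inconsistent labels")
    return df.drop_duplicates("_key").drop(columns="_key")

def leakage_rate(train: pd.DataFrame, test: pd.DataFrame,
                 text_col: str = "text") -> float:
    """Fraction of test posts whose normalized form also appears in train."""
    train_keys = set(train[text_col].map(dedup_key))
    return test[text_col].map(dedup_key).isin(train_keys).mean()
```
Reporting the leakage rate alongside benchmark results makes it easier to judge whether state-of-the-art claims are inflated by duplicated test items.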
Related papers
- A Survey on Data Synthesis and Augmentation for Large Language Models [35.59526251210408]
This paper reviews and summarizes data generation techniques throughout the lifecycle of Large Language Models.
We discuss the current constraints faced by these methods and investigate potential pathways for future development and research.
arXiv Detail & Related papers (2024-10-16T16:12:39Z) - Visual Data Diagnosis and Debiasing with Concept Graphs [50.84781894621378]
Deep learning models often pick up inherent biases in the data during the training process, leading to unreliable predictions.
We present CONBIAS, a novel framework for diagnosing and mitigating Concept co-occurrence Biases in visual datasets.
We show that by employing a novel clique-based concept balancing strategy, we can mitigate these imbalances, leading to enhanced performance on downstream tasks.
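CONBIAS's full pipeline is not reproduced here; the sketch below only illustrates the underlying idea of a concept co-occurrence graph whose cliques expose over- or under-represented concept combinations (the toy annotations and all names are hypothetical):
```python
from collections import Counter
from itertools import combinations
import networkx as nx

# Hypothetical per-image concept annotations.
images = [
    {"dog", "grass", "ball"},
    {"dog", "grass"},
    {"cat", "sofa"},
    {"dog", "ball"},
]

# Concept co-occurrence graph: edge weight = joint frequency.
pair_counts = Counter(
    pair for concepts in images for pair in combinations(sorted(concepts), 2)
)
G = nx.Graph()
for (a, b), w in pair_counts.items():
    G.add_edge(a, b, weight=w)

# Maximal cliques are concept combinations; count how often each full
# combination actually co-occurs to spot imbalances.
for clique in nx.find_cliques(G):
    support = sum(1 for concepts in images if set(clique) <= concepts)
    print(sorted(clique), "co-occurs in", support, "images")
```
A balancing strategy would then resample images so that clique supports become more uniform.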
arXiv Detail & Related papers (2024-09-26T16:59:01Z) - Leveraging GPT for the Generation of Multi-Platform Social Media Datasets for Research [0.0]
Social media datasets are essential for research on disinformation, influence operations, social sensing, hate speech detection, cyberbullying, and other significant topics.
Access to these datasets is often restricted due to costs and platform regulations.
This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms.
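As a rough illustration of this kind of pipeline, the sketch below prompts an LLM for platform-styled posts via the OpenAI Python client (the model name, prompt, and helper function are placeholder assumptions; the paper's actual prompting setup may differ):
```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Write {n} short {platform} posts discussing {topic}. "
    "Match the platform's typical tone, length, and hashtag usage. "
    "Return one post per line."
)

def synthesize_posts(platform: str, topic: str, n: int = 5) -> list[str]:
    """Generate platform-styled posts; outputs still need auditing for
    quality and bias before any research use."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user",
                   "content": PROMPT.format(n=n, platform=platform, topic=topic)}],
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]

posts = synthesize_posts("Twitter/X", "local election turnout")
```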
arXiv Detail & Related papers (2024-07-11T09:12:39Z) - Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs)
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - Curating Grounded Synthetic Data with Global Perspectives for Equitable AI [0.5120567378386615]
We introduce a novel approach to creating synthetic datasets, grounded in real-world diversity and enriched through strategic diversification.
We synthesize data using a comprehensive collection of news articles spanning 12 languages and originating from 125 countries, to ensure a breadth of linguistic and cultural representations.
Preliminary results demonstrate substantial performance improvements of up to 7.3% on traditional NER benchmarks.
arXiv Detail & Related papers (2024-06-10T17:59:11Z) - A Comparative Study on Enhancing Prediction in Social Network Advertisement through Data Augmentation [0.6707149143800017]
This study presents and explores a generative augmentation framework for social network advertising data.
Our framework explores three generative models for data augmentation: Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Gaussian Mixture Models (GMMs).
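Of the three generators, the GMM variant is the simplest to sketch. Below, a mixture is fitted to a stand-in numeric feature matrix and sampled for synthetic rows (the data and hyperparameters are illustrative, not the study's configuration):
```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))  # stand-in for numeric ad-click features

# Fit a mixture to the real rows, then sample synthetic ones.
gmm = GaussianMixture(n_components=5, random_state=0).fit(X)
X_synth, _ = gmm.sample(n_samples=200)

X_augmented = np.vstack([X, X_synth])
print(X_augmented.shape)  # (700, 4)
```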
arXiv Detail & Related papers (2024-04-22T01:16:11Z) - Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative for training machine learning models.
However, ensuring that synthetic data mirrors the complex nuances of real-world data remains challenging.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - Does Synthetic Data Make Large Language Models More Efficient? [0.0]
This paper explores the nuances of synthetic data generation in NLP.
We highlight its advantages, including data augmentation potential and the introduction of structured variety.
We demonstrate the impact of template-based synthetic data on the performance of modern transformer models.
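Template-based generation in this sense can be as simple as slot-filling. The sketch below (templates, slot values, and labels are invented for illustration) produces labeled sentiment examples:
```python
import random

TEMPLATES = [
    ("The {product} was absolutely {positive_adj}.", "positive"),
    ("I regret buying the {product}; it felt {negative_adj}.", "negative"),
]
SLOTS = {
    "product": ["phone", "laptop", "headset"],
    "positive_adj": ["fantastic", "reliable"],
    "negative_adj": ["flimsy", "disappointing"],
}

def generate(n: int = 10, seed: int = 0) -> list[tuple[str, str]]:
    """Sample n (text, label) pairs by filling template slots."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        template, label = rng.choice(TEMPLATES)
        filled = template.format(**{k: rng.choice(v) for k, v in SLOTS.items()})
        data.append((filled, label))
    return data

for text, label in generate(4):
    print(label, "|", text)
```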
arXiv Detail & Related papers (2023-10-11T19:16:09Z) - CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
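CAFE's full objective is more involved; the sketch below only illustrates the feature-alignment idea, matching per-layer mean features of real and synthetic batches at multiple scales (the encoder and loss are simplified assumptions, not the paper's architecture):
```python
import torch
import torch.nn as nn

# Tiny encoder whose intermediate activations serve as multi-scale features.
class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU()),
        ])

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return feats

def alignment_loss(real, synth, encoder):
    """MSE between per-layer mean features of real and synthetic batches."""
    loss = 0.0
    for f_real, f_synth in zip(encoder(real), encoder(synth)):
        loss = loss + ((f_real.mean(dim=0) - f_synth.mean(dim=0)) ** 2).mean()
    return loss

encoder = Encoder()
real = torch.randn(32, 3, 32, 32)
synth = torch.randn(8, 3, 32, 32, requires_grad=True)  # condensed images being learned
loss = alignment_loss(real, synth, encoder)
loss.backward()  # gradients flow into the synthetic images
```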
arXiv Detail & Related papers (2022-03-03T05:58:49Z) - Quasi-Global Momentum: Accelerating Decentralized Deep Learning on Heterogeneous Data [77.88594632644347]
Decentralized training of deep learning models is a key element for enabling data privacy and on-device learning over networks.
In realistic learning scenarios, the presence of heterogeneity across different clients' local datasets poses an optimization challenge.
We propose a novel momentum-based method to mitigate this decentralized training difficulty.
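The proposed method itself is not reproduced here; the numpy sketch below only illustrates the setting it targets: workers with heterogeneous local objectives take momentum SGD steps and then gossip-average parameters over a ring (topology, step sizes, and the quadratic objectives are all illustrative):
```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, lr, beta = 4, 5, 0.1, 0.9

# Heterogeneous local objectives: worker i minimizes ||x - target_i||^2.
targets = rng.normal(size=(n_workers, dim))
x = np.zeros((n_workers, dim))   # local parameters
m = np.zeros((n_workers, dim))   # local momentum buffers

def gossip(x):
    """Ring topology: each worker averages with its two neighbors."""
    return (x + np.roll(x, 1, axis=0) + np.roll(x, -1, axis=0)) / 3

for _ in range(200):
    grads = 2 * (x - targets)    # local gradients
    m = beta * m + grads         # heavy-ball momentum
    x = x - lr * m               # local SGD step
    x = gossip(x)                # decentralized averaging

print("consensus disagreement:", np.abs(x - x.mean(axis=0)).max())
print("distance to global optimum:", np.abs(x.mean(axis=0) - targets.mean(axis=0)).max())
```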
arXiv Detail & Related papers (2021-02-09T11:27:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.