Generative Deduplication For Socia Media Data Selection
- URL: http://arxiv.org/abs/2401.05883v3
- Date: Thu, 03 Oct 2024 03:34:34 GMT
- Title: Generative Deduplication For Socia Media Data Selection
- Authors: Xianming Li, Jing Li,
- Abstract summary: We propose a novel Generative Deduplication framework for social media data selection.
Our model acts as an efficient pre-processing method to universally enhance social media NLP pipelines.
- Score: 4.545354973721937
- License:
- Abstract: Social media data exhibits severe redundancy caused by its noisy nature. It leads to increased training time and model bias in its processing. To address this issue, we propose a novel Generative Deduplication framework for social media data selection by removing semantically duplicate data. While related work involves data selection in task-specific training, our model acts as an efficient pre-processing method to universally enhance social media NLP pipelines. Specifically, we train a generative model via self-supervised learning to predict a keyword to capture the semantics of noisy social media text for deduplication. Meanwhile, time-dimensional Gaussian noise is added to improve training complexity and avoid learning trivial features. Extensive experiments suggest that our model can better reduce training samples while improving performance than baselines. The results show our model's potential to broadly advance social media language understanding in effectiveness and efficiency.
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - High-Dimensional Distributed Sparse Classification with Scalable Communication-Efficient Global Updates [50.406127962933915]
We develop solutions to problems which enable us to learn a communication-efficient distributed logistic regression model.
In our experiments we demonstrate a large improvement in accuracy over distributed algorithms with only a few distributed update steps needed.
arXiv Detail & Related papers (2024-07-08T19:34:39Z) - Noisy Self-Training with Synthetic Queries for Dense Retrieval [49.49928764695172]
We introduce a novel noisy self-training framework combined with synthetic queries.
Experimental results show that our method improves consistently over existing methods.
Our method is data efficient and outperforms competitive baselines.
arXiv Detail & Related papers (2023-11-27T06:19:50Z) - Contrastive Transformer Learning with Proximity Data Generation for
Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z) - Building Resilience to Out-of-Distribution Visual Data via Input
Optimization and Model Finetuning [13.804184845195296]
We propose a preprocessing model that learns to optimise input data for a specific target vision model.
We investigate several out-of-distribution scenarios in the context of semantic segmentation for autonomous vehicles.
We demonstrate that our approach can enable performance on such data comparable to that of a finetuned model.
arXiv Detail & Related papers (2022-11-29T14:06:35Z) - Leveraging Key Information Modeling to Improve Less-Data Constrained
News Headline Generation via Duality Fine-Tuning [12.443476695459553]
We propose a novel duality fine-tuning method by formally defining the probabilistic duality constraints between key information prediction and headline generation tasks.
The proposed method can capture more information from limited data, build connections between separate tasks, and is suitable for less-data constrained generation tasks.
We conduct extensive experiments to demonstrate that our method is effective and efficient to achieve improved performance in terms of language modeling metric and informativeness correctness metric on two public datasets.
arXiv Detail & Related papers (2022-10-10T07:59:36Z) - Federated Pruning: Improving Neural Network Efficiency with Federated
Learning [24.36174705715827]
We propose Federated Pruning to train a reduced model under the federated setting.
We explore different pruning schemes and provide empirical evidence of the effectiveness of our methods.
arXiv Detail & Related papers (2022-09-14T00:48:37Z) - Curriculum-Based Self-Training Makes Better Few-Shot Learners for
Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation.
Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
arXiv Detail & Related papers (2022-06-06T16:11:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.