Noise-Robust De-Duplication at Scale
- URL: http://arxiv.org/abs/2210.04261v2
- Date: Wed, 24 Apr 2024 12:57:24 GMT
- Title: Noise-Robust De-Duplication at Scale
- Authors: Emily Silcock, Luca D'Amico-Wong, Jinglin Yang, Melissa Dell
- Abstract summary: This study uses the unique timeliness of historical news wires to create a 27,210 document dataset.
We develop and evaluate a range of de-duplication methods, including hashing and N-gram overlap.
We show that the bi-encoder scales well, de-duplicating a 10 million article corpus on a single GPU card in a matter of hours.
- Score: 4.499833362998488
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Identifying near duplicates within large, noisy text corpora has a myriad of applications that range from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage, to identifying reproduced news articles and literature within large corpora. Across these diverse applications, the overwhelming majority of work relies on N-grams. Limited efforts have been made to evaluate how well N-gram methods perform, in part because it is unclear how one could create an unbiased evaluation dataset for a massive corpus. This study uses the unique timeliness of historical news wires to create a 27,210 document dataset, with 122,876 positive duplicate pairs, for studying noise-robust de-duplication. The time-sensitivity of news makes comprehensive hand labelling feasible - despite the massive overall size of the corpus - as duplicates occur within a narrow date range. The study then develops and evaluates a range of de-duplication methods: hashing and N-gram overlap (which predominate in the literature), a contrastively trained bi-encoder, and a re-rank style approach combining a bi- and cross-encoder. The neural approaches significantly outperform hashing and N-gram overlap. We show that the bi-encoder scales well, de-duplicating a 10 million article corpus on a single GPU card in a matter of hours. We also apply our pre-trained model to the RealNews and patent portions of C4 (Colossal Clean Crawled Corpus), illustrating that a neural approach can identify many near duplicates missed by hashing, in the presence of various types of noise. The public release of our NEWS-COPY de-duplication dataset, codebase, and the pre-trained models will facilitate further research and applications.
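The N-gram overlap baseline named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the character 5-gram size, the 0.5 Jaccard threshold, the function names, and the toy documents are all assumptions chosen for illustration, and the all-pairs comparison would not scale to a large corpus.

```python
from itertools import combinations


def char_ngrams(text, n=5):
    """Return the set of character n-grams of a lowercased, whitespace-normalised text."""
    normalised = " ".join(text.lower().split())
    if len(normalised) < n:
        return {normalised} if normalised else set()
    return {normalised[i:i + n] for i in range(len(normalised) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs, n=5, threshold=0.5):
    """Return index pairs (i, j), i < j, whose n-gram Jaccard similarity meets the threshold.

    Brute-force comparison of every pair: fine for a toy example only.
    """
    grams = [char_ngrams(d, n) for d in docs]
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if jaccard(grams[i], grams[j]) >= threshold
    ]


# Hypothetical toy corpus: the second document simulates OCR noise ("farm" -> "farrn").
docs = [
    "The senate passed the farm bill on Tuesday after a long debate.",
    "The senate passed the farrn bill on Tuesday after a long debate.",
    "Local team wins the championship game in overtime.",
]
pairs = near_duplicate_pairs(docs)
print(pairs)
```

With a single character-level error the two wire copies still share most of their 5-grams, so they are paired. Heavier noise or abridgement lowers the overlap quickly, which is the failure mode the abstract attributes to N-gram methods.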
Related papers
- Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
- Learning Confident Classifiers in the Presence of Label Noise [5.829762367794509]
This paper proposes a probabilistic model for noisy observations that allows us to build confident classification and segmentation models.
Our experiments show that our algorithm outperforms state-of-the-art solutions for the considered classification and segmentation problems.
arXiv Detail & Related papers (2023-01-02T04:27:25Z)
- A Few Shot Multi-Representation Approach for N-gram Spotting in Historical Manuscripts [1.2930503923129213]
We propose a few-shot learning paradigm for spotting sequences of a few characters (N-gram).
We show that recognition of important n-grams could reduce the system's dependency on vocabulary.
arXiv Detail & Related papers (2022-09-21T15:35:02Z)
- Learning with Neighbor Consistency for Noisy Labels [69.83857578836769]
We present a method for learning from noisy labels that leverages similarities between training examples in feature space.
We evaluate our method on datasets with both synthetic (CIFAR-10, CIFAR-100) and realistic (mini-WebVision, Clothing1M, mini-ImageNet-Red) noise.
arXiv Detail & Related papers (2022-02-04T15:46:27Z)
- On the Importance of Encrypting Deep Features [15.340540198612823]
We analyze model inversion attacks with only two assumptions: feature vectors of user data are known, and a black-box API for inference is provided.
Experiments have been conducted on state-of-the-art models in person re-identification, and two attack scenarios (i.e., recognizing auxiliary attributes and reconstructing user data) are investigated.
Results show that an adversary could successfully infer sensitive information even under severe constraints.
arXiv Detail & Related papers (2021-08-16T15:22:33Z)
- Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
- Tackling Instance-Dependent Label Noise via a Universal Probabilistic Model [80.91927573604438]
This paper proposes a simple yet universal probabilistic model, which explicitly relates noisy labels to their instances.
Experiments on datasets with both synthetic and real-world label noise verify that the proposed method yields significant improvements on robustness.
arXiv Detail & Related papers (2021-01-14T05:43:51Z)
- CIMON: Towards High-quality Hash Codes [63.37321228830102]
We propose a new method named Comprehensive sImilarity Mining and cOnsistency learNing (CIMON).
First, we use global refinement and similarity statistical distribution to obtain reliable and smooth guidance. Second, both semantic and contrastive consistency learning are introduced to derive both disturb-invariant and discriminative hash codes.
arXiv Detail & Related papers (2020-10-15T14:47:14Z)
- Temporal Calibrated Regularization for Robust Noisy Label Learning [60.90967240168525]
Deep neural networks (DNNs) exhibit great success on many tasks with the help of large-scale well annotated datasets.
However, labeling large-scale data can be very costly and error-prone so that it is difficult to guarantee the annotation quality.
We propose a Temporal Calibrated Regularization (TCR) in which we utilize the original labels and the predictions in the previous epoch together.
arXiv Detail & Related papers (2020-07-01T04:48:49Z)
- A generic ensemble based deep convolutional neural network for semi-supervised medical image segmentation [7.141405427125369]
We propose a generic semi-supervised learning framework for image segmentation based on a deep convolutional neural network (DCNN)
Our method is able to significantly improve beyond fully supervised model learning by incorporating unlabeled data.
arXiv Detail & Related papers (2020-04-16T23:41:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.