Related papers: FED: Fast and Efficient Dataset Deduplication Framework with GPU Acceleration

FED: Fast and Efficient Dataset Deduplication Framework with GPU Acceleration

URL: http://arxiv.org/abs/2501.01046v3
Date: Wed, 12 Mar 2025 13:36:32 GMT
Title: FED: Fast and Efficient Dataset Deduplication Framework with GPU Acceleration
Authors: Youngjun Son, Chaewon Kim, Jaejin Lee
Abstract summary: Recently, NVIDIA introduced a GPU-based MinHash LSH deduplication method, but it remains suboptimal. This paper proposes a GPU-accelerated deduplication framework, FED, that optimizes MinHash LSH for GPU clusters. In large-scale experiments, the deduplication of 1.2 trillion tokens is completed in just 6 hours in a four-node, 16-GPU environment.
Score: 4.499466939042501
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Dataset deduplication plays a crucial role in enhancing data quality, ultimately improving the training performance and efficiency of large language models. A commonly used method for data deduplication is the MinHash LSH algorithm. Recently, NVIDIA introduced a GPU-based MinHash LSH deduplication method, but it remains suboptimal, leaving room for further improvement in processing efficiency. This paper proposes a GPU-accelerated deduplication framework, FED, that optimizes MinHash LSH for GPU clusters and leverages computationally efficient, partially reusable non-cryptographic hash functions. FED significantly outperforms the CPU-based deduplication tool in SlimPajama (using 64 logical CPU cores) by up to 107.2 times and the GPU-based tool in NVIDIA NeMo Curator by up to 6.3 times when processing 30 million documents on a node with four GPUs. Notably, our method dramatically accelerates the previously time-consuming MinHash signature generation phase, achieving speed-ups of up to 260 times compared to the CPU baseline. Despite these gains in efficiency, FED maintains high deduplication quality, with the duplicate document sets reaching a Jaccard similarity of over 0.96 compared to those identified by the standard MinHash algorithm. In large-scale experiments, the deduplication of 1.2 trillion tokens is completed in just 6 hours in a four-node, 16-GPU environment. The related code is publicly available on GitHub (https://github.com/mcrl/FED).
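For readers unfamiliar with the MinHash LSH algorithm the abstract refers to, the following is a minimal CPU-only sketch of the textbook technique in Python. It is not FED's GPU implementation and does not reproduce its partially reusable hash functions. The helper names (`shingles`, `minhash_signature`, `lsh_candidates`), the character 5-gram shingling, the CRC32 base hash, and the 128-hash, 32-band layout are all assumptions made for illustration.

```python
import random
import zlib

_PRIME = (1 << 61) - 1  # large Mersenne prime for the universal hash family


def shingles(text, k=5):
    """Return the set of character k-grams of a whitespace-normalized text."""
    text = " ".join(text.split())
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def make_hash_params(num_perm, seed=0):
    """Draw (a, b) pairs defining num_perm hash functions h(x) = (a*x + b) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
            for _ in range(num_perm)]


def minhash_signature(shingle_set, params):
    """Signature = per-hash-function minimum over all shingles."""
    # CRC32 is a deterministic non-cryptographic base hash (illustrative choice).
    base = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * h + b) % _PRIME for h in base) for a, b in params]


def lsh_candidates(signatures, bands, rows):
    """Split each signature into bands; documents sharing any band are candidates."""
    buckets = {}
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


if __name__ == "__main__":
    docs = {
        "a": "Dataset deduplication improves data quality for language models.",
        "b": "Kernel methods provide an elegant approach to nonparametric learning.",
        "c": "Dataset deduplication improves data quality for language models.",
    }
    params = make_hash_params(num_perm=128)
    sigs = {d: minhash_signature(shingles(t), params) for d, t in docs.items()}
    print(lsh_candidates(sigs, bands=32, rows=4))
```

The fraction of positions at which two signatures agree estimates the Jaccard similarity of the underlying shingle sets, and the band layout trades recall against the number of candidate pairs that must be verified. Signature generation, the per-document minimum over many hash functions, is the phase the abstract describes as previously time-consuming.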

Related papers

GPUTOK: GPU Accelerated Byte Level BPE Tokenization [0.0]
We build a GPU-based byte-level BPE tokenizer that follows GPT-2's merge rules. It includes a basic BlockBPE-style kernel and a faster, optimized version that uses cuCollections static map, CUB reductions, and a pybind11 interface for Python. On WikiText103 sequences up to 131k tokens, the optimized tokenizer produces the same longest inputs, is about 1.7x faster than tiktoken and about 7.6x faster than the HuggingFace GPT-2 tokenizer.
arXiv Detail & Related papers (2026-03-03T04:48:28Z)
A Scalable Multi-GPU Framework for Encrypted Large-Model Inference [5.966282323502589]
Encrypted AI using fully homomorphic encryption (FHE) provides strong privacy guarantees. Recent works proposed ASICs to accelerate FHE, but require expensive advanced manufacturing processes that constrain their computation. This paper presents Cerium, a multi-GPU framework for FHE inference on large models.
arXiv Detail & Related papers (2025-12-12T04:15:38Z)
Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval [67.21678698740267]
We introduce Spotlight Attention, a novel method that employs non-linear hashing functions to optimize the embedding distribution of queries and keys. We also develop a lightweight, stable training framework using a Bradley-Terry ranking-based loss.
arXiv Detail & Related papers (2025-08-27T10:11:27Z)
GPU-Accelerated Interpretable Generalization for Rapid Cyberattack Detection and Forensics [0.0]
The IG mechanism, recently published in IEEE Transactions on Information Forensics and Security, delivers state-of-the-art, evidence-based intrusion detection. We present IG-GPU, a PyTorch re-architecture that offloads all pairwise intersections and subset evaluations to a commodity GPU. On the 15k-record NSL-KDD dataset, IG-GPU shows a 116-fold speed-up over the multi-core CPU implementation of IG.
arXiv Detail & Related papers (2025-07-16T12:38:19Z)
Minute-Long Videos with Dual Parallelisms [57.22737565366549]
Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. We propose a novel distributed inference strategy, termed DualParal. Instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs.
arXiv Detail & Related papers (2025-05-27T11:55:22Z)
Ramp Up NTT in Record Time using GPU-Accelerated Algorithms and LLM-based Code Generation [11.120838175165986]
Homomorphic encryption (HE) is a core building block in privacy-preserving machine learning (PPML). Many GPU-accelerated cryptographic schemes have been proposed to improve the performance of HE. Given the powerful code generation capabilities of large language models (LLMs), we aim to explore their potential to automatically generate practical GPU-friendly algorithm code.
arXiv Detail & Related papers (2025-02-16T12:53:23Z)
Characterization of GPU TEE Overheads in Distributed Data Parallel ML Training [7.236249885667945]
Confidential computing (CC) or trusted execution enclaves (TEEs) is now the most common approach to enable secure computing in the cloud. Recent introduction of GPU TEEs by NVIDIA enables machine learning (ML) models to be trained without leaking model weights or data to the cloud provider. We present an in-depth characterization study on performance overhead associated with running distributed data parallel (DDP) ML training with GPU TEEs.
arXiv Detail & Related papers (2025-01-20T22:23:50Z)
HashAttention: Semantic Sparsity for Faster Inference [91.54218318798603]
HashAttention is a principled approach casting pivotal token identification as a recommendation problem. It efficiently identifies pivotal tokens for a given query in this Hamming space using bitwise operations. It can reduce the number of tokens used by a factor of $1/32\times$ for the Llama-3.1-8B model with LongBench.
arXiv Detail & Related papers (2024-12-19T02:34:15Z)
Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation [7.204881999658682]
Inference for Large Language Models (LLMs) is computationally demanding. To reduce the cost of auto-regressive decoding, Key-Value (KV) caching is used to store intermediate activations. The memory required for KV caching grows rapidly, often exceeding the capacity of GPU memory. A cost-effective alternative is to offload KV cache to CPU memory, which alleviates GPU memory pressure but shifts the bottleneck to the limited bandwidth of the PCIe connection between the CPU and GPU.
arXiv Detail & Related papers (2024-11-26T04:03:14Z)
Multi-GPU RI-HF Energies and Analytic Gradients $-$ Towards High Throughput Ab Initio Molecular Dynamics [0.0]
This article presents an optimized algorithm and implementation for calculating resolution-of-the-identity Hartree-Fock energies and analytic gradients using multiple Graphics Processing Units (GPUs). The algorithm is especially designed for high throughput ab initio molecular dynamics simulations of small and medium size molecules (10-100 atoms).
arXiv Detail & Related papers (2024-07-29T00:14:10Z)
Cheddar: A Swift Fully Homomorphic Encryption Library for CUDA GPUs [2.613335121517245]
Fully homomorphic encryption (FHE) is a cryptographic technology capable of resolving security and privacy problems in cloud computing by encrypting data in use. FHE introduces tremendous computational overhead for processing encrypted data, causing FHE workloads to become 2-6 orders of magnitude slower than their unencrypted counterparts. We propose Cheddar, an FHE library for GPU, which demonstrates significantly faster performance compared to prior GPU implementations.
arXiv Detail & Related papers (2024-07-17T23:49:18Z)
Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. We propose a novel parallel prompt decoding that requires only $0.0002$% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours. Our approach demonstrates up to 2.49$\times$ speedup and maintains a minimal memory overhead of just $0.0004$%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z)
INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient. We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture. We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
A Lower Bound of Hash Codes' Performance [122.88252443695492]
In this paper, we prove that inter-class distinctiveness and intra-class compactness among hash codes determine the lower bound of hash codes' performance. We then propose a surrogate model to fully exploit the above objective by estimating the posterior of hash codes and controlling it, which results in a low-bias optimization. By testing on a series of hash-models, we obtain performance improvements among all of them, with an up to 26.5% increase in mean Average Precision and an up to 20.5% increase in accuracy.
arXiv Detail & Related papers (2022-10-12T03:30:56Z)
PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning. However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware. PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z)
Distributed Out-of-Memory NMF on CPU/GPU Architectures [1.0051474951635875]
We propose an efficient out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for HPC systems. Benchmark results show significant improvement of 32x to 76x speedup with the new implementation using GPU over the CPU-based NMFk.
arXiv Detail & Related papers (2022-02-19T03:49:21Z)
ASH: A Modern Framework for Parallel Spatial Hashing in 3D Perception [91.24236600199542]
ASH is a modern and high-performance framework for parallel spatial hashing on GPU. ASH achieves higher performance, supports richer functionality, and requires fewer lines of code. ASH and its example applications are open sourced in Open3D.
arXiv Detail & Related papers (2021-10-01T16:25:40Z)
Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms. We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z)
Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems. Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections. Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.