A New Parallel Algorithm for Sinkhorn Word-Movers Distance and Its
Performance on PIUMA and Xeon CPU
- URL: http://arxiv.org/abs/2107.06433v1
- Date: Wed, 14 Jul 2021 00:29:18 GMT
- Title: A New Parallel Algorithm for Sinkhorn Word-Movers Distance and Its
Performance on PIUMA and Xeon CPU
- Authors: Jesmin Jahan Tithi and Fabrizio Petrini
- Abstract summary: The Word Movers Distance (WMD) measures the semantic dissimilarity between two text documents.
We present a shared-memory parallel Sinkhorn-Knopp algorithm to compute the WMD of one document against many other documents.
The parallel implementation achieves 67x speedup on 96 cores across 4 NUMA sockets of an Intel Cascade Lake system.
- Score: 0.3655021726150367
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Word Movers Distance (WMD) measures the semantic dissimilarity between
two text documents by computing the cost of optimally moving all words of a
source/query document to the most similar words of a target document. Computing
WMD between two documents is costly because it requires solving an optimization
problem that costs $O(V^3 \log(V))$, where $V$ is the number of unique words
in the document. Fortunately, WMD can be framed as an Earth Mover's Distance
(EMD) for which the algorithmic complexity can be reduced to $O(V^2)$ by adding
an entropy penalty to the optimization problem and solving it using the
Sinkhorn-Knopp algorithm. Additionally, the computation can be made highly
parallel by computing the WMD of a single query document against multiple
target documents at once, for example by finding whether a given tweet is
similar to any other tweets of a given day.
In this paper, we first present a shared-memory parallel Sinkhorn-Knopp
algorithm to compute the WMD of one document against many other documents by
adopting the $O(V^2)$ EMD algorithm. We then algorithmically transform the
original $O(V^2)$ dense compute-heavy version into an equivalent sparse one
which is mapped onto the new Intel Programmable Integrated Unified Memory
Architecture (PIUMA) system. The WMD parallel implementation achieves 67x
speedup on 96 cores across 4 NUMA sockets of an Intel Cascade Lake system. We
also show that PIUMA cores are around 1.2-2.6x faster than Xeon cores on
Sinkhorn-WMD and also provide better strong scaling.
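The Sinkhorn-Knopp iteration the abstract refers to can be sketched in a few lines. The following is a minimal dense NumPy illustration of entropy-regularized WMD of one query against many targets, not the paper's optimized parallel or sparse PIUMA implementation; the function name, the regularization strength `lam`, and the fixed iteration count are assumptions made for the sketch.

```python
import numpy as np

def sinkhorn_wmd(r, C, M, lam=10.0, iters=50):
    """Entropy-regularized WMD of one query against many targets.

    r : (V,)   normalized bag-of-words of the query document
    C : (V, n) normalized bags-of-words of n target documents (one per column)
    M : (V, V) pairwise word-embedding distance (cost) matrix
    Returns an (n,) array of approximate WMD values.
    """
    K = np.exp(-lam * M)                  # kernel induced by the entropy penalty
    u = np.full((M.shape[0], C.shape[1]), 1.0 / M.shape[0])
    for _ in range(iters):                # Sinkhorn-Knopp matrix scaling
        v = C / (K.T @ u)                 # scale columns to match targets
        u = r[:, None] / (K @ v)          # scale rows to match the query
    # transport cost of the implied plan diag(u) K diag(v), weighted by M
    return np.einsum('iv,ij,jv,ij->v', u, K, v, M)
```

Computing all targets at once turns the per-document vector updates into dense matrix operations, which is the source of the parallelism exploited in the paper.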
Related papers
- Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance.
We propose a novel parallel prompt decoding that requires only 0.0002% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.
Our approach demonstrates up to 2.49x speedup and maintains a minimal memory overhead of just 0.0004%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z) - Linear-time Minimum Bayes Risk Decoding with Reference Aggregation [52.1701152610258]
Minimum Bayes Risk (MBR) decoding is a text generation technique that has been shown to improve the quality of machine translations.
It requires the pairwise calculation of a utility metric, which has quadratic complexity.
We propose to approximate pairwise metric scores with scores calculated against aggregated reference representations.
arXiv Detail & Related papers (2024-02-06T18:59:30Z) - Generative Sliced MMD Flows with Riesz Kernels [0.393259574660092]
Maximum mean discrepancy (MMD) flows suffer from high computational costs in large scale computations.
We show that MMD flows with Riesz kernels $K(x,y) = -\|x-y\|^r$, $r \in (0,2)$, have exceptional properties which allow their efficient computation.
arXiv Detail & Related papers (2023-05-19T06:33:57Z) - Cooperative Multi-Agent Reinforcement Learning: Asynchronous
Communication and Linear Function Approximation [77.09836892653176]
We study multi-agent reinforcement learning in the setting of episodic Markov decision processes.
We propose a provably efficient value-based algorithm that enables asynchronous communication.
We show that a minimal $\Omega(dM)$ communication complexity is required to improve the performance through collaboration.
arXiv Detail & Related papers (2023-05-10T20:29:29Z) - Scalable Quantum Error Correction for Surface Codes using FPGA [67.74017895815125]
A fault-tolerant quantum computer must decode and correct errors faster than they appear.
We report a distributed version of the Union-Find decoder that exploits parallel computing resources for further speedup.
The implementation employs a scalable architecture called Helios that organizes parallel computing resources into a hybrid tree-grid structure.
arXiv Detail & Related papers (2023-01-20T04:23:00Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with the communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - Rapid Person Re-Identification via Sub-space Consistency Regularization [51.76876061721556]
Person Re-Identification (ReID) matches pedestrians across disjoint cameras.
Existing ReID methods adopting real-value feature descriptors have achieved high accuracy, but they are low in efficiency due to the slow Euclidean distance computation.
We propose a novel Sub-space Consistency Regularization (SCR) algorithm that can speed up the ReID procedure by 0.25x.
arXiv Detail & Related papers (2022-07-13T02:44:05Z) - A Push-Relabel Based Additive Approximation for Optimal Transport [5.111364864495785]
Exact algorithms for computing Optimal Transport can be slow.
We introduce a new and very simple approach to find an $\varepsilon$-approximation of the OT distance.
Our algorithm achieves a near-optimal execution time of $O(n^2/\varepsilon^2)$ for computing OT distance.
arXiv Detail & Related papers (2022-03-07T21:40:14Z) - Compressing 1D Time-Channel Separable Convolutions using Sparse Random
Ternary Matrices [65.4388266814055]
We replace 1x1-convolutions in 1D time-channel separable convolutions with constant, sparse random ternary matrices with weights in $\{-1, 0, +1\}$.
For command recognition on Google Speech Commands v1, we improve the state-of-the-art accuracy from 97.21% to 97.41% at the same network size.
For speech recognition on Librispeech, we halve the number of weights to be trained while sacrificing only about 1% of the floating-point baseline's word error rate.
arXiv Detail & Related papers (2021-03-31T15:09:20Z) - Exact, Parallelizable Dynamic Time Warping Alignment with Linear Memory [0.0]
We present a divide and conquer algorithm that computes the exact globally optimal DTW alignment using O(M+N) memory.
Our algorithm can be parallelized up to a factor of min(M, N) under the same memory constraints, so on a suitable GPU it can still run faster than the textbook version.
arXiv Detail & Related papers (2020-08-04T15:00:33Z) - An Efficient Shared-memory Parallel Sinkhorn-Knopp Algorithm to Compute
the Word Mover's Distance [0.18275108630751835]
The Word Mover's Distance (WMD) is a metric that measures the semantic dissimilarity between two text documents.
We present a shared-memory parallel algorithm to compute WMD of one document against many other documents.
arXiv Detail & Related papers (2020-05-14T05:30:18Z)
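As background for the linear-memory DTW entry above: the textbook cost-only DTW recurrence already runs in linear memory by keeping just two dynamic-programming rows; that paper's contribution is recovering the exact alignment path within the same memory bound via divide and conquer. A minimal sketch of the two-row cost recurrence follows, where the function name and the absolute-difference local cost are illustrative assumptions.

```python
from math import inf

def dtw_cost(x, y):
    """Textbook DTW cost in O(len(y)) memory: keep only two DP rows."""
    prev = [inf] * (len(y) + 1)
    prev[0] = 0.0
    for xi in x:
        cur = [inf] * (len(y) + 1)
        for j, yj in enumerate(y, start=1):
            # local match cost plus the cheapest of the three allowed moves
            cur[j] = abs(xi - yj) + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[-1]
```

Recovering which cells the optimal path visited, rather than just the final cost, is what normally forces O(MN) memory; the divide-and-conquer trick finds where the path crosses the middle row and recurses on the two halves.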
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.