A New Parallel Algorithm for Sinkhorn Word-Movers Distance and Its
Performance on PIUMA and Xeon CPU
- URL: http://arxiv.org/abs/2107.06433v1
- Date: Wed, 14 Jul 2021 00:29:18 GMT
- Title: A New Parallel Algorithm for Sinkhorn Word-Movers Distance and Its
Performance on PIUMA and Xeon CPU
- Authors: Jesmin Jahan Tithi and Fabrizio Petrini
- Abstract summary: The Word Movers Distance (WMD) measures the semantic dissimilarity between two text documents.
We present a shared-memory parallel Sinkhorn-Knopp algorithm to compute the WMD of one document against many other documents.
The parallel implementation achieves 67x speedup on 96 cores across 4 NUMA sockets of an Intel Cascade Lake system.
- Score: 0.3655021726150367
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Word Movers Distance (WMD) measures the semantic dissimilarity between
two text documents by computing the cost of optimally moving all words of a
source/query document to the most similar words of a target document. Computing
WMD between two documents is costly because it requires solving an optimization
problem that costs $O(V^3 \log(V))$, where $V$ is the number of unique words
in the document. Fortunately, WMD can be framed as an Earth Mover's Distance
(EMD) for which the algorithmic complexity can be reduced to $O(V^2)$ by adding
an entropy penalty to the optimization problem and solving it using the
Sinkhorn-Knopp algorithm. Additionally, the computation can be made highly
parallel by computing the WMD of a single query document against multiple
target documents at once, for example by finding whether a given tweet is
similar to any other tweets of a given day.
In this paper, we first present a shared-memory parallel Sinkhorn-Knopp
algorithm to compute the WMD of one document against many other documents by
adopting the $O(V^2)$ EMD algorithm. We then algorithmically transform the
original $O(V^2)$ dense compute-heavy version into an equivalent sparse one
which is mapped onto the new Intel Programmable Integrated Unified Memory
Architecture (PIUMA) system. The WMD parallel implementation achieves 67x
speedup on 96 cores across 4 NUMA sockets of an Intel Cascade Lake system. We
also show that PIUMA cores are around 1.2-2.6x faster than Xeon cores on
Sinkhorn-WMD and also provide better strong scaling.
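The Sinkhorn-Knopp iteration the abstract refers to can be sketched in a few lines. The following is a minimal dense NumPy illustration of entropy-regularized WMD of one query against many targets, not the paper's optimized parallel or sparse PIUMA implementation; the function name, the regularization strength `lam`, and the fixed iteration count are assumptions made for the sketch.

```python
import numpy as np

def sinkhorn_wmd(r, C, M, lam=10.0, iters=50):
    """Entropy-regularized WMD of one query against many targets.

    r : (V,)   normalized bag-of-words of the query document
    C : (V, n) normalized bags-of-words of n target documents (one per column)
    M : (V, V) pairwise word-embedding distance (cost) matrix
    Returns an (n,) array of approximate WMD values.
    """
    K = np.exp(-lam * M)                  # kernel induced by the entropy penalty
    u = np.full((M.shape[0], C.shape[1]), 1.0 / M.shape[0])
    for _ in range(iters):                # Sinkhorn-Knopp matrix scaling
        v = C / (K.T @ u)                 # scale columns to match targets
        u = r[:, None] / (K @ v)          # scale rows to match the query
    # transport cost of the implied plan diag(u) K diag(v), weighted by M
    return np.einsum('iv,ij,jv,ij->v', u, K, v, M)
```

Computing all targets at once turns the per-document vector updates into dense matrix operations, which is the source of the parallelism exploited in the paper.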
Related papers
- Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance.
We propose a novel parallel prompt decoding that requires only 0.0002% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.
Our approach demonstrates up to 2.49x speedup and maintains a minimal memory overhead of just 0.0004%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z) - Linear-time Minimum Bayes Risk Decoding with Reference Aggregation [52.1701152610258]
Minimum Bayes Risk (MBR) decoding is a text generation technique that has been shown to improve the quality of machine translations.
It requires the pairwise calculation of a utility metric, which has quadratic complexity.
We propose to approximate pairwise metric scores with scores calculated against aggregated reference representations.
arXiv Detail & Related papers (2024-02-06T18:59:30Z) - Generative Sliced MMD Flows with Riesz Kernels [0.393259574660092]
Maximum mean discrepancy (MMD) flows suffer from high computational costs in large scale computations.
We show that MMD flows with Riesz kernels $K(x,y) = -\|x-y\|^r$, $r \in (0,2)$, have exceptional properties which allow their efficient computation.
arXiv Detail & Related papers (2023-05-19T06:33:57Z) - Cooperative Multi-Agent Reinforcement Learning: Asynchronous
Communication and Linear Function Approximation [77.09836892653176]
We study multi-agent reinforcement learning in the setting of episodic Markov decision processes.
We propose a provably efficient value-based algorithm that enables asynchronous communication.
We show that a minimal $\Omega(dM)$ communication complexity is required to improve the performance through collaboration.
arXiv Detail & Related papers (2023-05-10T20:29:29Z) - Scalable Quantum Error Correction for Surface Codes using FPGA [67.74017895815125]
A fault-tolerant quantum computer must decode and correct errors faster than they appear.
We report a distributed version of the Union-Find decoder that exploits parallel computing resources for further speedup.
The implementation employs a scalable architecture called Helios that organizes parallel computing resources into a hybrid tree-grid structure.
arXiv Detail & Related papers (2023-01-20T04:23:00Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with the communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - Rapid Person Re-Identification via Sub-space Consistency Regularization [51.76876061721556]
Person Re-Identification (ReID) matches pedestrians across disjoint cameras.
Existing ReID methods adopting real-value feature descriptors have achieved high accuracy, but they are low in efficiency due to the slow Euclidean distance computation.
We propose a novel Sub-space Consistency Regularization (SCR) algorithm that can speed up the ReID procedure by 0.25x.
arXiv Detail & Related papers (2022-07-13T02:44:05Z) - A Push-Relabel Based Additive Approximation for Optimal Transport [5.111364864495785]
Exact algorithms for computing Optimal Transport can be slow.
We introduce a new and very simple approach to find an $\varepsilon$-approximation of the OT distance.
Our algorithm achieves a near-optimal execution time of $O(n^2/\varepsilon^2)$ for computing OT distance.
arXiv Detail & Related papers (2022-03-07T21:40:14Z) - Compressing 1D Time-Channel Separable Convolutions using Sparse Random
Ternary Matrices [65.4388266814055]
We replace 1x1-convolutions in 1D time-channel separable convolutions with constant, sparse random ternary matrices with weights in $\{-1, 0, +1\}$.
For command recognition on Google Speech Commands v1, we improve the state-of-the-art accuracy from 97.21% to 97.41% at the same network size.
For speech recognition on Librispeech, we halve the number of weights to be trained while sacrificing only about 1% of the floating-point baseline's word error rate.
arXiv Detail & Related papers (2021-03-31T15:09:20Z) - Exact, Parallelizable Dynamic Time Warping Alignment with Linear Memory [0.0]
We present a divide and conquer algorithm that computes the exact globally optimal DTW alignment using O(M+N) memory.
Our algorithm can be parallelized up to a factor of min(M, N) under the same memory constraints, so on a suitable GPU it can still run faster than the textbook version.
arXiv Detail & Related papers (2020-08-04T15:00:33Z) - An Efficient Shared-memory Parallel Sinkhorn-Knopp Algorithm to Compute
the Word Mover's Distance [0.18275108630751835]
The Word Mover's Distance (WMD) is a metric that measures the semantic dissimilarity between two text documents.
We present a shared-memory parallel algorithm to compute WMD of one document against many other documents.
arXiv Detail & Related papers (2020-05-14T05:30:18Z)
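As background for the linear-memory DTW entry above: the textbook cost-only DTW recurrence already runs in linear memory by keeping just two dynamic-programming rows; that paper's contribution is recovering the exact alignment path within the same memory bound via divide and conquer. A minimal sketch of the two-row cost recurrence follows, where the function name and the absolute-difference local cost are illustrative assumptions.

```python
from math import inf

def dtw_cost(x, y):
    """Textbook DTW cost in O(len(y)) memory: keep only two DP rows."""
    prev = [inf] * (len(y) + 1)
    prev[0] = 0.0
    for xi in x:
        cur = [inf] * (len(y) + 1)
        for j, yj in enumerate(y, start=1):
            # local match cost plus the cheapest of the three allowed moves
            cur[j] = abs(xi - yj) + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[-1]
```

Recovering which cells the optimal path visited, rather than just the final cost, is what normally forces O(MN) memory; the divide-and-conquer trick finds where the path crosses the middle row and recurses on the two halves.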
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.