Related papers: SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora

SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora

URL: http://arxiv.org/abs/2602.10908v1
Date: Wed, 11 Feb 2026 14:40:15 GMT
Title: SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora
Authors: Masataka Yoneda, Yusuke Matsushita, Go Kamoda, Kohei Suenaga, Takuya Akiba, Masaki Waga, Sho Yokoi,
Abstract summary: We present an ultra-fast and flexible search algorithm that enables search over trillion-scale natural language corpora in under 0.3 seconds.<n>Our approach employs string matching based on suffix arrays that scales well with corpus size.<n>We show that our method achieves significantly lower search latency than existing methods.
Score: 13.87355367137306
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present an ultra-fast and flexible search algorithm that enables search over trillion-scale natural language corpora in under 0.3 seconds while handling semantic variations (substitution, insertion, and deletion). Our approach employs string matching based on suffix arrays that scales well with corpus size. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: fast exact lookup enabled by a disk-aware design, and dynamic corpus-aware pruning. We theoretically show that the proposed method suppresses exponential growth in the search space with respect to query length by leveraging statistical properties of natural language. In experiments on FineWeb-Edu (Lozhkov et al., 2024) (1.4T tokens), we show that our method achieves significantly lower search latency than existing methods: infini-gram (Liu et al., 2024), infini-gram mini (Xu et al., 2025), and SoftMatcha (Deguchi et al., 2025). As a practical application, we demonstrate that our method identifies benchmark contamination in training corpora, unidentified by existing approaches. We also provide an online demo of fast, soft search across corpora in seven languages.

Related papers

Beyond Magic Words: Sharpness-Aware Prompt Evolving for Robust Large Language Models with TARE [57.67972800581953]
We introduce TARE, a derivative-free framework that alternates between an inner, sampling-based adversarial search that stresses a prompt with hard paraphrases.<n>We also propose ATARE, which learns anisotropic weights to shape the semantic neighborhood and adapts its radius over time to balance exploration and fidelity.
arXiv Detail & Related papers (2025-09-28T23:57:05Z)
SoftMatcha: A Soft and Fast Pattern Matcher for Billion-Scale Corpus Searches [5.80278230280824]
We propose a novel algorithm that achieves semantic yet efficient pattern matching by relaxing a surface-level matching with word embeddings.<n>Our experiments demonstrate that the proposed method can execute searches on billion-scale corpora in less than a second.
arXiv Detail & Related papers (2025-03-05T17:53:11Z)
A Three-Stage Algorithm for the Closest String Problem on Artificial and Real Gene Sequences [39.58317527488534]
Closest String Problem is an NP-hard problem that aims to find a string that has the minimum distance from all sequences that belong to the given set of strings. In this paper, we introduce a three-stage algorithm that comprises the following process: first, we apply a novel alphabet pruning method to reduce the search space for effectively finding promising search regions. Second, a variant of beam search to find a solution is employed. This method utilizes a newly developed guiding function based on an expected distance score of partial solutions.
arXiv Detail & Related papers (2024-07-17T21:26:27Z)
Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations [8.796275989527054]
We propose a novel organization of the inverted index that enables fast retrieval over learned sparse embeddings. Our approach organizes inverted lists into geometrically-cohesive blocks, each equipped with a summary vector. Our results indicate that Seismic is one to two orders of magnitude faster than state-of-the-art inverted index-based solutions.
arXiv Detail & Related papers (2024-04-29T15:49:27Z)
Stream of Search (SoS): Learning to Search in Language [29.841835308845948]
We show how language models can be taught to search by representing the process of search in language as a flattened string. We propose a unified language for search that captures an array of different symbolic search strategies. Our results indicate that language models can learn to solve problems via search, self-improve to flexibly use different search strategies, and potentially discover new ones.
arXiv Detail & Related papers (2024-04-01T06:50:52Z)
Unified Functional Hashing in Automatic Machine Learning [58.77232199682271]
We show that large efficiency gains can be obtained by employing a fast unified functional hash. Our hash is "functional" in that it identifies equivalent candidates even if they were represented or coded differently. We show dramatic improvements on multiple AutoML domains, including neural architecture search and algorithm discovery.
arXiv Detail & Related papers (2023-02-10T18:50:37Z)
Bridging the Training-Inference Gap for Dense Phrase Retrieval [104.4836127502683]
Building dense retrievers requires a series of standard procedures, including training and validating neural models. In this paper, we explore how the gap between training and inference in dense retrieval can be reduced. We propose an efficient way of validating dense retrievers using a small subset of the entire corpus.
arXiv Detail & Related papers (2022-10-25T00:53:06Z)
GOLD-NAS: Gradual, One-Level, Differentiable [100.12492801459105]
We propose a novel algorithm named Gradual One-Level Differentiable Neural Architecture Search (GOLD-NAS) It introduces a variable resource constraint to one-level optimization so that the weak operators are gradually pruned out from the super-network.
arXiv Detail & Related papers (2020-07-07T10:37:49Z)
Interactive Extractive Search over Biomedical Corpora [41.72755714431404]
We present a system that allows life-science researchers to search a linguistically annotated corpus of texts. We introduce a light-weight query language that does not require the user to know the details of the underlying linguistic representations. Search is performed at an interactive speed due to efficient linguistic graph-indexing and retrieval engine.
arXiv Detail & Related papers (2020-06-07T13:26:32Z)
Learning to Accelerate Heuristic Searching for Large-Scale Maximum Weighted b-Matching Problems in Online Advertising [51.97494906131859]
Bipartite b-matching is fundamental in algorithm design, and has been widely applied into economic markets, labor markets, etc. Existing exact and approximate algorithms usually fail in such settings due to either requiring intolerable running time or too much computation resource. We propose textttNeuSearcher which leverages the knowledge learned from previously instances to solve new problem instances.
arXiv Detail & Related papers (2020-05-09T02:48:23Z)
Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering [87.32442219333046]
We propose a simple and resource-efficient method to pretrain the paragraph encoder. Our method outperforms an existing dense retrieval method that uses 7 times more computational resources for pretraining.
arXiv Detail & Related papers (2020-04-30T18:09:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.