Blowfish: Topological and statistical signatures for quantifying ambiguity in semantic search
- URL: http://arxiv.org/abs/2406.07990v1
- Date: Wed, 12 Jun 2024 08:26:30 GMT
- Title: Blowfish: Topological and statistical signatures for quantifying ambiguity in semantic search
- Authors: Thomas Roland Barillot, Alex De Castro
- Abstract summary: We show that proxy ambiguous queries display different distributions of homology 0 and homology 1 based features than proxy clear queries.
We propose a strategy that leverages these findings as a new scoring approach for semantic similarity.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This work reports evidence for topological signatures of ambiguity in sentence embeddings that could be leveraged for ranking and/or explanation purposes in the context of vector search and Retrieval Augmented Generation (RAG) systems. We propose a working definition of ambiguity and design an experiment in which we break down a proprietary dataset into collections of chunks of varying size (3, 5, and 10 lines) and use the different collections successively as query and answer sets. This allows us to test the signatures of ambiguity while removing confounding factors. Our results show that proxy ambiguous queries (size-10 queries against size-3 documents) display different distributions of homology 0 and homology 1 based features than proxy clear queries (size-5 queries against size-10 documents). We then discuss those results in terms of increased manifold complexity and/or approximately discontinuous embedding submanifolds. Finally, we propose a strategy that leverages these findings as a new scoring approach for semantic similarity.
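The abstract does not spell out the feature pipeline, but the homology 0 features it refers to can be illustrated with a short, self-contained sketch. Assuming sentence (or chunk) embeddings are available as a NumPy array, the degree-0 persistence of a Vietoris-Rips filtration can be read off the Euclidean minimum spanning tree: every point is born at scale 0 and dies at the merge height of single-linkage clustering. The statistical summary below (`h0_features` and its field names) is a hypothetical illustration, not the paper's actual feature set; homology 1 features would additionally require a full Rips complex, e.g. via a library such as ripser.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def h0_persistence(points):
    """Degree-0 persistence of a Vietoris-Rips filtration over a point cloud.

    Each point is born at scale 0; connected components die exactly at the
    edge weights of the Euclidean minimum spanning tree (the single-linkage
    merge heights). Returns the n-1 death times, sorted ascending.
    """
    dist = squareform(pdist(points))          # dense pairwise distance matrix
    mst = minimum_spanning_tree(dist)         # sparse matrix of MST edges
    return np.sort(mst.data)                  # non-zero MST edge weights

def h0_features(points):
    """Hypothetical statistical summary of the H0 persistence values,
    of the kind that could feed a downstream ambiguity score."""
    deaths = h0_persistence(points)
    return {
        "mean_persistence": float(deaths.mean()),
        "max_persistence": float(deaths.max()),
        "total_persistence": float(deaths.sum()),
    }

# Usage on a toy embedding cloud (20 "sentences" in 8 dimensions):
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20, 8))
features = h0_features(embeddings)
```
Under the paper's hypothesis, an ambiguous query's neighborhood in embedding space would yield a different distribution of these persistence values than a clear query's neighborhood.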
Related papers
- When LLMs Disagree: Diagnosing Relevance Filtering Bias and Retrieval Divergence in SDG Search [0.0]
Large language models (LLMs) are increasingly used to assign document relevance labels in information retrieval pipelines.
LLMs often disagree on borderline cases, raising concerns about how such disagreement affects downstream retrieval.
We show that model disagreement is systematic, not random.
We propose using classification disagreement as an object of analysis in retrieval evaluation, particularly in policy-relevant or thematic search tasks.
arXiv Detail & Related papers (2025-07-02T20:53:51Z) - SparseCL: Sparse Contrastive Learning for Contradiction Retrieval [87.02936971689817]
Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query.
Existing methods such as similarity search and cross-encoder models exhibit significant limitations.
We introduce SparseCL, which leverages specially trained sentence embeddings designed to preserve subtle, contradictory nuances between sentences.
arXiv Detail & Related papers (2024-06-15T21:57:03Z) - How Does Generative Retrieval Scale to Millions of Passages? [68.98628807288972]
We conduct the first empirical study of generative retrieval techniques across various corpus scales.
We scale generative retrieval to millions of passages, using a corpus of 8.8M passages and evaluating model sizes up to 11B parameters.
While generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge.
arXiv Detail & Related papers (2023-05-19T17:33:38Z) - QUEST: A Retrieval Dataset of Entity-Seeking Queries with Implicit Set Operations [36.70770411188946]
QUEST is a dataset of 3357 natural language queries with implicit set operations.
The dataset challenges models to match multiple constraints mentioned in queries with corresponding evidence in documents.
We analyze several modern retrieval systems, finding that they often struggle on such queries.
arXiv Detail & Related papers (2023-05-19T14:19:32Z) - Explain like I am BM25: Interpreting a Dense Model's Ranked-List with a Sparse Approximation [19.922420813509518]
We introduce the notion of equivalent queries, generated by maximizing the similarity between the neural ranking model's (NRM's) results and the result set of a sparse retrieval system.
We then compare this approach with existing methods such as RM3-based query expansion.
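The summary above does not specify which result-set similarity the equivalent-query search maximizes; one minimal, commonly used choice is the top-k Jaccard overlap between the two ranked lists. The sketch below is an illustration under that assumption, with hypothetical document-ID lists:

```python
def jaccard_at_k(run_a, run_b, k=10):
    """Top-k Jaccard overlap between two ranked result lists.

    run_a, run_b: ranked sequences of document IDs (best first).
    Returns |A ∩ B| / |A ∪ B| over the top-k sets; 1.0 if both are empty.
    """
    a, b = set(run_a[:k]), set(run_b[:k])
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical example: a dense model's run vs. a sparse (BM25-style) run.
dense_run = ["d3", "d1", "d7", "d2"]
sparse_run = ["d1", "d3", "d9", "d5"]
overlap = jaccard_at_k(dense_run, sparse_run, k=4)
```
An equivalent-query search would then vary the sparse system's query to drive this overlap as high as possible.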
arXiv Detail & Related papers (2023-04-25T07:58:38Z) - Explanation Selection Using Unlabeled Data for Chain-of-Thought Prompting [80.9896041501715]
Explanations that have not been "tuned" for a task, such as off-the-shelf explanations written by nonexperts, may lead to mediocre performance.
This paper tackles the problem of how to optimize explanation-infused prompts in a black-box fashion.
arXiv Detail & Related papers (2023-02-09T18:02:34Z) - Query Expansion Using Contextual Clue Sampling with Language Models [69.51976926838232]
We propose a combination of an effective filtering strategy and fusion of the retrieved documents based on the generation probability of each context.
Our lexical-matching-based approach achieves similar top-5/top-20 retrieval accuracy and higher top-100 accuracy than the well-established dense retrieval model DPR.
For end-to-end QA, the reader model also benefits from our method and achieves the highest Exact-Match score against several competitive baselines.
arXiv Detail & Related papers (2022-10-13T15:18:04Z) - Aggregating Pairwise Semantic Differences for Few-Shot Claim Veracity Classification [21.842139093124512]
We introduce SEED, a novel vector-based method for claim veracity classification.
We build on the hypothesis that we can simulate class representative vectors that capture average semantic differences for claim-evidence pairs in a class.
Experiments conducted on the FEVER and SCIFACT datasets show consistent improvements over competitive baselines in few-shot settings.
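As summarized, SEED builds class-representative vectors from average semantic differences of claim-evidence pairs. The following is a minimal sketch of that idea only, with hypothetical embeddings and label names, not the paper's actual implementation:

```python
import numpy as np

def class_representatives(pairs_by_label):
    """Build one representative vector per class.

    pairs_by_label: {label: list of (claim_vec, evidence_vec)}.
    The representative is the mean difference claim - evidence,
    i.e. the 'average semantic difference' for that class.
    """
    return {label: np.mean([c - e for c, e in pairs], axis=0)
            for label, pairs in pairs_by_label.items()}

def predict(claim_vec, evidence_vec, reps):
    """Assign the label whose representative is closest (by cosine
    similarity) to the new pair's difference vector."""
    diff = claim_vec - evidence_vec
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(reps, key=lambda label: cos(diff, reps[label]))

# Hypothetical few-shot setup with 2-d toy embeddings:
reps = class_representatives({
    "SUPPORTS": [(np.array([1.0, 0.0]), np.array([0.0, 0.0]))],
    "REFUTES":  [(np.array([0.0, 1.0]), np.array([0.0, 0.0]))],
})
label = predict(np.array([0.9, 0.1]), np.array([0.0, 0.0]), reps)
```
The appeal in few-shot settings is that each class needs only enough labeled pairs to estimate one mean difference vector.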
arXiv Detail & Related papers (2022-05-11T17:23:37Z) - AmbiFC: Fact-Checking Ambiguous Claims with Evidence [57.7091560922174]
We present AmbiFC, a fact-checking dataset with 10k claims derived from real-world information needs.
We analyze disagreements arising from ambiguity when comparing claims against evidence in AmbiFC.
We develop models for predicting veracity that handle this ambiguity via soft labels.
arXiv Detail & Related papers (2021-04-01T17:40:08Z) - Adversarial Semantic Collisions [129.55896108684433]
We study semantic collisions: texts that are semantically unrelated but judged as similar by NLP models.
We develop gradient-based approaches for generating semantic collisions.
We show how to generate semantic collisions that evade perplexity-based filtering.
arXiv Detail & Related papers (2020-11-09T20:42:01Z) - Improving Query Safety at Pinterest [46.57632646205479]
PinSets is a system for query-set expansion.
It applies a simple yet powerful mechanism to search user sessions.
It expands a tiny seed set into thousands of related queries at nearly perfect precision.
arXiv Detail & Related papers (2020-06-20T07:35:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.