VectorSearch: Enhancing Document Retrieval with Semantic Embeddings and
Optimized Search
- URL: http://arxiv.org/abs/2409.17383v1
- Date: Wed, 25 Sep 2024 21:58:08 GMT
- Title: VectorSearch: Enhancing Document Retrieval with Semantic Embeddings and
Optimized Search
- Authors: Solmaz Seyed Monir, Irene Lau, Shubing Yang, Dongfang Zhao
- Abstract summary: We propose VectorSearch, which leverages advanced algorithms, embeddings, and indexing techniques for refined retrieval.
By utilizing innovative multi-vector search operations and encoding searches with advanced language models, our approach significantly improves retrieval accuracy.
Experiments on real-world datasets show that VectorSearch outperforms baseline methods.
- Score: 1.0411820336052784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional retrieval methods have been essential for assessing document
similarity but struggle with capturing semantic nuances. Despite advancements
in latent semantic analysis (LSA) and deep learning, achieving comprehensive
semantic understanding and accurate retrieval remains challenging due to high
dimensionality and semantic gaps. These challenges call for new techniques
that effectively reduce dimensionality and close the semantic gap. To this end,
we propose VectorSearch, which leverages advanced algorithms, embeddings, and
indexing techniques for refined retrieval. By utilizing innovative multi-vector
search operations and encoding searches with advanced language models, our
approach significantly improves retrieval accuracy. Experiments on real-world
datasets show that VectorSearch outperforms baseline methods, demonstrating its
efficacy for large-scale retrieval tasks.
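To make the pipeline concrete, here is a minimal sketch of the multi-vector pattern the abstract describes: encode several fields per document, index each field separately, and fuse the per-field scores at query time. The encoder, FAISS index type, and fusion weights are illustrative assumptions, not the authors' configuration.

```python
# Sketch: multi-field ("multi-vector") dense retrieval with FAISS.
# Each document gets one vector per field; query scores are fused.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

docs = [
    {"title": "Latent semantic analysis",
     "body": "LSA reduces the dimensionality of term-document matrices."},
    {"title": "Dense retrieval",
     "body": "Embeddings capture semantic similarity beyond keyword overlap."},
]

def build_index(texts):
    vecs = np.asarray(model.encode(texts, normalize_embeddings=True), dtype="float32")
    index = faiss.IndexFlatIP(vecs.shape[1])  # cosine via normalized inner product
    index.add(vecs)
    return index

title_index = build_index([d["title"] for d in docs])
body_index = build_index([d["body"] for d in docs])

q = np.asarray(model.encode(["semantic document similarity"],
                            normalize_embeddings=True), dtype="float32")
k = len(docs)
t_scores, t_ids = title_index.search(q, k)
b_scores, b_ids = body_index.search(q, k)

# Fuse per-field scores; the weights here are arbitrary.
fused = np.zeros(len(docs))
for score, i in zip(t_scores[0], t_ids[0]):
    fused[i] += 0.4 * score
for score, i in zip(b_scores[0], b_ids[0]):
    fused[i] += 0.6 * score
print(docs[int(fused.argmax())]["title"])
```

Normalizing the embeddings lets an inner-product index stand in for cosine similarity, which keeps the per-field scores on a comparable scale before fusion.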
Related papers
- Hybrid Semantic Search: Unveiling User Intent Beyond Keywords [0.0]
This paper addresses the limitations of traditional keyword-based search in understanding user intent.
It introduces a novel hybrid search approach that leverages the strengths of non-semantic search engines, Large Language Models (LLMs), and embedding models.
arXiv Detail & Related papers (2024-08-17T16:04:31Z)
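As a rough illustration of such a hybrid, the sketch below fuses BM25 scores with dense cosine similarity; the libraries, toy data, and fusion weight alpha are assumptions, and the paper's LLM component is not reproduced.

```python
# Sketch of hybrid retrieval: mix lexical BM25 scores with dense cosine
# similarity so that semantic matches survive missing keywords.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "how to reset a forgotten email password",
    "recovering account access after losing credentials",
    "best pasta recipes for beginners",
]
query = "I lost my login details"

# Lexical scores.
bm25 = BM25Okapi([d.split() for d in docs])
lex = np.array(bm25.get_scores(query.split()))

# Dense scores (cosine, via normalized embeddings).
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)
dense = doc_vecs @ model.encode(query, normalize_embeddings=True)

# Min-max normalize each signal, then mix; alpha is a tunable assumption.
def norm(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

alpha = 0.5
hybrid = alpha * norm(lex) + (1 - alpha) * norm(dense)
print(docs[int(hybrid.argmax())])  # dense signal catches intent sans keywords
```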
- Evaluation of Semantic Search and its Role in Retrieved-Augmented-Generation (RAG) for Arabic Language [0.0]
This paper endeavors to establish a straightforward yet potent benchmark for semantic search in Arabic.
To precisely evaluate the effectiveness of these metrics and the dataset, we conduct our assessment of semantic search within the framework of retrieval-augmented generation (RAG).
arXiv Detail & Related papers (2024-03-27T08:42:31Z)
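A minimal sketch of this kind of evaluation: embed the passage pool, retrieve top-k per question, and report accuracy@k. The multilingual encoder and toy data are stand-ins, not the paper's benchmark.

```python
# Sketch: scoring semantic search as the retrieval stage of a RAG system
# by checking whether the gold passage lands in the top-k results.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

passages = [
    "The capital of Egypt is Cairo.",
    "The Nile flows north into the Mediterranean.",
    "Couscous is a staple across North Africa.",
]
eval_set = [
    ("What is Egypt's capital?", 0),  # (question, index of gold passage)
    ("Where does the Nile flow?", 1),
]

p_vecs = model.encode(passages, normalize_embeddings=True)
hits, k = 0, 2
for question, gold in eval_set:
    q_vec = model.encode(question, normalize_embeddings=True)
    top_k = util.semantic_search(q_vec, p_vecs, top_k=k)[0]
    retrieved = [hit["corpus_id"] for hit in top_k]
    hits += gold in retrieved  # a full RAG system would now prompt an LLM
print(f"accuracy@{k}: {hits / len(eval_set):.2f}")
```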
- Enhancing Cloud-Based Large Language Model Processing with Elasticsearch and Transformer Models [17.09116903102371]
Large Language Models (LLMs) are a class of generative AI models built using the Transformer network.
LLMs are capable of leveraging vast datasets to identify, summarize, translate, predict, and generate language.
Semantic vector search within large language models is a potent technique that can significantly enhance search result accuracy and relevance.
arXiv Detail & Related papers (2024-02-24T12:31:22Z)
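The sketch below shows the usual shape of such a setup: a dense_vector field in Elasticsearch plus its kNN search API, with embeddings from a Transformer encoder. The index name, mapping, and model choice are assumptions rather than the paper's configuration.

```python
# Sketch: semantic vector search in Elasticsearch with a dense_vector
# field and a kNN query (Elasticsearch 8.x Python client assumed).
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("all-MiniLM-L6-v2")

es.indices.create(index="docs", mappings={
    "properties": {
        "text": {"type": "text"},
        "embedding": {"type": "dense_vector", "dims": 384,
                      "index": True, "similarity": "cosine"},
    }
})

for i, text in enumerate(["cloud LLM serving", "vector similarity search"]):
    es.index(index="docs", id=str(i),
             document={"text": text, "embedding": model.encode(text).tolist()})
es.indices.refresh(index="docs")

resp = es.search(index="docs", knn={
    "field": "embedding",
    "query_vector": model.encode("semantic search").tolist(),
    "k": 2,
    "num_candidates": 10,
})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["text"])
```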
- Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g., document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval.
Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
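A toy sketch of fine-grained indexing: split each passage into smaller units, index the units, and map hits back to their source passage. Naive sentence splitting stands in for the paper's learned proposition generator, which is a simplifying assumption.

```python
# Sketch: index a corpus at sub-passage granularity and resolve matches
# back to the passage they came from.
import re
import numpy as np
from sentence_transformers import SentenceTransformer

passages = [
    "Marie Curie won two Nobel Prizes. She discovered polonium and radium.",
    "The Eiffel Tower opened in 1889. It was built for the World's Fair.",
]

units, source = [], []
for pid, passage in enumerate(passages):
    for sent in re.split(r"(?<=[.!?])\s+", passage.strip()):
        units.append(sent)
        source.append(pid)  # remember which passage each unit came from

model = SentenceTransformer("all-MiniLM-L6-v2")
unit_vecs = model.encode(units, normalize_embeddings=True)
q = model.encode("Who discovered radium?", normalize_embeddings=True)

best = int(np.argmax(unit_vecs @ q))
print("matched unit:", units[best])
print("source passage:", passages[source[best]])
```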
- Lexically-Accelerated Dense Retrieval [29.327878974130055]
LADR (Lexically-Accelerated Dense Retrieval) is a simple yet effective approach that improves the efficiency of existing dense retrieval models.
LADR consistently achieves both precision and recall that are on par with an exhaustive search on standard benchmarks.
arXiv Detail & Related papers (2023-07-31T15:44:26Z)
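The core LADR idea can be sketched as follows: seed a candidate pool with cheap lexical retrieval, expand it through a precomputed document-to-document nearest-neighbor graph, and re-score only that pool with the dense model. The toy corpus and hand-written neighbor graph below are assumptions.

```python
# Sketch of lexically-accelerated dense retrieval: lexical seeds, graph
# expansion, then dense re-scoring of the small candidate pool only.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "symptoms of the common cold",
    "treating a mild viral infection",
    "stock market closing prices",
    "flu season prevention tips",
]
# Assumed precomputed neighbor graph (doc id -> similar doc ids).
neighbors = {0: [1, 3], 1: [0, 3], 2: [], 3: [0, 1]}

query = "how do I get over a cold"
bm25 = BM25Okapi([d.split() for d in docs])
seeds = np.argsort(bm25.get_scores(query.split()))[::-1][:1]

pool = set(int(s) for s in seeds)
for s in seeds:
    pool.update(neighbors[int(s)])  # lexical misses reachable via the graph

model = SentenceTransformer("all-MiniLM-L6-v2")
pool = sorted(pool)
vecs = model.encode([docs[i] for i in pool], normalize_embeddings=True)
q = model.encode(query, normalize_embeddings=True)
best = pool[int(np.argmax(vecs @ q))]
print(docs[best])
```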
- How Does Generative Retrieval Scale to Millions of Passages? [68.98628807288972]
We conduct the first empirical study of generative retrieval techniques across various corpus scales.
We scale generative retrieval to a corpus of 8.8M passages and evaluate model sizes up to 11B parameters.
While generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge.
arXiv Detail & Related papers (2023-05-19T17:33:38Z)
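In outline, generative retrieval has the model emit a document identifier rather than look one up in an index, with decoding constrained to valid identifiers. The sketch below uses an untrained t5-small and a token trie via prefix_allowed_tokens_fn as a stand-in; a real system fine-tunes the model to map queries to docids.

```python
# Sketch of generative retrieval: a seq2seq model generates a docid,
# with beam search constrained to the set of valid identifiers.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

docids = ["doc-001", "doc-002", "doc-017"]
# Build a trie of token sequences for the allowed identifiers.
trie = {}
for d in docids:
    node = trie
    for t in tok(d).input_ids:  # includes the EOS token
        node = node.setdefault(t, {})

def allowed(batch_id, prefix_ids):
    node = trie
    for t in prefix_ids[1:].tolist():  # skip the decoder start token
        node = node.get(t, {})
    return list(node) or [tok.eos_token_id]

inputs = tok("query: effects of caffeine", return_tensors="pt")
out = model.generate(**inputs, prefix_allowed_tokens_fn=allowed, num_beams=3)
print(tok.decode(out[0], skip_special_tokens=True))  # one of the docids
```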
- CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks [62.22920673080208]
A single-step generative model can dramatically simplify the search process and be optimized in an end-to-end manner.
We name the pre-trained generative retrieval model CorpusBrain, as all information about the corpus is encoded in its parameters without the need to construct an additional index.
arXiv Detail & Related papers (2022-08-16T10:22:49Z)
- Semantic Search for Large Scale Clinical Ontologies [63.71950996116403]
We present a deep learning approach to building a search system for large clinical vocabularies.
We propose a Triplet-BERT model and a method for generating semantic training data.
The model is evaluated on five real benchmark data sets, and the results show that our approach achieves strong performance on both free-text-to-concept and concept-to-concept search tasks.
arXiv Detail & Related papers (2022-01-01T05:15:42Z)
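A minimal sketch of triplet training with a bi-encoder, in the spirit of this setup: each example pairs a free-text anchor with a matching concept and a non-matching one. The base model and toy examples are assumptions.

```python
# Sketch: fine-tune a bi-encoder with triplet loss so that free-text
# mentions land near their matching ontology concepts.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")
train_examples = [
    # texts = [anchor (free text), positive concept, negative concept]
    InputExample(texts=["heart attack", "Myocardial infarction", "Migraine"]),
    InputExample(texts=["high blood sugar", "Hyperglycemia", "Hypotension"]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.TripletLoss(model)  # margin loss: pull positives, push negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=0)
query_vec = model.encode("heart attack")  # then search concept embeddings
```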
- Exposing Query Identification for Search Transparency [69.06545074617685]
We explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the role of queries and documents in two classes of search systems.
We derive an evaluation metric to measure the quality of a ranking of exposing queries and conduct an empirical analysis focusing on various practical aspects of approximate EQI.
arXiv Detail & Related papers (2021-10-14T20:19:27Z)
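A brute-force sketch of the reversed task: given one document, scan a query log and keep the queries that rank it in their top-k. The shared encoder and tiny log are illustrative assumptions.

```python
# Sketch of exposing query identification (EQI): which logged queries
# surface a given document within their top-k results?
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "annual report on renewable energy investment",
    "guide to houseplant care",
    "solar panel subsidy changes announced",
]
query_log = ["green energy funding", "how to water succulents", "solar incentives"]
target_doc = 0  # the document whose exposure we want to audit
k = 1

doc_vecs = model.encode(corpus, normalize_embeddings=True)
q_vecs = model.encode(query_log, normalize_embeddings=True)

exposing = []
for qi, q in enumerate(q_vecs):
    top_k = np.argsort(doc_vecs @ q)[::-1][:k]
    if target_doc in top_k:
        exposing.append(query_log[qi])
print("queries exposing the document:", exposing)
```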
- TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval [103.85002875155551]
We propose a novel generalized distillation method, TeachText, for exploiting large-scale language pretraining.
We extend our method to video-side modalities and show that we can effectively reduce the number of modalities used at test time.
Our approach advances the state of the art on several video retrieval benchmarks by a significant margin and adds no computational overhead at test time.
arXiv Detail & Related papers (2021-04-16T17:55:28Z)
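In distillation terms, the idea can be caricatured as matching a student's query-video similarity matrix to an aggregate of several text-encoder teachers. The sketch below optimizes random matrices purely to show the objective's shape; the MSE form and mean aggregation are assumptions, not the paper's exact loss.

```python
# Sketch of a TeachText-style objective: distill several teachers'
# (query x video) similarity matrices into a single student matrix.
import torch

n_queries, n_videos = 4, 6
student_sim = torch.randn(n_queries, n_videos, requires_grad=True)

# Similarity matrices from two teacher text encoders (random stand-ins).
teacher_sims = [torch.randn(n_queries, n_videos) for _ in range(2)]
target = torch.stack(teacher_sims).mean(dim=0)  # aggregate the teachers

opt = torch.optim.SGD([student_sim], lr=0.1)
for step in range(100):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(student_sim, target)
    loss.backward()
    opt.step()
print(f"final distillation loss: {loss.item():.4f}")
```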
- Leveraging Cognitive Search Patterns to Enhance Automated Natural Language Retrieval Performance [0.0]
We highlight cognitive reformulation patterns that mimic user search behaviour.
We formalize the application of these patterns based on a conceptual representation of the query.
A genetic-algorithm-based weighting process places emphasis on terms according to their conceptual role type.
arXiv Detail & Related papers (2020-04-21T14:13:33Z)
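A toy genetic algorithm over per-term query weights gives the flavor of this weighting process; the fitness function, corpus, and GA settings are all assumptions.

```python
# Sketch: evolve per-term query weights with a tiny genetic algorithm,
# rewarding weightings that rank relevant documents first.
import random

terms = ["cheap", "flights", "paris"]
docs = {
    0: {"flights": 2, "paris": 3, "hotels": 1},
    1: {"cheap": 1, "shoes": 4},
}
relevant = {0}  # toy relevance judgments

def fitness(weights):
    # Score docs by weighted term overlap; reciprocal rank of relevant docs.
    scores = {d: sum(w * tf.get(t, 0) for t, w in zip(terms, weights))
              for d, tf in docs.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return sum(1.0 / (rank + 1) for rank, d in enumerate(ranking) if d in relevant)

def mutate(weights):
    return [max(0.0, w + random.gauss(0, 0.2)) for w in weights]

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

population = [[random.random() for _ in terms] for _ in range(20)]
for generation in range(30):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]  # keep the fittest, breed the rest
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    population = parents + children

best = max(population, key=fitness)
print(dict(zip(terms, (round(w, 2) for w in best))))
```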