A Comprehensive Survey on Vector Database: Storage and Retrieval
Technique, Challenge
- URL: http://arxiv.org/abs/2310.11703v1
- Date: Wed, 18 Oct 2023 04:31:06 GMT
- Title: A Comprehensive Survey on Vector Database: Storage and Retrieval
Technique, Challenge
- Authors: Yikun Han, Chunjiang Liu, Pengfei Wang
- Abstract summary: The approximate nearest neighbor search problem behind vector databases has been studied for a long time.
This article attempts to comprehensively review relevant algorithms to provide a general understanding of this booming research area.
- Score: 4.579314354865921
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A vector database is used to store high-dimensional data that cannot be
characterized by traditional DBMS. Although there are not many articles
describing existing or introducing new vector database architectures, the
approximate nearest neighbor search (ANNS) problem behind vector databases has been
studied for a long time, and a considerable number of related algorithmic articles can be
found in the literature. This article attempts to comprehensively review
relevant algorithms to provide a general understanding of this booming research
area. Our framework categorises these studies by their approach to
solving the ANNS problem: hash-based, tree-based, graph-based and
quantization-based approaches. Then we present an overview of existing
challenges for vector databases. Lastly, we sketch how vector databases can be
combined with large language models and provide new possibilities.
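As a rough illustration of the hash-based family named in the abstract, the sketch below contrasts an exact brute-force search with a random-hyperplane locality-sensitive hash. It is not taken from the paper: the class name `HyperplaneLSH`, the helper `exact_nearest`, the default of 8 hyperplanes, and the choice to rank bucket candidates by squared Euclidean distance are all assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_nearest(query, vectors):
    """Brute-force baseline: index of the closest stored vector."""
    return min(range(len(vectors)), key=lambda i: squared_distance(query, vectors[i]))


class HyperplaneLSH:
    """Hypothetical random-hyperplane LSH index (illustrative only).

    Each vector is hashed to the tuple of signs of its dot products with
    a fixed set of random hyperplanes; vectors pointing in similar
    directions tend to land in the same bucket.
    """

    def __init__(self, dim, n_planes=8, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
        self.buckets = {}   # hash key -> list of vector indices
        self.vectors = []   # stored vectors, in insertion order

    def _key(self, v):
        return tuple(sum(p * x for p, x in zip(plane, v)) >= 0 for plane in self.planes)

    def add(self, v):
        self.vectors.append(v)
        self.buckets.setdefault(self._key(v), []).append(len(self.vectors) - 1)

    def query(self, q):
        """Approximate search: scan only the bucket the query hashes to.

        Returns None if that bucket is empty, and may miss the true
        nearest neighbor if it was hashed to a different bucket.
        """
        candidates = self.buckets.get(self._key(q), [])
        if not candidates:
            return None
        return min(candidates, key=lambda i: squared_distance(q, self.vectors[i]))
```

The trade-off is the one the ANNS literature revolves around: `exact_nearest` always finds the true neighbor but touches every vector, while `HyperplaneLSH.query` inspects a single bucket and accepts that the answer may be missing or only approximately nearest.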
Related papers
- Deep Learning-Based Object Pose Estimation: A Comprehensive Survey [73.74933379151419]
We discuss the recent advances in deep learning-based object pose estimation.
Our survey also covers multiple input data modalities, degrees-of-freedom of output poses, object properties, and downstream tasks.
arXiv Detail & Related papers (2024-05-13T14:44:22Z)
- Using text embedding models and vector databases as text classifiers with the example of medical data [0.0]
We explore the use of vector databases and embedding models as a means of encoding and classifying text, using the field of medicine as an example application.
We show the robustness of these tools depends heavily on the sparsity of the data presented, and even with low amounts of data in the vector database itself, the vector database does a good job at classifying data.
arXiv Detail & Related papers (2024-02-07T22:15:15Z)
- Text2Analysis: A Benchmark of Table Question Answering with Advanced Data Analysis and Unclear Queries [67.0083902913112]
We develop the Text2Analysis benchmark, incorporating advanced analysis tasks.
We also develop five innovative and effective annotation methods.
We evaluate five state-of-the-art models using three different metrics.
arXiv Detail & Related papers (2023-12-21T08:50:41Z)
- A Hierarchical Approach to exploiting Multiple Datasets from TalkBank [0.0]
This paper introduces a pipeline framework that employs a hierarchical search approach, enabling efficient complex data selection.
The framework can also be adapted to process data from other open-science platforms.
arXiv Detail & Related papers (2023-06-21T22:37:51Z)
- Rethinking Complex Queries on Knowledge Graphs with Neural Link Predictors [65.56849255423866]
We propose a new neural-symbolic method to support end-to-end learning using complex queries with provable reasoning capability.
We develop a new dataset containing ten new types of queries with features that have never been considered.
Our method significantly outperforms previous methods on the new dataset and also surpasses them on the existing dataset.
arXiv Detail & Related papers (2023-04-14T11:35:35Z)
- Deep learning for table detection and structure recognition: A survey [49.09628624903334]
The goal of this survey is to provide a profound comprehension of the major developments in the field of Table Detection.
We provide an analysis of both classic and new applications in the field.
The datasets and source code of the existing models are organized to provide the reader with a compass on this vast literature.
arXiv Detail & Related papers (2022-11-15T19:42:27Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- Interpreting multi-variate models with setPCA [0.038478302549231076]
We present an algorithmic method which has been developed to integrate "omics" data with existing databases of background knowledge.
We have produced a Graphical User Interface (GUI) in Matlab which allows the overlay of known set information onto the loadings plot.
For each known set the optimal convex hull, covering a subset of elements from the known set, is found through a search algorithm and displayed.
arXiv Detail & Related papers (2021-11-17T14:22:19Z)
- Complex Coordinate-Based Meta-Analysis with Probabilistic Programming [0.0]
Coordinate-based meta-analysis (CBMA) databases are built by automatically extracting both coordinates of reported peak activations and term associations.
We show how recent lifted query processing algorithms make it possible to scale to the size of large neuroimaging data.
We demonstrate results for two-term conjunctive queries, both on simulated meta-analysis databases and on the widely-used Neurosynth database.
arXiv Detail & Related papers (2020-12-02T16:16:26Z)
- Characterizing Transactional Databases for Frequent Itemset Mining [0.0]
This paper presents a study of the characteristics of transactional databases used in frequent itemset mining.
Our proposed list of metrics contains many of the existing metrics found in the literature, as well as new ones.
We provide a set of representative datasets based on our characterization that may be used as a benchmark safely.
arXiv Detail & Related papers (2020-11-09T12:26:14Z)
- A Survey of Embedding Space Alignment Methods for Language and Knowledge Graphs [77.34726150561087]
We survey the current research landscape on word, sentence and knowledge graph embedding algorithms.
We provide a classification of the relevant alignment techniques and discuss benchmark datasets used in this field of research.
arXiv Detail & Related papers (2020-10-26T16:08:13Z)