A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge
- URL: http://arxiv.org/abs/2310.11703v1
- Date: Wed, 18 Oct 2023 04:31:06 GMT
- Title: A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge
- Authors: Yikun Han, Chunjiang Liu, Pengfei Wang
- Abstract summary: The approximate nearest neighbor search problem behind vector databases has been studied for a long time.
This article attempts to comprehensively review relevant algorithms to provide a general understanding of this booming research area.
- Score: 4.579314354865921
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A vector database is used to store high-dimensional data that cannot be
characterized by a traditional DBMS. Although few articles describe existing
vector database architectures or introduce new ones, the approximate nearest
neighbor search (ANNS) problem behind vector databases has been studied for a
long time, and a considerable number of related algorithmic articles can be
found in the literature. This article attempts to comprehensively review
relevant algorithms to provide a general understanding of this booming research
area. Our framework categorises these studies by their approach to solving the
ANNS problem: hash-based, tree-based, graph-based, and quantization-based. We
then present an overview of existing challenges for vector databases. Lastly, we
sketch how vector databases can be combined with large language models to
provide new possibilities.
Related papers
- A Survey on Computational Solutions for Reconstructing Complete Objects by Reassembling Their Fractured Parts [25.59032022422813]
Reconstructing a complete object from its parts is a fundamental problem in many scientific domains.
We provide existing algorithms in this context and emphasize their similarities and differences to general-purpose approaches.
In addition to algorithms, this survey will also describe existing datasets, open-source software packages, and applications.
arXiv Detail & Related papers (2024-10-18T17:53:07Z)
- Dissecting embedding method: learning higher-order structures from data [0.0]
Geometric deep learning methods for data learning often include a set of assumptions on the geometry of the feature space.
These assumptions, together with the data being discrete and finite, can cause some generalisations that are likely to create wrong interpretations of the data and model outputs.
arXiv Detail & Related papers (2024-10-14T08:19:39Z)
- BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data [61.936320820180875]
Large language models (LLMs) have become increasingly pivotal across various domains.
BabelBench is an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution.
Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
arXiv Detail & Related papers (2024-10-01T15:11:24Z)
- Deep Learning-Based Object Pose Estimation: A Comprehensive Survey [73.74933379151419]
We discuss the recent advances in deep learning-based object pose estimation.
Our survey also covers multiple input data modalities, degrees-of-freedom of output poses, object properties, and downstream tasks.
arXiv Detail & Related papers (2024-05-13T14:44:22Z)
- Using text embedding models and vector databases as text classifiers with the example of medical data [0.0]
We explore the use of vector databases and embedding models as a means of encoding and classifying text, with an example application in the field of medicine.
We show that the robustness of these tools depends heavily on the sparsity of the data presented, and that even with low amounts of data in the vector database itself, the vector database does a good job of classifying data.
arXiv Detail & Related papers (2024-02-07T22:15:15Z)
- Text2Analysis: A Benchmark of Table Question Answering with Advanced Data Analysis and Unclear Queries [67.0083902913112]
We develop the Text2Analysis benchmark, incorporating advanced analysis tasks.
We also develop five innovative and effective annotation methods.
We evaluate five state-of-the-art models using three different metrics.
arXiv Detail & Related papers (2023-12-21T08:50:41Z)
- Rethinking Complex Queries on Knowledge Graphs with Neural Link Predictors [58.340159346749964]
We propose a new neural-symbolic method to support end-to-end learning using complex queries with provable reasoning capability.
We develop a new dataset containing ten new types of queries with features that have never been considered.
Our method significantly outperforms previous methods on the new dataset and also surpasses them on the existing dataset.
arXiv Detail & Related papers (2023-04-14T11:35:35Z)
- Deep learning for table detection and structure recognition: A survey [49.09628624903334]
The goal of this survey is to provide a profound comprehension of the major developments in the field of Table Detection.
We provide an analysis of both classic and new applications in the field.
The datasets and source code of the existing models are organized to provide the reader with a compass on this vast literature.
arXiv Detail & Related papers (2022-11-15T19:42:27Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- Complex Coordinate-Based Meta-Analysis with Probabilistic Programming [0.0]
Coordinate-based meta-analysis (CBMA) databases are built by automatically extracting both coordinates of reported peak activations and term associations.
We show how recent lifted query processing algorithms make it possible to scale to the size of large neuroimaging data.
We demonstrate results for two-term conjunctive queries, both on simulated meta-analysis databases and on the widely-used Neurosynth database.
arXiv Detail & Related papers (2020-12-02T16:16:26Z)
- Characterizing Transactional Databases for Frequent Itemset Mining [0.0]
This paper presents a study of the characteristics of transactional databases used in frequent itemset mining.
Our proposed list of metrics contains many of the existing metrics found in the literature, as well as new ones.
We provide a set of representative datasets, based on our characterization, that may safely be used as a benchmark.
arXiv Detail & Related papers (2020-11-09T12:26:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.