Semi-Structured Query Grounding for Document-Oriented Databases with
Deep Retrieval and Its Application to Receipt and POI Matching
- URL: http://arxiv.org/abs/2202.13959v1
- Date: Wed, 23 Feb 2022 05:32:34 GMT
- Authors: Geewook Kim, Wonseok Hwang, Minjoon Seo, Seunghyun Park
- Abstract summary: We aim to address practical challenges when using embedding-based retrieval for the query grounding problem in semi-structured data.
We conduct extensive experiments to find the most effective combination of modules for the embedding and retrieval of both query and database entries.
The proposed model significantly outperforms the conventional manual pattern-based model while requiring much lower development and maintenance cost.
- Score: 23.52046767195031
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semi-structured query systems for document-oriented databases have many real-world
applications. One particular application that we are interested in is matching
each financial receipt image with its corresponding place of interest (POI,
e.g., restaurant) in the nationwide database. The problem is especially
challenging in the real production environment where many similar or incomplete
entries exist in the database and queries are noisy (e.g., errors in optical
character recognition). In this work, we aim to address practical challenges
when using embedding-based retrieval for the query grounding problem in
semi-structured data. Leveraging recent advancements in deep language encoding
for retrieval, we conduct extensive experiments to find the most effective
combination of modules for the embedding and retrieval of both query and
database entries without any manually engineered component. The proposed model
significantly outperforms the conventional manual pattern-based model while
requiring much lower development and maintenance cost. We also discuss some core
observations in our experiments, which could be helpful for practitioners
working on a similar problem in other domains.
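
The embedding-based grounding described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' model: the field names, the sample records, and the hashed character n-gram `embed` function are all assumptions made for the example, with the n-gram hashing standing in for the deep language encoder the paper relies on. The sketch flattens each semi-structured record into text, embeds the query and the database entries in the same space, and ranks entries by cosine similarity.

```python
import math
import zlib
from collections import Counter

DIM = 512  # size of the toy embedding space (arbitrary choice)


def serialize(record):
    """Flatten a semi-structured record into one string, skipping empty fields."""
    return " | ".join(f"{k}: {v}" for k, v in sorted(record.items()) if v)


def embed(text, n=3):
    """Stand-in encoder: hashed character n-gram counts, L2-normalized.

    Assumption: a real system would use a trained deep language encoder here.
    """
    text = f" {text.lower()} "
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    vec = [0.0] * DIM
    for gram, c in counts.items():
        vec[zlib.crc32(gram.encode("utf-8")) % DIM] += c
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def retrieve(query, entries, k=2):
    """Return the k (score, entry) pairs most similar to the query by cosine."""
    q = embed(serialize(query))
    scored = []
    for entry in entries:
        e = embed(serialize(entry))
        scored.append((sum(a * b for a, b in zip(q, e)), entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical POI database and a noisy OCR-style query (all values invented).
POIS = [
    {"name": "Blue Bottle Cafe", "address": "12 Harbor Street", "phone": "555-0101"},
    {"name": "Blue Bottle Bakery", "address": "48 Mill Road", "phone": "555-0144"},
    {"name": "Green Garden Noodles", "address": "12 Harbor Street", "phone": ""},
]
RECEIPT = {"name": "B1ue Bott1e Cafe", "address": "12 Harbor St", "phone": "555-0101"}

if __name__ == "__main__":
    for score, poi in retrieve(RECEIPT, POIS):
        print(f"{score:.3f}  {poi['name']}")
```

Because the query and the entries share one serialization and one encoder, OCR-style character errors and missing fields only lower the similarity score instead of breaking a hand-written matching rule, which is the trade-off the abstract contrasts with the manual pattern-based baseline.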
Related papers
- BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data [61.936320820180875]
Large language models (LLMs) have become increasingly pivotal across various domains.
BabelBench is an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution.
Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
Date: 2024-10-01T15:11:24Z
- BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval [54.54576644403115]
Many complex real-world queries require in-depth reasoning to identify relevant documents.
We introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents.
Our dataset consists of 1,384 real-world queries spanning diverse domains, such as economics, psychology, mathematics, and coding.
Date: 2024-07-16T17:58:27Z
- CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [56.691926887209895]
We present CoIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities.
CoIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains.
We evaluate nine widely used retrieval models using CoIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
Date: 2024-07-03T07:58:20Z
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
Date: 2024-07-01T18:58:22Z
- Database-Augmented Query Representation for Information Retrieval [59.57065228857247]
We present a novel retrieval framework called Database-Augmented Query representation (DAQu).
DAQu augments the original query with various (query-related) metadata across multiple tables.
We validate DAQu in diverse retrieval scenarios that can incorporate metadata from the relational database.
Date: 2024-06-23T05:02:21Z
- STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases [93.96463520716759]
We develop STaRK, a large-scale semi-structured retrieval benchmark on Textual and Relational Knowledge Bases.
Our benchmark covers three domains: product search, academic paper search, and queries in precision medicine.
We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties.
Date: 2024-04-19T22:54:54Z
- Improving Topic Relevance Model by Mix-structured Summarization and LLM-based Data Augmentation [16.170841777591345]
In most social search scenarios such as Dianping, modeling search relevance always faces two challenges.
We first take the query combined with the query-based summary, and the document summary without the query, as the input of the topic relevance model.
Then, we utilize the language understanding and generation abilities of a large language model (LLM) to rewrite and generate queries from the queries and documents in existing training data.
Date: 2024-04-03T10:05:47Z
- SPM: Structured Pretraining and Matching Architectures for Relevance Modeling in Meituan Search [12.244685291395093]
In e-commerce search, relevance between the query and documents is an essential requirement for a satisfying user experience.
We propose a novel two-stage pretraining and matching architecture for relevance matching with rich structured documents.
The model has already been deployed online, serving the search traffic of Meituan for over a year.
Date: 2023-08-15T11:45:34Z
- AskYourDB: An end-to-end system for querying and visualizing relational databases using natural language [0.0]
We propose a semantic parsing approach to address the challenge of converting complex natural language into SQL.
We modified state-of-the-art models with various pre- and post-processing steps, which make up a significant part of the work when a model is deployed in production.
To make the product serviceable to businesses, we added an automatic visualization framework over the queried results.
Date: 2022-10-16T13:31:32Z
- Towards a Natural Language Query Processing System [0.0]
This paper reports our study on the design and development of a natural language query interface to a backend relational database.
The novelty of the study lies in defining a graph database as a middle layer that stores the metadata needed to transform a natural language query into structured query language.
The translation results for some sample queries yielded a 90% accuracy rate.
Date: 2020-09-25T19:52:20Z
This list is automatically generated from the titles and abstracts of the papers on this site.