SPM: Structured Pretraining and Matching Architectures for Relevance
Modeling in Meituan Search
- URL: http://arxiv.org/abs/2308.07711v3
- Date: Sun, 27 Aug 2023 11:21:38 GMT
- Title: SPM: Structured Pretraining and Matching Architectures for Relevance
Modeling in Meituan Search
- Authors: Wen Zan, Yaopeng Han, Xiaotian Jiang, Yao Xiao, Yang Yang, Dayao Chen,
Sheng Chen
- Abstract summary: In e-commerce search, relevance between the query and documents is essential for a satisfying user experience.
We propose a novel two-stage pretraining and matching architecture for relevance matching with rich structured documents.
The model has already been deployed online, serving the search traffic of Meituan for over a year.
- Score: 12.244685291395093
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In e-commerce search, relevance between the query and documents is an
essential requirement for a satisfying user experience. Unlike traditional
e-commerce platforms that offer products, users on life service platforms such
as Meituan mainly search for product providers, which usually have abundant
structured information, e.g., name, address, category, and thousands of
products. Modeling search relevance with these rich structured contents is
challenging due to the following issues: (1) there is a language distribution
discrepancy among the different fields of a structured document, making it
difficult to directly adopt off-the-shelf pretrained language model based
methods such as BERT; (2) different fields usually have different importance
and their lengths vary greatly, making it difficult to extract document
information that is helpful for relevance matching.
To tackle these issues, in this paper we propose a novel two-stage
pretraining and matching architecture for relevance matching with rich
structured documents. At the pretraining stage, we propose an effective
pretraining method that takes both the query and multiple document fields as
inputs, including an effective information compression method for lengthy
fields. At the relevance matching stage, a novel matching method is proposed
that leverages domain knowledge in the search query to generate more effective
document representations for relevance scoring. Extensive offline experiments and online
A/B tests on millions of users verify that the proposed architectures
effectively improve the performance of relevance modeling. The model has
already been deployed online, serving the search traffic of Meituan for over a
year.
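For concreteness, below is a minimal, hedged sketch of the kind of architecture the abstract describes: a query and several structured document fields are encoded jointly, and a lengthy field (e.g., a long product list) is first compressed into a few summary vectors before relevance scoring. All class names, dimensions, and the compression mechanism are illustrative assumptions and do not reproduce the paper's exact SPM design.

```python
# Minimal sketch (not the authors' exact SPM model): a query plus multiple
# document fields are encoded jointly; a lengthy field is compressed into a
# few summary vectors via attention before relevance scoring.
import torch
import torch.nn as nn

class StructuredRelevanceSketch(nn.Module):
    def __init__(self, vocab_size=30522, dim=128, n_heads=4, n_layers=2, n_summary=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Learnable "summary" queries that compress a long field via attention.
        self.summary = nn.Parameter(torch.randn(n_summary, dim))
        self.compress = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, query_ids, short_field_ids, long_field_ids):
        # Compress the lengthy field into n_summary vectors.
        long_emb = self.embed(long_field_ids)                          # (B, L, D)
        summary = self.summary.unsqueeze(0).expand(long_emb.size(0), -1, -1)
        long_summary, _ = self.compress(summary, long_emb, long_emb)   # (B, S, D)
        # Concatenate query, short fields, and the compressed long field.
        tokens = torch.cat([self.embed(query_ids), self.embed(short_field_ids)], dim=1)
        hidden = self.encoder(torch.cat([tokens, long_summary], dim=1))
        return self.score(hidden[:, 0]).squeeze(-1)    # first position as a [CLS]-like slot

model = StructuredRelevanceSketch()
q = torch.randint(0, 30522, (2, 8))        # query tokens
short = torch.randint(0, 30522, (2, 16))   # e.g. name + address + category
long = torch.randint(0, 30522, (2, 256))   # e.g. many product titles, truncated
print(model(q, short, long).shape)         # torch.Size([2]) relevance scores
```

The attention-based compression step stands in for the abstract's "information compression method for lengthy fields"; the pretraining objectives and the domain-knowledge-based matching stage are not shown.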
Related papers
- Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction [23.47150047875133]
Document parsing is essential for converting unstructured and semi-structured documents into machine-readable data.
Document parsing plays an indispensable role in both knowledge base construction and training data generation.
This paper discusses the challenges faced by modular document parsing systems and vision-language models in handling complex layouts.
arXiv Detail & Related papers (2024-10-28T16:11:35Z) - Multi-Field Adaptive Retrieval [39.38972160512916]
We introduce Multi-Field Adaptive Retrieval (MFAR), a flexible framework that accommodates any number of document indices on structured data.
Our framework consists of two main steps: (1) the decomposition of an existing document into fields, each indexed independently through dense and lexical methods, and (2) learning a model which adaptively predicts the importance of a field by conditioning on the document query.
We find that our approach allows for the optimized use of dense versus lexical representations across field types, significantly improves document ranking over a number of existing retrievers, and achieves state-of-the-art performance for multi-field structured data.
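A minimal sketch of the adaptive field-weighting idea summarized above, assuming per-field dense and lexical scores combined with query-conditioned weights; the function names, fields, and mixing scheme are hypothetical and do not reproduce the MFAR implementation.

```python
# Hedged sketch of query-conditioned field weighting (not the MFAR code):
# each field has a dense score and a lexical score; weights predicted from
# the query decide how much each field contributes to the final rank.
from dataclasses import dataclass
from typing import Dict

@dataclass
class FieldScores:
    dense: float    # e.g. cosine similarity from a bi-encoder
    lexical: float  # e.g. BM25 score, min-max normalized

def score_document(field_scores: Dict[str, FieldScores],
                   field_weights: Dict[str, float],
                   dense_vs_lexical: Dict[str, float]) -> float:
    """Combine per-field scores using query-dependent weights.

    field_weights    : importance of each field for this query (sums to 1).
    dense_vs_lexical : per-field mixing ratio in [0, 1]; 1.0 = dense only.
    """
    total = 0.0
    for name, s in field_scores.items():
        mix = dense_vs_lexical[name]
        total += field_weights[name] * (mix * s.dense + (1.0 - mix) * s.lexical)
    return total

# Toy usage: a merchant-style document with three structured fields.
scores = {
    "name":     FieldScores(dense=0.82, lexical=0.90),
    "address":  FieldScores(dense=0.10, lexical=0.05),
    "products": FieldScores(dense=0.65, lexical=0.40),
}
weights = {"name": 0.5, "address": 0.1, "products": 0.4}   # query-dependent
mixing  = {"name": 0.3, "address": 0.2, "products": 0.8}   # query-dependent
print(round(score_document(scores, weights, mixing), 3))
```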
arXiv Detail & Related papers (2024-10-26T03:07:22Z) - Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval [49.42043077545341]
We propose a knowledge-aware query expansion framework, augmenting LLMs with structured document relations from a knowledge graph (KG).
We leverage document texts as rich KG node representations and use document-based relation filtering for our Knowledge-Aware Retrieval (KAR).
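A small illustrative sketch of the general idea, assuming documents are KG nodes and the query is expanded with text from neighbors reachable via allowed relation types; the graph structure, filtering rule, and all names are assumptions, not the KAR implementation.

```python
# Hedged sketch of knowledge-aware query expansion (illustrative only):
# the query is expanded with text from documents related to the initially
# retrieved ones, after filtering KG relations by type.
from typing import Dict, List, Set, Tuple

def expand_query(query: str,
                 seed_docs: List[str],
                 kg_edges: Dict[str, List[Tuple[str, str]]],  # doc -> [(relation, doc)]
                 allowed_relations: Set[str],
                 doc_texts: Dict[str, str],
                 max_terms: int = 20) -> str:
    """Append snippets from KG neighbors reachable via allowed relations."""
    expansion: List[str] = []
    for doc in seed_docs:
        for relation, neighbor in kg_edges.get(doc, []):
            if relation in allowed_relations and neighbor in doc_texts:
                expansion.append(doc_texts[neighbor])
    extra = " ".join(expansion).split()[:max_terms]
    return query + " " + " ".join(extra)

# Toy usage with two documents linked by an assumed "brand_of" relation.
edges = {"d1": [("brand_of", "d2"), ("competitor_of", "d3")]}
texts = {"d2": "flagship store downtown open late", "d3": "unrelated rival"}
print(expand_query("coffee shop near me", ["d1"], edges, {"brand_of"}, texts))
```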
arXiv Detail & Related papers (2024-10-17T17:03:23Z) - Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z) - Query-oriented Data Augmentation for Session Search [71.84678750612754]
We propose query-oriented data augmentation to enrich search logs and empower the modeling.
We generate supplemental training pairs by altering the most important part of a search context.
We develop several strategies to alter the current query, resulting in new training data with varying degrees of difficulty.
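A minimal sketch of query-oriented augmentation under assumed strategies (masking or dropping terms, with a difficulty knob controlling how many terms change); the paper's concrete alteration strategies and difficulty schedule are not reproduced here.

```python
# Hedged sketch of query-oriented augmentation for session search (illustrative):
# new training pairs are created by altering the current query; heavier edits
# yield harder examples.
import random
from typing import List, Tuple

def augment_query(query: str, difficulty: float, rng: random.Random) -> str:
    """Return an altered query; difficulty in [0, 1] controls how many terms change."""
    terms = query.split()
    n_edit = max(1, int(round(difficulty * len(terms))))
    for p in rng.sample(range(len(terms)), min(n_edit, len(terms))):
        # Either mask or drop the chosen term.
        terms[p] = "[MASK]" if rng.random() < 0.5 else ""
    return " ".join(t for t in terms if t)

def build_pairs(queries: List[str], difficulty: float, seed: int = 0) -> List[Tuple[str, str]]:
    """Pair each original query with an augmented variant as supplemental data."""
    rng = random.Random(seed)
    return [(q, augment_query(q, difficulty, rng)) for q in queries]

print(build_pairs(["hotpot restaurant downtown", "24 hour pharmacy"], difficulty=0.5))
```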
arXiv Detail & Related papers (2024-07-04T08:08:33Z) - Improving Topic Relevance Model by Mix-structured Summarization and LLM-based Data Augmentation [16.170841777591345]
In most social search scenarios such as Dianping, modeling search relevance always faces two challenges.
We first take the query concatenated with the query-based summary, together with the query-independent document summary, as the input of the topic relevance model.
Then, we utilize the language understanding and generation abilities of a large language model (LLM) to rewrite and generate queries from the queries and documents in existing training data.
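A minimal sketch of the mix-structured input described above, assuming naive stand-in summarizers and a [SEP]-style separator; the paper's actual summarization method and LLM-based rewriting are not shown.

```python
# Hedged sketch of the mix-structured input (illustrative): the relevance model
# sees the query, a query-based summary, and a query-independent document
# summary, joined with separator tokens. The summarizers below are naive stand-ins.
from typing import List

def query_based_summary(query: str, doc_sentences: List[str], k: int = 2) -> str:
    """Pick the k sentences sharing the most terms with the query."""
    q_terms = set(query.lower().split())
    ranked = sorted(doc_sentences,
                    key=lambda s: len(q_terms & set(s.lower().split())),
                    reverse=True)
    return " ".join(ranked[:k])

def document_summary(doc_sentences: List[str], k: int = 2) -> str:
    """Query-independent summary: here simply the leading k sentences."""
    return " ".join(doc_sentences[:k])

def build_input(query: str, doc_sentences: List[str], sep: str = " [SEP] ") -> str:
    return sep.join([query,
                     query_based_summary(query, doc_sentences),
                     document_summary(doc_sentences)])

doc = ["Family-run noodle shop near the station.",
       "Signature beef noodles and cold dishes.",
       "Open daily from 10am to 10pm."]
print(build_input("beef noodles", doc))
```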
arXiv Detail & Related papers (2024-04-03T10:05:47Z) - CAPSTONE: Curriculum Sampling for Dense Retrieval with Document
Expansion [68.19934563919192]
We propose a curriculum sampling strategy that utilizes pseudo queries during training and progressively enhances the relevance between the generated query and the real query.
Experimental results on both in-domain and out-of-domain datasets demonstrate that our approach outperforms previous dense retrieval models.
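A minimal sketch of a curriculum that shifts from pseudo queries to real queries over training, assuming a simple linear schedule; this is illustrative and not CAPSTONE's exact sampling design.

```python
# Hedged sketch of curriculum sampling over pseudo vs. real queries (illustrative):
# early in training most examples use generated pseudo queries; the share of
# real queries grows with the training step.
import random
from typing import List, Tuple

def sample_query(step: int, total_steps: int,
                 pseudo_queries: List[str], real_queries: List[str],
                 rng: random.Random) -> Tuple[str, str]:
    """Return (source, query); the probability of a real query rises linearly."""
    p_real = min(1.0, step / max(1, total_steps))
    if rng.random() < p_real:
        return "real", rng.choice(real_queries)
    return "pseudo", rng.choice(pseudo_queries)

rng = random.Random(0)
pseudo = ["cheap noodles", "noodle place"]          # generated pseudo queries
real = ["best beef noodle restaurant open now"]     # logged real queries
for step in (0, 500, 1000):
    print(step, sample_query(step, 1000, pseudo, real, rng))
```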
arXiv Detail & Related papers (2022-12-18T15:57:46Z) - From Easy to Hard: A Dual Curriculum Learning Framework for
Context-Aware Document Ranking [41.8396866002968]
We propose a curriculum learning framework for context-aware document ranking.
We aim to guide the model gradually toward a global optimum.
Experiments on two real query log datasets show that our proposed framework can improve the performance of several existing methods significantly.
arXiv Detail & Related papers (2022-08-22T12:09:12Z) - Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z) - One-shot Key Information Extraction from Document with Deep Partial
Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.