Towards Better Search with Domain-Aware Text Embeddings for C2C Marketplaces
- URL: http://arxiv.org/abs/2512.21021v1
- Date: Wed, 24 Dec 2025 07:35:17 GMT
- Title: Towards Better Search with Domain-Aware Text Embeddings for C2C Marketplaces
- Authors: Andre Rusli, Miao Cao, Shoma Ishimoto, Sho Akiyama, Max Frenzel,
- Abstract summary: We build a domain-aware Japanese text-embedding approach to improve the quality of search at Mercari, Japan's largest C2C marketplace. To meet production constraints, we apply Matryoshka Representation Learning to obtain compact, truncation-robust embeddings.
- Score: 3.8273208793317743
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Consumer-to-consumer (C2C) marketplaces pose distinct retrieval challenges: short, ambiguous queries; noisy, user-generated listings; and strict production constraints. This paper reports our experiment to build a domain-aware Japanese text-embedding approach to improve the quality of search at Mercari, Japan's largest C2C marketplace. We experimented with fine-tuning on purchase-driven query-title pairs, using role-specific prefixes to model query-item asymmetry. To meet production constraints, we apply Matryoshka Representation Learning to obtain compact, truncation-robust embeddings. Offline evaluation on historical search logs shows consistent gains over a strong generic encoder, with particularly large improvements when replacing PCA compression with Matryoshka truncation. A manual assessment further highlights better handling of proper nouns, marketplace-specific semantics, and term-importance alignment. Additionally, an initial online A/B test demonstrates statistically significant improvements in revenue per user and search-flow efficiency, with transaction frequency maintained. Results show that domain-aware embeddings improve relevance and efficiency at scale and form a practical foundation for richer LLM-era search experiences.
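The sketch below is a rough illustration of the two techniques named in the abstract: role-specific prefixes to model the query-item asymmetry, and Matryoshka-style truncation of the embeddings. The encoder name, prefix strings, truncation dimension, and example texts are assumptions for illustration only; the paper does not disclose its production configuration.

```python
# A minimal sketch (not the production system): role-specific prefixes for the
# query/item asymmetry, plus Matryoshka-style truncation of the embeddings.
# The model name, prefix strings, truncation dimension, and example texts are
# assumptions for illustration; the paper does not publish its configuration.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")  # assumed generic encoder

QUERY_PREFIX = "query: "   # prefix for short, ambiguous user queries
ITEM_PREFIX = "passage: "  # prefix for noisy, user-generated listing titles


def embed(texts, prefix, dim=256):
    """Encode texts with a role prefix, keep only the first `dim` dimensions,
    and re-normalize -- the truncation step that MRL makes robust."""
    full = model.encode([prefix + t for t in texts], normalize_embeddings=True)
    truncated = full[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)


query_vecs = embed(["switch 本体", "ポケモンカード 151"], QUERY_PREFIX)
item_vecs = embed(["Nintendo Switch 本体のみ 動作確認済み"], ITEM_PREFIX)
print(query_vecs @ item_vecs.T)  # cosine similarities after truncation
```

Because Matryoshka Representation Learning trains nested prefixes of the embedding to be useful on their own, truncating and re-normalizing in this way keeps ranking quality largely intact at a fraction of the dimensionality, which is what allows it to replace PCA compression under the production constraints described above.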
Related papers
- Rerank Before You Reason: Analyzing Reranking Tradeoffs through Effective Token Cost in Deep Search Agents [50.212640395029744]
We study how to allocate reasoning budget in deep search pipelines.
Using the BrowseComp-Plus benchmark, we analyze tradeoffs between model scale, reasoning effort, reranking depth, and total token cost.
arXiv Detail & Related papers (2026-01-20T18:38:35Z) - LLMs as Sparse Retrievers: A Framework for First-Stage Product Search [103.70006474544364]
Product search is a crucial component of modern e-commerce platforms, with billions of user queries every day.
Sparse retrieval methods suffer from severe vocabulary mismatch issues, leading to suboptimal performance in product search scenarios.
With their potential for semantic analysis, large language models (LLMs) offer a promising avenue for mitigating vocabulary mismatch issues.
We propose PROSPER, a framework for PROduct search leveraging LLMs as SParsE Retrievers.
arXiv Detail & Related papers (2025-10-21T11:13:21Z) - Generating Query-Relevant Document Summaries via Reinforcement Learning [5.651096645934245]
ReLSum is a reinforcement learning framework designed to generate query-relevant summaries of product descriptions optimized for search relevance.
The framework employs a trainable large language model (LLM) to produce summaries, which are then used as input for a cross-encoder ranking model.
Experimental results demonstrate significant improvements in offline metrics, including recall and NDCG, as well as online user engagement metrics.
arXiv Detail & Related papers (2025-08-11T18:52:28Z) - Zero-Shot Retrieval for Scalable Visual Search in a Two-Sided Marketplace [0.0]
This paper presents a scalable visual search system deployed in Mercari's C2C marketplace.
We evaluate recent vision-language models for zero-shot image retrieval and compare their performance with an existing fine-tuned baseline.
arXiv Detail & Related papers (2025-07-31T05:13:20Z) - NEAR$^2$: A Nested Embedding Approach to Efficient Product Retrieval and Ranking [14.008264174074487]
We propose a Nested Embedding Approach to product Retrieval and Ranking, called NEAR$^2$.
Our approach achieves improved performance at smaller embedding dimensions compared to existing models.
arXiv Detail & Related papers (2025-06-24T16:02:02Z) - Automated Query-Product Relevance Labeling using Large Language Models for E-commerce Search [3.392843594990172]
Traditional approaches for annotating query-product pairs rely on human-based labeling services.
We show that Large Language Models (LLMs) can approach human-level accuracy on this task in a fraction of the time and cost required by human labelers.
This scalable alternative to human annotation has significant implications for information retrieval domains.
arXiv Detail & Related papers (2025-02-21T22:59:36Z) - Enhancing Legal Case Retrieval via Scaling High-quality Synthetic Query-Candidate Pairs [67.54302101989542]
Legal case retrieval aims to provide similar cases as references for a given fact description.
Existing works mainly focus on case-to-case retrieval using lengthy queries.
However, the scale of available training data is insufficient to satisfy the requirements of existing data-hungry neural models.
arXiv Detail & Related papers (2024-10-09T06:26:39Z) - Relevance Filtering for Embedding-based Retrieval [46.851594313019895]
In embedding-based retrieval, Approximate Nearest Neighbor (ANN) search enables efficient retrieval of similar items from large-scale datasets.
However, the nearest neighbors returned by ANN search are not guaranteed to be relevant to the query.
This paper introduces a novel relevance filtering component (called "Cosine Adapter") for embedding-based retrieval to address this challenge.
We are able to significantly increase the precision of the retrieved set, at the expense of a small loss of recall (a simplified sketch of this kind of filtering follows the list below).
arXiv Detail & Related papers (2024-08-09T06:21:20Z) - Bridging the Domain Gaps in Context Representations for k-Nearest Neighbor Neural Machine Translation [57.49095610777317]
$k$-Nearest neighbor machine translation ($k$NN-MT) has attracted increasing attention due to its ability to non-parametrically adapt to new translation domains.
We propose a novel approach to boost the datastore retrieval of $k$NN-MT by reconstructing the original datastore.
Our method can effectively boost the datastore retrieval and translation quality of $k$NN-MT.
arXiv Detail & Related papers (2023-05-26T03:04:42Z) - Evaluating Embedding APIs for Information Retrieval [51.24236853841468]
We evaluate the capabilities of existing semantic embedding APIs on domain generalization and multilingual retrieval.
We find that re-ranking BM25 results using the APIs is a budget-friendly approach and is most effective in English.
For non-English retrieval, re-ranking still improves the results, but a hybrid model with BM25 works best, albeit at a higher cost.
arXiv Detail & Related papers (2023-05-10T16:40:52Z)
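The "Relevance Filtering for Embedding-based Retrieval" entry above filters ANN candidates by a calibrated cosine score. The sketch below illustrates only the general idea with a plain fixed threshold; the actual Cosine Adapter learns a query-aware mapping of raw cosine similarities, so the threshold, function name, and random data here are assumptions used purely for illustration.

```python
# Simplified illustration of post-retrieval relevance filtering by cosine score.
# The paper's "Cosine Adapter" learns a query-aware mapping of raw cosine
# similarities before thresholding; the fixed threshold, function name, and
# random data below are assumptions used only to keep the sketch short.
import numpy as np


def filter_candidates(query_vec, cand_vecs, cand_ids, threshold=0.8):
    """Drop ANN candidates whose cosine similarity to the query falls below
    `threshold`, trading a small loss of recall for higher precision."""
    q = query_vec / np.linalg.norm(query_vec)
    c = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    sims = c @ q
    return [(cid, float(s)) for cid, s in zip(cand_ids, sims) if s >= threshold]


rng = np.random.default_rng(0)
query = rng.normal(size=64)
candidates = rng.normal(size=(5, 64))
ids = [f"item-{i}" for i in range(5)]
print(filter_candidates(query, candidates, ids, threshold=0.0))
```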