Towards Better Search with Domain-Aware Text Embeddings for C2C Marketplaces
- URL: http://arxiv.org/abs/2512.21021v1
- Date: Wed, 24 Dec 2025 07:35:17 GMT
- Title: Towards Better Search with Domain-Aware Text Embeddings for C2C Marketplaces
- Authors: Andre Rusli, Miao Cao, Shoma Ishimoto, Sho Akiyama, Max Frenzel,
- Abstract summary: We build a domain-aware Japanese text-embedding approach to improve the quality of search at Mercari, Japan's largest C2C marketplace. To meet production constraints, we apply Matryoshka Representation Learning to obtain compact, truncation-robust embeddings.
- Score: 3.8273208793317743
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Consumer-to-consumer (C2C) marketplaces pose distinct retrieval challenges: short, ambiguous queries; noisy, user-generated listings; and strict production constraints. This paper reports our experiment to build a domain-aware Japanese text-embedding approach to improve the quality of search at Mercari, Japan's largest C2C marketplace. We experimented with fine-tuning on purchase-driven query-title pairs, using role-specific prefixes to model query-item asymmetry. To meet production constraints, we apply Matryoshka Representation Learning to obtain compact, truncation-robust embeddings. Offline evaluation on historical search logs shows consistent gains over a strong generic encoder, with particularly large improvements when replacing PCA compression with Matryoshka truncation. A manual assessment further highlights better handling of proper nouns, marketplace-specific semantics, and term-importance alignment. Additionally, an initial online A/B test demonstrates statistically significant improvements in revenue per user and search-flow efficiency, with transaction frequency maintained. Results show that domain-aware embeddings improve relevance and efficiency at scale and form a practical foundation for richer LLM-era search experiences.
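The sketch below is a rough illustration of the two techniques named in the abstract: role-specific prefixes to model the query-item asymmetry, and Matryoshka-style truncation of the embeddings. The encoder name, prefix strings, truncation dimension, and example texts are assumptions for illustration only; the paper does not disclose its production configuration.

```python
# A minimal sketch (not the production system): role-specific prefixes for the
# query/item asymmetry, plus Matryoshka-style truncation of the embeddings.
# The model name, prefix strings, truncation dimension, and example texts are
# assumptions for illustration; the paper does not publish its configuration.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")  # assumed generic encoder

QUERY_PREFIX = "query: "   # prefix for short, ambiguous user queries
ITEM_PREFIX = "passage: "  # prefix for noisy, user-generated listing titles


def embed(texts, prefix, dim=256):
    """Encode texts with a role prefix, keep only the first `dim` dimensions,
    and re-normalize -- the truncation step that MRL makes robust."""
    full = model.encode([prefix + t for t in texts], normalize_embeddings=True)
    truncated = full[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)


query_vecs = embed(["switch 本体", "ポケモンカード 151"], QUERY_PREFIX)
item_vecs = embed(["Nintendo Switch 本体のみ 動作確認済み"], ITEM_PREFIX)
print(query_vecs @ item_vecs.T)  # cosine similarities after truncation
```

Because Matryoshka Representation Learning trains nested prefixes of the embedding to be useful on their own, truncating and re-normalizing in this way keeps ranking quality largely intact at a fraction of the dimensionality, which is what allows it to replace PCA compression under the production constraints described above.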
Related papers
- Rerank Before You Reason: Analyzing Reranking Tradeoffs through Effective Token Cost in Deep Search Agents [50.212640395029744]
We study how to allocate reasoning budget in deep search pipelines.
Using the BrowseComp-Plus benchmark, we analyze tradeoffs between model scale, reasoning effort, reranking depth, and total token cost.
arXiv Detail & Related papers (2026-01-20T18:38:35Z) - LLMs as Sparse Retrievers: A Framework for First-Stage Product Search [103.70006474544364]
Product search is a crucial component of modern e-commerce platforms, with billions of user queries every day.
Sparse retrieval methods suffer from severe vocabulary mismatch issues, leading to suboptimal performance in product search scenarios.
With their potential for semantic analysis, large language models (LLMs) offer a promising avenue for mitigating vocabulary mismatch issues.
We propose PROSPER, a framework for PROduct search leveraging LLMs as SParsE Retrievers.
arXiv Detail & Related papers (2025-10-21T11:13:21Z) - Generating Query-Relevant Document Summaries via Reinforcement Learning [5.651096645934245]
ReLSum is a reinforcement learning framework designed to generate query-relevant summaries of product descriptions optimized for search relevance.
The framework employs a trainable large language model (LLM) to produce summaries, which are then used as input for a cross-encoder ranking model.
Experimental results demonstrate significant improvements in offline metrics, including recall and NDCG, as well as online user engagement metrics.
arXiv Detail & Related papers (2025-08-11T18:52:28Z) - Zero-Shot Retrieval for Scalable Visual Search in a Two-Sided Marketplace [0.0]
This paper presents a scalable visual search system deployed in Mercari's C2C marketplace.
We evaluate recent vision-language models for zero-shot image retrieval and compare their performance with an existing fine-tuned baseline.
arXiv Detail & Related papers (2025-07-31T05:13:20Z) - NEAR$^2$: A Nested Embedding Approach to Efficient Product Retrieval and Ranking [14.008264174074487]
We propose a Nested Embedding Approach to product Retrieval and Ranking, called NEAR$^2$.
Our approach achieves improved performance at smaller embedding dimensions compared to existing models.
arXiv Detail & Related papers (2025-06-24T16:02:02Z) - Automated Query-Product Relevance Labeling using Large Language Models for E-commerce Search [3.392843594990172]
Traditional approaches for annotating query-product pairs rely on human-based labeling services.
We show that Large Language Models (LLMs) can approach human-level accuracy on this task in a fraction of the time and cost required by human labelers.
This scalable alternative to human annotation has significant implications for information retrieval domains.
arXiv Detail & Related papers (2025-02-21T22:59:36Z) - Enhancing Legal Case Retrieval via Scaling High-quality Synthetic Query-Candidate Pairs [67.54302101989542]
Legal case retrieval aims to provide similar cases as references for a given fact description.
Existing works mainly focus on case-to-case retrieval using lengthy queries.
However, the scale of available training data is insufficient to satisfy the requirements of existing data-hungry neural models.
arXiv Detail & Related papers (2024-10-09T06:26:39Z) - Relevance Filtering for Embedding-based Retrieval [46.851594313019895]
In embedding-based retrieval, Approximate Nearest Neighbor (ANN) search enables efficient retrieval of similar items from large-scale datasets.
However, the nearest neighbors returned by ANN search are not guaranteed to be relevant to the query.
This paper introduces a novel relevance filtering component (called "Cosine Adapter") for embedding-based retrieval to address this challenge.
We are able to significantly increase the precision of the retrieved set, at the expense of a small loss of recall (a simplified sketch of this kind of filtering follows the list below).
arXiv Detail & Related papers (2024-08-09T06:21:20Z) - Bridging the Domain Gaps in Context Representations for k-Nearest Neighbor Neural Machine Translation [57.49095610777317]
$k$-Nearest neighbor machine translation ($k$NN-MT) has attracted increasing attention due to its ability to non-parametrically adapt to new translation domains.
We propose a novel approach to boost the datastore retrieval of $k$NN-MT by reconstructing the original datastore.
Our method can effectively boost the datastore retrieval and translation quality of $k$NN-MT.
arXiv Detail & Related papers (2023-05-26T03:04:42Z) - Evaluating Embedding APIs for Information Retrieval [51.24236853841468]
We evaluate the capabilities of existing semantic embedding APIs on domain generalization and multilingual retrieval.
We find that re-ranking BM25 results using the APIs is a budget-friendly approach and is most effective in English.
For non-English retrieval, re-ranking still improves the results, but a hybrid model with BM25 works best, albeit at a higher cost.
arXiv Detail & Related papers (2023-05-10T16:40:52Z)
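The "Relevance Filtering for Embedding-based Retrieval" entry above filters ANN candidates by a calibrated cosine score. The sketch below illustrates only the general idea with a plain fixed threshold; the actual Cosine Adapter learns a query-aware mapping of raw cosine similarities, so the threshold, function name, and random data here are assumptions used purely for illustration.

```python
# Simplified illustration of post-retrieval relevance filtering by cosine score.
# The paper's "Cosine Adapter" learns a query-aware mapping of raw cosine
# similarities before thresholding; the fixed threshold, function name, and
# random data below are assumptions used only to keep the sketch short.
import numpy as np


def filter_candidates(query_vec, cand_vecs, cand_ids, threshold=0.8):
    """Drop ANN candidates whose cosine similarity to the query falls below
    `threshold`, trading a small loss of recall for higher precision."""
    q = query_vec / np.linalg.norm(query_vec)
    c = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    sims = c @ q
    return [(cid, float(s)) for cid, s in zip(cand_ids, sims) if s >= threshold]


rng = np.random.default_rng(0)
query = rng.normal(size=64)
candidates = rng.normal(size=(5, 64))
ids = [f"item-{i}" for i in range(5)]
print(filter_candidates(query, candidates, ids, threshold=0.0))
```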