Integrity and Junkiness Failure Handling for Embedding-based Retrieval:
A Case Study in Social Network Search
- URL: http://arxiv.org/abs/2304.09287v1
- Date: Tue, 18 Apr 2023 20:53:47 GMT
- Title: Integrity and Junkiness Failure Handling for Embedding-based Retrieval:
A Case Study in Social Network Search
- Authors: Wenping Wang, Yunxi Guo, Chiyao Shen, Shuai Ding, Guangdeng Liao, Hao
Fu, Pramodh Karanth Prabhakar
- Abstract summary: Embedding-based retrieval is used in a variety of search applications, such as e-commerce and social network search.
In this paper, we conduct an analysis of embedding-based retrieval launched in early 2021 on our social network search engine.
We define two main categories of failures introduced by it, integrity and junkiness.
- Score: 26.705196461992845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Embedding-based retrieval is used in a variety of search
applications, such as e-commerce and social network search. While the approach
has demonstrated its efficacy in tasks like semantic matching and contextual
search, it is plagued by the problem of uncontrollable relevance. In this
paper, we conduct an analysis of embedding-based retrieval launched in early
2021 on our social network search engine, and define two main categories of
failures introduced by it, integrity and junkiness. The former refers to issues
such as hate speech and offensive content that can severely harm user
experience, while the latter includes irrelevant results like fuzzy text
matching or language mismatches. Efficient methods applied during model
inference are further proposed to resolve the issue, including indexing
treatments and targeted user cohort treatments. Though simple, the methods are
shown to yield gains in offline NDCG and online A/B test metrics in practice. We
analyze the reasons for the improvements, pointing out that our methods are
only preliminary attempts at this important but challenging problem. We put
forward potential future directions to explore.
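The abstract reports offline gains in NDCG without restating the metric. As a reminder, the following is a minimal sketch of the textbook NDCG definition with exponential gain; it is not necessarily the exact variant used in the paper, and the function names and relevance grades are hypothetical illustrations.

```python
import math


def dcg(relevances, k=None):
    """Discounted cumulative gain of graded relevance labels in ranked order."""
    rels = relevances[:k] if k is not None else relevances
    # Gain 2^rel - 1, discounted by log2(position + 1) with 1-based positions.
    return sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(rels))


def ndcg(relevances, k=None):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal_dcg = dcg(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant results at all
    return dcg(relevances, k) / ideal_dcg


# Hypothetical labels: a junk result (grade 0) ranked second, and the same
# results after a treatment pushes the junk result to the bottom.
baseline = [3, 0, 2, 1]
treated = [3, 2, 1, 0]
print(ndcg(baseline), ndcg(treated))
```

Under this definition, demoting or removing junk results from the top positions raises NDCG, which is the kind of offline signal the abstract refers to.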
Related papers
- VectorSearch: Enhancing Document Retrieval with Semantic Embeddings and
Optimized Search [1.0411820336052784]
We propose VectorSearch, which leverages advanced algorithms, embeddings, and indexing techniques for refined retrieval.
By utilizing innovative multi-vector search operations and encoding searches with advanced language models, our approach significantly improves retrieval accuracy.
Experiments on real-world datasets show that VectorSearch outperforms baseline metrics.
arXiv Detail & Related papers (2024-09-25T21:58:08Z)
- Robust Candidate Generation for Entity Linking on Short Social Media
Texts [1.5006258585503875]
We show that in the domain of Tweets, such methods suffer as users often include informal spelling, limited context, and lack of specificity.
We demonstrate a hybrid solution using long contextual representation from Wikipedia, achieving 0.93 recall.
arXiv Detail & Related papers (2022-10-14T02:47:31Z)
- Semantic Search for Large Scale Clinical Ontologies [63.71950996116403]
We present a deep learning approach to build a search system for large clinical vocabularies.
We propose a Triplet-BERT model and a method that generates training data based on semantic training data.
The model is evaluated using five real benchmark data sets, and the results show that our approach achieves high results on both free-text-to-concept and concept-to-concept searching over vocabularies.
arXiv Detail & Related papers (2022-01-01T05:15:42Z)
- Exposing Query Identification for Search Transparency [69.06545074617685]
We explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the role of queries and documents in two classes of search systems.
We derive an evaluation metric to measure the quality of a ranking of exposing queries, and conduct an empirical analysis focusing on various practical aspects of approximate EQI.
arXiv Detail & Related papers (2021-10-14T20:19:27Z)
- Weakly-Supervised Aspect-Based Sentiment Analysis via Joint
Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis.
We learn <sentiment, aspect> joint topic embeddings in the word embedding space.
We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z)
- Tasks Integrated Networks: Joint Detection and Retrieval for Image
Search [99.49021025124405]
In many real-world searching scenarios (e.g., video surveillance), the objects are seldom accurately detected or annotated.
We first introduce an end-to-end Integrated Net (I-Net), which has three merits.
We further propose an improved I-Net, called DC-I-Net, which makes two new contributions.
arXiv Detail & Related papers (2020-09-03T03:57:50Z)
- On the Social and Technical Challenges of Web Search Autosuggestion
Moderation [118.47867428272878]
Autosuggestions are typically generated by machine learning (ML) systems trained on a corpus of search logs and document representations.
While current search engines have become increasingly proficient at suppressing such problematic suggestions, persistent issues remain.
We discuss several dimensions of problematic suggestions, difficult issues along the pipeline, and why our discussion applies to the increasing number of applications beyond web search.
arXiv Detail & Related papers (2020-07-09T19:22:00Z)
- Mining Implicit Relevance Feedback from User Behavior for Web Question
Answering [92.45607094299181]
We conduct the first study exploring the correlation between user behavior and passage relevance.
Our approach significantly improves the accuracy of passage ranking without extra human labeled data.
In practice, this work has proved effective in substantially reducing the human labeling cost for the QA service in a global commercial search engine.
arXiv Detail & Related papers (2020-06-13T07:02:08Z)
- Leveraging Cognitive Search Patterns to Enhance Automated Natural
Language Retrieval Performance [0.0]
We highlight cognitive reformulation patterns that mimic user search behaviour.
We formalize the application of these patterns by considering a query conceptual representation.
A genetic algorithm-based weighting process allows placing emphasis on terms according to their conceptual role-type.
arXiv Detail & Related papers (2020-04-21T14:13:33Z)
- WAC: A Corpus of Wikipedia Conversations for Online Abuse Detection [0.0]
We propose an original framework, based on the Wikipedia Comment corpus, with comment-level annotations of different types.
This large corpus of more than 380k annotated messages opens perspectives for online abuse detection and especially for context-based approaches.
We also propose, in addition to this corpus, a complete benchmarking platform to stimulate and fairly compare scientific works around the problem of content abuse detection.
arXiv Detail & Related papers (2020-03-13T10:26:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.