Siamese BERT-based Model for Web Search Relevance Ranking Evaluated on a New Czech Dataset
- URL: http://arxiv.org/abs/2112.01810v1
- Date: Fri, 3 Dec 2021 09:45:18 GMT
- Title: Siamese BERT-based Model for Web Search Relevance Ranking Evaluated on a New Czech Dataset
- Authors: Matěj Kocián, Jakub Náplava, Daniel Štancl, Vladimír Kadlec
- Abstract summary: We present our real-time approach to the document ranking problem leveraging a BERT-based siamese architecture.
We release DaReCzech, a unique data set of 1.6 million Czech user query-document pairs with manually assigned relevance levels.
We also release Small-E-Czech, an Electra-small language model pre-trained on a large Czech corpus.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Web search engines focus on serving highly relevant results within hundreds
of milliseconds. Pre-trained language transformer models such as BERT are
therefore hard to use in this scenario due to their high computational demands.
We present our real-time approach to the document ranking problem leveraging a
BERT-based siamese architecture. The model is already deployed in a commercial
search engine and it improves production performance by more than 3%. For
further research and evaluation, we release DaReCzech, a unique data set of 1.6
million Czech user query-document pairs with manually assigned relevance
levels. We also release Small-E-Czech, an Electra-small language model
pre-trained on a large Czech corpus. We believe this data will support the endeavours of both the search relevance and multilingual research communities.
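Since the abstract centers on a siamese (bi-encoder) architecture that makes BERT-scale ranking feasible in real time, a minimal sketch may help: queries and documents are encoded independently, so document embeddings can be precomputed offline and only the query is encoded at serving time. The sketch assumes the released Small-E-Czech checkpoint is available on the Hugging Face hub as "Seznam/small-e-czech"; the pooling strategy and scoring head of the deployed model may differ.

```python
# Minimal sketch of a siamese (bi-encoder) relevance scorer.
# Assumption: Small-E-Czech is published as "Seznam/small-e-czech";
# mean pooling and cosine scoring are illustrative choices, not the
# deployed model's exact design.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Seznam/small-e-czech")
encoder = AutoModel.from_pretrained("Seznam/small-e-czech")

def embed(texts):
    """Encode texts independently -- the property that lets document
    embeddings be precomputed offline for real-time serving."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state   # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)      # ignore padding
    return (hidden * mask).sum(1) / mask.sum(1)       # mean pooling

doc_vecs = embed(["Levné letenky do Prahy", "Jízdní řády vlaků"])  # offline
query_vec = embed(["letenky praha"])                                # query time
print(F.cosine_similarity(query_vec, doc_vecs))  # relevance scores
```

Because only a cheap similarity is computed per query-document pair at serving time, this design fits the sub-second latency budget that rules out full cross-attention scoring.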
Related papers
- Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality [74.59049806800176]
This demo paper highlights the Tevatron toolkit's key features, bridging academia and industry.
We showcase a unified dense retriever achieving strong multilingual and multimodal effectiveness.
We also release OmniEmbed, to the best of our knowledge, the first embedding model that unifies text, image document, video, and audio retrieval.
arXiv Detail & Related papers (2025-05-05T08:52:49Z)
- MMTEB: Massive Multilingual Text Embedding Benchmark [85.18187649328792]
We introduce the Massive Multilingual Text Embedding Benchmark (MMTEB).
MMTEB covers over 500 quality-controlled evaluation tasks across 250+ languages.
We develop several highly multilingual benchmarks, which we use to evaluate a representative set of models.
arXiv Detail & Related papers (2025-02-19T10:13:43Z)
- A Comparative Study of Text Retrieval Models on DaReCzech [1.4582718436069808]
This article presents a comprehensive evaluation of 7 off-the-shelf document retrieval models: Splade, Plaid, Plaid-X, SimCSE, Contriever, OpenAI ADA and Gemma2.
The primary objective of our experiments is to estimate the quality of modern retrieval approaches in the Czech language.
arXiv Detail & Related papers (2024-11-19T23:19:46Z)
- AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models [84.65095045762524]
We present three desiderata for a good benchmark for language models.
A good benchmark reveals new trends in model rankings not shown by previous benchmarks.
We use AutoBencher to create datasets for math, multilingual, and knowledge-intensive question answering.
arXiv Detail & Related papers (2024-07-11T10:03:47Z)
- Evaluating Embedding APIs for Information Retrieval [51.24236853841468]
We evaluate the capabilities of existing semantic embedding APIs on domain generalization and multilingual retrieval.
We find that re-ranking BM25 results using the APIs is a budget-friendly approach and is most effective in English.
For non-English retrieval, re-ranking still improves the results, but a hybrid model with BM25 works best, albeit at a higher cost.
arXiv Detail & Related papers (2023-05-10T16:40:52Z)
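A toy sketch of the hybrid setup described in the entry above may clarify it: BM25 provides a cheap candidate ranking, an embedding model re-scores the candidates, and the final score interpolates the two. The `embed` function is a stand-in for any commercial embedding API, and the interpolation weight is an illustrative assumption, not a value from the paper.

```python
# Toy hybrid retrieval: BM25 candidates re-scored with embeddings.
# `embed` is a placeholder for an embedding API; real systems would
# also normalize the two score scales before interpolating.
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

docs = ["cheap flights to prague", "prague hotel deals", "train schedule brno"]
bm25 = BM25Okapi([d.split() for d in docs])

def embed(text: str) -> np.ndarray:
    """Placeholder for a semantic embedding API call (toy vectors)."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(8)

def hybrid_search(query: str, alpha: float = 0.5, k: int = 2):
    bm25_scores = bm25.get_scores(query.split())
    candidates = np.argsort(bm25_scores)[::-1][:k]   # cheap first stage
    q = embed(query)
    scored = []
    for i in candidates:
        d = embed(docs[i])
        sim = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))
        scored.append((docs[i], alpha * bm25_scores[i] + (1 - alpha) * sim))
    return sorted(scored, key=lambda x: -x[1])

print(hybrid_search("flights to prague"))
```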
- Building Machine Translation Systems for the Next Thousand Languages [102.24310122155073]
We describe results in three research domains: building clean, web-mined datasets for 1500+ languages, developing practical MT models for under-served languages, and studying the limitations of evaluation metrics for these languages.
We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.
arXiv Detail & Related papers (2022-05-09T00:24:13Z)
- Czech Dataset for Cross-lingual Subjectivity Classification [13.70633147306388]
We introduce a new Czech subjectivity dataset of 10k manually annotated subjective and objective sentences from movie reviews and descriptions.
Two annotators annotated the dataset, reaching a Cohen's kappa inter-annotator agreement of 0.83.
We fine-tune five pre-trained BERT-like models to set a monolingual baseline for the new dataset, achieving 93.56% accuracy.
arXiv Detail & Related papers (2022-04-29T07:31:46Z)
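The monolingual baseline in the entry above is a standard fine-tuning setup. The following sketch shows one gradient step of binary subjective/objective classification; it uses "Seznam/small-e-czech" purely for illustration, whereas the paper fine-tunes five different BERT-like models.

```python
# One fine-tuning step for subjectivity classification (1 = subjective,
# 0 = objective). Checkpoint choice and hyperparameters are illustrative.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "Seznam/small-e-czech"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["Skvělý film, báječná atmosféra.", "Film měl premiéru v roce 2020."]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)  # loss computed internally
outputs.loss.backward()
optimizer.step()
print("loss:", outputs.loss.item())
```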
- Leveraging Advantages of Interactive and Non-Interactive Models for Vector-Based Cross-Lingual Information Retrieval [12.514666775853598]
We propose a novel framework to leverage the advantages of interactive and non-interactive models.
We introduce a semi-interactive mechanism, which builds our model on a non-interactive architecture but encodes each document together with its associated multilingual queries.
Our method significantly boosts retrieval accuracy while maintaining computational efficiency.
arXiv Detail & Related papers (2021-11-03T03:03:19Z)
- BERTaú: Itaú BERT for digital customer service [0.0]
We introduce a new Portuguese financial-domain language representation model called BERTaú.
Our contribution is that the BERTaú pretrained language model requires less data, reaches state-of-the-art performance on three NLP tasks, and yields a smaller, lighter model that makes deployment feasible.
arXiv Detail & Related papers (2021-01-28T14:29:03Z)
- Nearest Neighbor Machine Translation [113.96357168879548]
We introduce $k$-nearest-neighbor machine translation ($k$NN-MT).
It predicts tokens with a nearest neighbor classifier over a large datastore of cached examples.
It consistently improves performance across many settings.
arXiv Detail & Related papers (2020-10-01T22:24:46Z)
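The kNN-MT entry above describes a concrete decoding rule: at each step, the base model's token distribution is interpolated with a distribution induced by the nearest neighbors of the current decoder state in a datastore of cached (state, next-token) pairs. Below is a self-contained toy version; the datastore, vocabulary, interpolation weight, and temperature are all made-up illustrative values.

```python
# Toy kNN-MT step: interpolate the base model's next-token distribution
# with a nearest-neighbor distribution from a cached datastore.
import numpy as np

rng = np.random.default_rng(0)
keys = rng.standard_normal((100, 16))           # cached decoder states
values = rng.integers(0, 50, size=100)          # tokens that followed them
VOCAB, K, LAMBDA, TEMP = 50, 8, 0.5, 10.0       # illustrative settings

def knn_distribution(query: np.ndarray) -> np.ndarray:
    """Vocabulary distribution from the K nearest cached examples."""
    dists = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(dists)[:K]
    weights = np.exp(-dists[nearest] / TEMP)
    p = np.zeros(VOCAB)
    for idx, w in zip(nearest, weights):
        p[values[idx]] += w                     # vote for the cached token
    return p / p.sum()

def knn_mt_step(model_probs: np.ndarray, query: np.ndarray) -> np.ndarray:
    return LAMBDA * knn_distribution(query) + (1 - LAMBDA) * model_probs

model_probs = np.full(VOCAB, 1 / VOCAB)         # stand-in base model output
print(knn_mt_step(model_probs, rng.standard_normal(16)).argmax())
```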
- AutoRC: Improving BERT Based Relation Classification Models via Architecture Search [50.349407334562045]
BERT-based relation classification (RC) models have achieved significant improvements over traditional deep learning models.
However, no consensus has been reached on the optimal architecture.
We design a comprehensive search space for BERT-based RC models and employ a neural architecture search (NAS) method to automatically discover the design choices.
arXiv Detail & Related papers (2020-09-22T16:55:49Z)
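The AutoRC entry above amounts to searching a space of discrete design choices. The sketch below shows the skeleton of such a search with random sampling; the search space and options are illustrative assumptions rather than the paper's actual space, and `evaluate` is a placeholder for fine-tuning a candidate model and measuring dev-set F1.

```python
# Skeleton of architecture search over BERT-based RC design choices.
# Space, options, and evaluation are illustrative placeholders.
import random

SEARCH_SPACE = {
    "entity_pooling": ["cls", "entity_start", "mean"],
    "num_mlp_layers": [1, 2, 3],
    "dropout": [0.1, 0.2, 0.3],
    "use_entity_markers": [True, False],
}

def sample_architecture() -> dict:
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def evaluate(arch: dict) -> float:
    """Placeholder: really build/fine-tune the model and return dev F1."""
    return random.random()

best = max((sample_architecture() for _ in range(20)), key=evaluate)
print("best design choices:", best)
```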
- Cross-Lingual Low-Resource Set-to-Description Retrieval for Global E-Commerce [83.72476966339103]
Cross-lingual information retrieval is a new task in cross-border e-commerce.
We propose a novel cross-lingual matching network (CLMN) with the enhancement of context-dependent cross-lingual mapping.
Experimental results indicate that the proposed CLMN yields impressive results on this challenging task.
arXiv Detail & Related papers (2020-05-17T08:10:51Z)
- Cross-lingual Information Retrieval with BERT [8.052497255948046]
We explore the use of the popular bidirectional language model, BERT, to model and learn the relevance between English queries and foreign-language documents.
A deep relevance matching model based on BERT is introduced and trained by finetuning a pretrained multilingual BERT model with weak supervision.
Experimental results on retrieving Lithuanian documents with short English queries show that our model is effective and outperforms competitive baseline approaches.
arXiv Detail & Related papers (2020-04-24T23:32:13Z)
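In contrast to the siamese model sketched earlier, the relevance matching model in the last entry scores each query-document pair jointly, so nothing can be precomputed. A hedged sketch follows; "bert-base-multilingual-cased" is the public mBERT checkpoint, and the freshly initialized regression head stands in for the paper's model, which is fine-tuned with weak supervision.

```python
# Cross-encoder relevance matching: query and document are concatenated
# into one input so BERT attends across languages. The scoring head here
# is untrained -- the paper fine-tunes it with weak supervision.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)

def relevance(query: str, document: str) -> float:
    """Score one (query, document) pair; encoded jointly, so unlike a
    siamese model no document representation can be cached."""
    batch = tokenizer(query, document, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return model(**batch).logits.item()

print(relevance("cheap flights", "Pigūs skrydžiai iš Vilniaus"))
```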
This list is automatically generated from the titles and abstracts of the papers on this site.