A Comparative Study of Text Retrieval Models on DaReCzech
- URL: http://arxiv.org/abs/2411.12921v1
- Date: Tue, 19 Nov 2024 23:19:46 GMT
- Title: A Comparative Study of Text Retrieval Models on DaReCzech
- Authors: Jakub Stetina, Martin Fajcik, Michal Stefanik, Michal Hradis
- Abstract summary: This article presents a comprehensive evaluation of 7 off-the-shelf document retrieval models: Splade, Plaid, Plaid-X, SimCSE, Contriever, OpenAI ADA and Gemma2.
The primary objective of our experiments is to estimate the quality of modern retrieval approaches in the Czech language.
- Score: 1.4582718436069808
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This article presents a comprehensive evaluation of 7 off-the-shelf document retrieval models: Splade, Plaid, Plaid-X, SimCSE, Contriever, OpenAI ADA, and Gemma2, chosen to determine their performance on the Czech retrieval dataset DaReCzech. The primary objective of our experiments is to estimate the quality of modern retrieval approaches in the Czech language. Our analyses cover retrieval quality, speed, and memory footprint. We also analyze whether it is better to apply the models directly to Czech text, or to machine-translate the text into English and then retrieve in English. Our experiments identify the most effective option for Czech information retrieval. The findings revealed notable performance differences among the models, with Gemma2 achieving the highest precision and recall, while Contriever performed poorly. Conclusively, the SPLADE and PLAID models offered a balance of efficiency and performance.
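The evaluation above compares models on per-query retrieval quality, typically reported as precision and recall. As a minimal, illustrative sketch of how such metrics are computed for one query (this is not the authors' code, and the document IDs are made up):

```python
# Illustrative sketch: precision@k and recall@k for a single query.
# `retrieved` is the model's ranked list of document IDs; `relevant`
# is the set of IDs judged relevant for the query.

def precision_recall_at_k(retrieved, relevant, k):
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# One relevant document ("d2") found in the top 3 of 2 relevant overall:
p, r = precision_recall_at_k(["d1", "d2", "d3", "d4"], {"d2", "d5"}, k=3)
# p = 1/3, r = 1/2
```

In a full benchmark these per-query values are averaged over all queries; the cutoff k and the relevance judgments come from the dataset, here DaReCzech.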
Related papers
- SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization [64.95852289011385]
As large language models (LLMs) continue to scale up, their performance on various downstream tasks has significantly improved. However, evaluating their capabilities has become increasingly expensive, as performing inference on a large number of benchmark samples incurs high computational costs. We propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection.
arXiv Detail & Related papers (2026-02-08T11:12:45Z) - Large Language Models for the Summarization of Czech Documents: From History to the Present [2.124799222903955]
Text summarization is the task of automatically condensing longer texts into shorter, coherent summaries while preserving the original meaning and key information. Czech summarization remains underexplored, largely due to the inherent linguistic complexity of Czech and the lack of high-quality annotated datasets. We address this gap by leveraging the capabilities of Large Language Models (LLMs), specifically Mistral and mT5. We also propose a translation-based approach that first translates Czech texts into English, summarizes them using an English-language model, and then translates the summaries back into Czech.
arXiv Detail & Related papers (2025-11-24T07:40:31Z) - Large Language Models for Summarizing Czech Historical Documents and Beyond [1.4680035572775534]
Summarization is the task of shortening a larger body of text into a concise version while retaining its essential meaning and key information. We employ large language models such as Mistral and mT5 to achieve new state-of-the-art results on the modern Czech summarization dataset SumeCzech. We introduce a novel dataset called Posel od Čerchova for summarization of historical Czech documents, with baseline results.
arXiv Detail & Related papers (2025-08-14T06:07:49Z) - BenCzechMark : A Czech-centric Multitask and Multimetric Benchmark for Large Language Models with Duel Scoring Mechanism [30.267465719961585]
BenCzechMark (BCM) is the first comprehensive Czech language benchmark designed for large language models.
Our benchmark encompasses 50 challenging tasks, with corresponding test datasets, primarily in native Czech, with 11 newly collected ones.
These tasks span 8 categories and cover diverse domains, including historical Czech news, essays from pupils or language learners, and spoken word.
arXiv Detail & Related papers (2024-12-23T19:45:20Z) - AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models [84.65095045762524]
We present three desiderata for a good benchmark for language models.
benchmark reveals new trends in model rankings not shown by previous benchmarks.
We use AutoBencher to create datasets for math, multilingual, and knowledge-intensive question answering.
arXiv Detail & Related papers (2024-07-11T10:03:47Z) - Advancing Translation Preference Modeling with RLHF: A Step Towards Cost-Effective Solution [57.42593422091653]
We explore leveraging reinforcement learning with human feedback to improve translation quality.
A reward model with strong language capabilities can more sensitively learn the subtle differences in translation quality.
arXiv Detail & Related papers (2024-02-18T09:51:49Z) - Some Like It Small: Czech Semantic Embedding Models for Industry Applications [0.0]
This article focuses on the development and evaluation of Small-sized Czech sentence embedding models.
Small models are important components for real-time industry applications in resource-constrained environments.
Ultimately, this article presents practical applications of the developed sentence embedding models in Seznam.cz, the Czech search engine.
arXiv Detail & Related papers (2023-11-23T11:14:13Z) - Strategies for improving low resource speech to text translation relying
on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST).
We conducted experiments on both simulated and real low-resource setups, on the language pairs English-Portuguese and Tamasheq-French, respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z) - FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z) - Czech Dataset for Cross-lingual Subjectivity Classification [13.70633147306388]
We introduce a new Czech subjectivity dataset of 10k manually annotated subjective and objective sentences from movie reviews and descriptions.
Two annotators annotated the dataset, reaching a Cohen's kappa inter-annotator agreement of 0.83.
We fine-tune five pre-trained BERT-like models to set a monolingual baseline for the new dataset, achieving 93.56% accuracy.
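For reference, the Cohen's kappa agreement reported above is computed from the two annotators' label sequences. The sketch below is illustrative (the labels are made up) and assumes the standard two-rater formula: observed agreement corrected by the agreement expected from each rater's label frequencies.

```python
# Illustrative sketch: Cohen's kappa for two annotators over the same items,
# e.g. "s" (subjective) vs "o" (objective) labels.

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap given each annotator's label rates.
    cats = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats
    )
    return (observed - expected) / (1 - expected)

kappa = cohens_kappa(["s", "s", "o", "o"], ["s", "o", "o", "o"])
# observed = 0.75, expected = 0.5, kappa = 0.5
```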
arXiv Detail & Related papers (2022-04-29T07:31:46Z) - LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval [55.097573036580066]
Experimental results show that LaPraDoR achieves state-of-the-art performance compared with supervised dense retrieval models.
Compared to re-ranking, our lexicon-enhanced approach can be run in milliseconds (22.5x faster) while achieving superior performance.
arXiv Detail & Related papers (2022-03-11T18:53:12Z) - From Good to Best: Two-Stage Training for Cross-lingual Machine Reading Comprehension [51.953428342923885]
We develop a two-stage approach to enhance the model performance.
The first stage targets recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer.
The second stage focuses on precision: an answer-aware contrastive learning mechanism is developed to learn the fine difference between the accurate answer and other candidates.
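One plausible form of the precision-stage objective described above is a margin-based contrastive loss that pushes the accurate answer's score above the other candidates'. The sketch below is an illustration of that general idea only; the function name, scores, and margin value are assumptions, not taken from the paper.

```python
# Hedged sketch: a hinge-style contrastive loss over candidate scores.
# pos_score is the model's score for the accurate answer; neg_scores are
# scores of competing candidates. Each negative within `margin` of the
# positive contributes to the loss.

def contrastive_margin_loss(pos_score, neg_scores, margin=1.0):
    return sum(max(0.0, margin - pos_score + s) for s in neg_scores)

loss = contrastive_margin_loss(2.5, [1.0, 2.2, 0.3])
# Only the 2.2 candidate violates the margin: max(0, 1 - 2.5 + 2.2) = 0.7
```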
arXiv Detail & Related papers (2021-12-09T07:31:15Z) - Siamese BERT-based Model for Web Search Relevance Ranking Evaluated on a New Czech Dataset [0.0]
We present our real-time approach to the document ranking problem leveraging a BERT-based siamese architecture.
We release DaReCzech, a unique data set of 1.6 million Czech user query-document pairs with manually assigned relevance levels.
We also release Small-E-Czech, an Electra-small language model pre-trained on a large Czech corpus.
arXiv Detail & Related papers (2021-12-03T09:45:18Z) - SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval [11.38022203865326]
The SPLADE model provides highly sparse representations and competitive results with respect to state-of-the-art dense and sparse approaches.
We modify the pooling mechanism, benchmark a model solely based on document expansion, and introduce models trained with distillation.
Overall, SPLADE is considerably improved, with more than 9% gains on NDCG@10 on TREC DL 2019, leading to state-of-the-art results on the BEIR benchmark.
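With sparse lexical representations like SPLADE's, both queries and documents become sparse term-weight vectors, and relevance reduces to a dot product over shared vocabulary terms. A minimal sketch of that scoring step (the terms and weights below are invented for illustration):

```python
# Illustrative sketch: scoring a query against a document when both are
# sparse term->weight mappings, as produced by sparse lexical models.
# Relevance is the dot product over terms present in both vectors.

def sparse_score(query_vec, doc_vec):
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())

q = {"praha": 1.2, "doprava": 0.8}                      # expanded query terms
d = {"praha": 0.9, "metro": 0.5, "doprava": 0.4}        # document terms
score = sparse_score(q, d)   # 1.2*0.9 + 0.8*0.4 = 1.4
```

Because most weights are zero, such scores can be served from an inverted index, which is what makes these models fast at retrieval time.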
arXiv Detail & Related papers (2021-09-21T10:43:42Z) - Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
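The ROUGE-1 scores cited above measure unigram overlap between a generated summary and a reference. The sketch below shows a simplified recall-oriented variant (real ROUGE implementations add tokenization details, stemming options, and F-measure variants):

```python
# Illustrative sketch: ROUGE-1 recall -- the fraction of reference unigrams
# that also appear in the candidate summary, with clipped counts.
from collections import Counter

def rouge1_recall(candidate, reference):
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())

score = rouge1_recall("the cat sat", "the cat sat down")
# 3 of 4 reference unigrams recovered -> 0.75
```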
arXiv Detail & Related papers (2020-10-18T00:21:53Z) - Reading Comprehension in Czech via Machine Translation and Cross-lingual Transfer [2.8273701718153563]
This work focuses on building reading comprehension systems for Czech, without requiring any manually annotated Czech training data.
We automatically translated SQuAD 1.1 and SQuAD 2.0 datasets to Czech to create training and development data.
We then trained and evaluated several BERT and XLM-RoBERTa baseline models.
arXiv Detail & Related papers (2020-07-03T13:09:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.