Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever
- URL: http://arxiv.org/abs/2408.16672v4
- Date: Sat, 14 Sep 2024 07:41:06 GMT
- Title: Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever
- Authors: Rohan Jha, Bo Wang, Michael Günther, Georgios Mastrapas, Saba Sturua, Isabelle Mohr, Andreas Koukounas, Mohammad Kalim Akram, Nan Wang, Han Xiao,
- Abstract summary: ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders.
Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks.
- Score: 6.221757399678299
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this work we propose a number of incremental improvements to the ColBERT model architecture and training pipeline, using methods shown to work in the more mature single-vector embedding model training paradigm, particularly those that apply to heterogeneous multilingual data or boost efficiency with little tradeoff. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks.
Related papers
- CodeXEmbed: A Generalist Embedding Model Family for Multiligual and Multi-task Code Retrieval [103.116634967815]
We introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters.
Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework.
Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on CoIR benchmark.
arXiv Detail & Related papers (2024-11-19T16:54:45Z) - P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from massive ones, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z) - SPLATE: Sparse Late Interaction Retrieval [13.607085390630647]
SPLATE is a lightweight adaptation of the ColBERTv2 model which learns an MLM adapter''
Our pipeline achieves the same effectiveness as the PLAID ColBERTv2 engine by re-ranking 50 documents that can be retrieved under 10ms.
arXiv Detail & Related papers (2024-04-22T07:51:13Z) - Beyond Two-Tower Matching: Learning Sparse Retrievable
Cross-Interactions for Recommendation [80.19762472699814]
Two-tower models are a prevalent matching framework for recommendation, which have been widely deployed in industrial applications.
It suffers two main challenges, including limited feature interaction capability and reduced accuracy in online serving.
We propose a new matching paradigm named SparCode, which supports not only sophisticated feature interactions but also efficient retrieval.
arXiv Detail & Related papers (2023-11-30T03:13:36Z) - Beyond Contrastive Learning: A Variational Generative Model for
Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z) - Introducing Neural Bag of Whole-Words with ColBERTer: Contextualized
Late Interactions using Enhanced Reduction [10.749746283569847]
ColBERTer is a neural retrieval model using contextualized late interaction (ColBERT) with enhanced reduction.
For its multi-vector component, ColBERTer reduces the number of stored per document by learning unique whole-word representations for the terms in each document.
Results on the MS MARCO and TREC-DL collection show that ColBERTer can reduce the storage footprint by up to 2.5x, while maintaining effectiveness.
arXiv Detail & Related papers (2022-03-24T14:28:07Z) - Leveraging Advantages of Interactive and Non-Interactive Models for
Vector-Based Cross-Lingual Information Retrieval [12.514666775853598]
We propose a novel framework to leverage the advantages of interactive and non-interactive models.
We introduce semi-interactive mechanism, which builds our model upon non-interactive architecture but encodes each document together with its associated multilingual queries.
Our methods significantly boost the retrieval accuracy while maintaining the computational efficiency.
arXiv Detail & Related papers (2021-11-03T03:03:19Z) - Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR)
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z) - CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot
Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with the existing work, our method does not rely on bilingual sentences for training, and requires only one training process for multiple target languages.
arXiv Detail & Related papers (2020-06-11T13:15:59Z) - ColBERT: Efficient and Effective Passage Search via Contextualized Late
Interaction over BERT [24.288824715337483]
ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.
We extensively evaluate ColBERT using two recent passage search datasets.
arXiv Detail & Related papers (2020-04-27T14:21:03Z) - Cross-lingual Information Retrieval with BERT [8.052497255948046]
We explore the use of the popular bidirectional language model, BERT, to model and learn the relevance between English queries and foreign-language documents.
A deep relevance matching model based on BERT is introduced and trained by finetuning a pretrained multilingual BERT model with weak supervision.
Experimental results of the retrieval of Lithuanian documents against short English queries show that our model is effective and outperforms the competitive baseline approaches.
arXiv Detail & Related papers (2020-04-24T23:32:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.