Enhancing Model Performance in Multilingual Information Retrieval with
Comprehensive Data Engineering Techniques
- URL: http://arxiv.org/abs/2302.07010v1
- Date: Tue, 14 Feb 2023 12:37:32 GMT
- Title: Enhancing Model Performance in Multilingual Information Retrieval with
Comprehensive Data Engineering Techniques
- Authors: Qi Zhang, Zijian Yang, Yilun Huang, Ze Chen, Zijian Cai, Kangxu Wang,
Jiewen Zheng, Jiarong He, Jin Gao
- Abstract summary: We fine-tune pre-trained multilingual transformer-based models on the MIRACL dataset.
Our model improvement is mainly achieved through diverse data engineering techniques.
We secure 2nd place in the Surprise-Languages track with a score of 0.835 and 3rd place in the Known-Languages track with an average nDCG@10 score of 0.716 across the 16 known languages on the final leaderboard.
- Score: 10.57012904999091
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present our solution to the Multilingual Information
Retrieval Across a Continuum of Languages (MIRACL) challenge of WSDM CUP
2023 (https://project-miracl.github.io/). Our solution focuses on
enhancing the ranking stage, where we fine-tune pre-trained multilingual
transformer-based models on the MIRACL dataset. Our model improvement is mainly
achieved through diverse data engineering techniques, including the collection
of additional relevant training data, data augmentation, and negative sampling.
Our fine-tuned model effectively determines the semantic relevance between
queries and documents, resulting in a significant improvement in the efficiency
of the multilingual information retrieval process. Finally, our team is pleased
to achieve remarkable results in this challenging competition, securing 2nd
place in the Surprise-Languages track with a score of 0.835 and 3rd place in
the Known-Languages track with an average nDCG@10 score of 0.716 across the 16
known languages on the final leaderboard.
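The abstract names the technique (fine-tuning a multilingual transformer for the ranking stage, with additional training data, data augmentation, and negative sampling) but not the exact architecture or training recipe. The sketch below illustrates the general setup it describes: a multilingual cross-encoder trained to score query-passage relevance, with non-relevant candidates used as sampled negatives. The model name (xlm-roberta-base), hyperparameters, and helper names are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of ranking-stage fine-tuning with negative sampling.
# Model choice and hyperparameters are assumptions, not the paper's recipe.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "xlm-roberta-base"  # assumed; the paper only says "multilingual transformer-based models"

class PairDataset(Dataset):
    """(query, passage, label) pairs; label 1 = relevant, 0 = sampled negative."""
    def __init__(self, pairs, tokenizer, max_len=256):
        self.pairs, self.tok, self.max_len = pairs, tokenizer, max_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, i):
        query, passage, label = self.pairs[i]
        enc = self.tok(query, passage, truncation=True, max_length=self.max_len,
                       padding="max_length", return_tensors="pt")
        return {k: v.squeeze(0) for k, v in enc.items()}, torch.tensor(label, dtype=torch.float)

def train(pairs, epochs=1, lr=2e-5, batch_size=16,
          device="cuda" if torch.cuda.is_available() else "cpu"):
    tok = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1).to(device)
    loader = DataLoader(PairDataset(pairs, tok), batch_size=batch_size, shuffle=True)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    model.train()
    for _ in range(epochs):
        for batch, labels in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            logits = model(**batch).logits.squeeze(-1)  # one relevance logit per pair
            loss = loss_fn(logits, labels.to(device))
            loss.backward()
            opt.step()
            opt.zero_grad()
    return model, tok

# Toy usage: one relevant passage and one sampled negative for the same query.
if __name__ == "__main__":
    pairs = [
        ("Who wrote Don Quixote?", "Don Quixote was written by Miguel de Cervantes.", 1),
        ("Who wrote Don Quixote?", "Don Quixote was first published in 1605.", 0),
    ]
    model, tok = train(pairs, epochs=1, batch_size=2)
```

At inference time the same model would score each retrieved candidate for a query, the candidates would be reranked by score, and nDCG@10 would then be computed over the reranked list, which is the metric reported on the leaderboard.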
Related papers
- HYBRINFOX at CheckThat! 2024 -- Task 1: Enhancing Language Models with Structured Information for Check-Worthiness Estimation [0.8083061106940517]
This paper summarizes the experiments and results of the HYBRINFOX team for the CheckThat! 2024 - Task 1 competition.
We propose an approach enriching Language Models such as RoBERTa with embeddings produced by triples.
arXiv Detail & Related papers (2024-07-04T11:33:54Z)
- CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment [38.35458193262633]
English-centric models are usually suboptimal in other languages.
We propose a novel approach called CrossIn, which utilizes a mixed composition of cross-lingual instruction tuning data.
arXiv Detail & Related papers (2024-04-18T06:20:50Z)
- Extrapolating Large Language Models to Non-English by Aligning Languages [109.09051737966178]
Existing large language models show disparate capability across different languages.
In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z)
- An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
arXiv Detail & Related papers (2023-05-23T08:43:42Z)
- KInITVeraAI at SemEval-2023 Task 3: Simple yet Powerful Multilingual Fine-Tuning for Persuasion Techniques Detection [0.0]
This paper presents the best-performing solution to Subtask 3 of SemEval-2023 Task 3, dedicated to persuasion techniques detection.
Due to the highly multilingual character of the input data and the large number of 23 predicted labels, we opted to fine-tune pre-trained transformer-based language models.
arXiv Detail & Related papers (2023-04-24T09:06:43Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Building Machine Translation Systems for the Next Thousand Languages [102.24310122155073]
We describe results in three research domains: building clean, web-mined datasets for 1500+ languages, developing practical MT models for under-served languages, and studying the limitations of evaluation metrics for these languages.
We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.
arXiv Detail & Related papers (2022-05-09T00:24:13Z)
- Multilingual Transfer Learning for QA Using Translation as Data Augmentation [13.434957024596898]
We explore strategies that improve cross-lingual transfer by bringing the multilingual embeddings closer in the semantic space.
We propose two novel strategies, language adversarial training and language arbitration framework, which significantly improve the (zero-resource) cross-lingual transfer performance.
Empirically, we show that the proposed models outperform the previous zero-shot baseline on the recently introduced multilingual MLQA and TyDiQA datasets.
arXiv Detail & Related papers (2020-12-10T20:29:34Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.