Cross-Lingual Training with Dense Retrieval for Document Retrieval
- URL: http://arxiv.org/abs/2109.01628v1
- Date: Fri, 3 Sep 2021 17:15:38 GMT
- Title: Cross-Lingual Training with Dense Retrieval for Document Retrieval
- Authors: Peng Shi, Rui Zhang, He Bai, and Jimmy Lin
- Abstract summary: We explore different transfer techniques for document ranking from English annotations to multiple non-English languages.
Experiments are conducted on test collections in six languages (Chinese, Arabic, French, Hindi, Bengali, Spanish) from diverse language families.
We find that weakly-supervised target language transfer yields competitive performance against generation-based target language transfer.
- Score: 56.319511218754414
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dense retrieval has shown great success in passage ranking in English.
However, its effectiveness in document retrieval for non-English languages
remains unexplored due to limited training resources. In this work,
we explore different transfer techniques for document ranking from English
annotations to multiple non-English languages. Our experiments on the test
collections in six languages (Chinese, Arabic, French, Hindi, Bengali, Spanish)
from diverse language families reveal that zero-shot model-based transfer using
mBERT improves the search quality in non-English mono-lingual retrieval. Also,
we find that weakly-supervised target language transfer yields competitive
performance against generation-based target language transfer, which
requires external translators and query generators.
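As a concrete illustration of the zero-shot model-based transfer described above, the sketch below encodes a non-English query and candidate documents with an mBERT encoder and ranks by inner product. It assumes the encoder has already been fine-tuned on English relevance labels; the checkpoint name and CLS pooling here are illustrative stand-ins, not the paper's exact configuration.
```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "bert-base-multilingual-cased"  # stand-in; the paper fine-tunes this on English labels
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)

def encode(texts):
    # CLS-token embeddings as dense representations (one common pooling choice)
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=256, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]  # [batch, hidden]

# A Chinese query and candidate documents, scored by the English-trained encoder
query = encode(["气候变化的影响"])  # "impacts of climate change"
docs = encode(["全球变暖导致海平面上升。",      # "global warming raises sea levels"
               "足球是世界上最流行的运动。"])   # "football is the most popular sport"
scores = (query @ docs.T).squeeze(0)  # dot-product relevance scores
print(scores.argsort(descending=True).tolist())
```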
Related papers
- CORI: CJKV Benchmark with Romanization Integration -- A step towards Cross-lingual Transfer Beyond Textual Scripts [50.44270798959864]
Some languages are better connected than others, and target languages can benefit from transferring from closely related languages.
We study the impact of source language for cross-lingual transfer, demonstrating the importance of selecting source languages that have high contact with the target language.
arXiv Detail & Related papers (2024-04-19T04:02:50Z)
- Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages [51.301942056881146]
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages.
Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba).
We examine cross-lingual reranking with queries in English and passages in the African languages.
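A minimal sketch of how such pointwise cross-lingual reranking might look, with an English query over Swahili passages; the prompt template and the `call_llm` stub are illustrative assumptions rather than the paper's exact setup.
```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat/completion API here")

PROMPT = ("Query (English): {query}\n"
          "Passage (Swahili): {passage}\n"
          "Does the passage answer the query? Answer Yes or No.")

def rerank(query, passages):
    judged = []
    for p in passages:
        answer = call_llm(PROMPT.format(query=query, passage=p))
        judged.append((answer.strip().lower().startswith("yes"), p))
    # Relevant passages first; Python's sort is stable, so ties keep retrieval order
    return [p for relevant, p in sorted(judged, key=lambda x: not x[0])]
```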
arXiv Detail & Related papers (2023-12-26T18:38:54Z)
- BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer [81.5984433881309]
We introduce BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format.
BUFFET is designed to establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer.
Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer.
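To make the setting concrete, here is a minimal sketch of a few-shot in-context cross-lingual prompt in the spirit of BUFFET's sequence-to-sequence format: English demonstrations followed by a target-language test input. The field names and the sentiment task are illustrative assumptions.
```python
def build_prompt(instruction, demos, test_input):
    parts = [instruction]
    for x, y in demos:  # English (input, output) demonstrations
        parts.append(f"Input: {x}\nOutput: {y}")
    parts.append(f"Input: {test_input}\nOutput:")  # target-language test instance
    return "\n\n".join(parts)

prompt = build_prompt(
    instruction="Classify the sentiment of the input as positive or negative.",
    demos=[("The movie was wonderful.", "positive"),
           ("A dull, lifeless film.", "negative")],
    test_input="यह फ़िल्म शानदार थी।",  # Hindi: "This film was great."
)
print(prompt)
```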
arXiv Detail & Related papers (2023-05-24T08:06:33Z)
- A Simple and Effective Method to Improve Zero-Shot Cross-Lingual Transfer Learning [6.329304732560936]
Existing zero-shot cross-lingual transfer methods rely on parallel corpora or bilingual dictionaries.
We propose Embedding-Push, Attention-Pull, and Robust targets to transfer English embeddings to virtual multilingual embeddings without semantic loss.
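For contrast, a sketch of the dictionary-based baseline that such methods aim to avoid: orthogonal Procrustes alignment, which maps source-language word embeddings into the target space using a bilingual dictionary. This is a standard technique shown for context, not the paper's method.
```python
import numpy as np

def procrustes(X, Y):
    # X, Y: [n_pairs, dim] embeddings of dictionary-aligned (source, target) words
    U, _, Vt = np.linalg.svd(X.T @ Y)  # SVD of the cross-covariance matrix
    return U @ Vt                      # orthogonal W minimizing ||X @ W - Y||_F

# Usage: W = procrustes(src_dict_vecs, tgt_dict_vecs); mapped = all_src_vecs @ W
```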
arXiv Detail & Related papers (2022-10-18T15:36:53Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce IGLUE, the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
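A schematic of the translate-test strategy that IGLUE finds strongest, simplified here to the text side of these multimodal tasks; both stubs are assumptions standing in for an MT system and an English-only task model.
```python
def translate_to_english(text: str) -> str:
    raise NotImplementedError("any machine translation system")

def english_task_model(text: str) -> str:
    raise NotImplementedError("a model fine-tuned on English task data only")

def translate_test(target_language_input: str) -> str:
    # Translate at inference time, then reuse the English model unchanged
    return english_task_model(translate_to_english(target_language_input))

# Zero-shot transfer would instead feed the untranslated input directly
# to a multilingual model fine-tuned on English task data.
```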
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- Revisiting the Primacy of English in Zero-shot Cross-lingual Transfer [39.360667403003745]
Zero-shot cross-lingual transfer is emerging as a practical solution.
English is the dominant source language for transfer, as reinforced by popular zero-shot benchmarks.
We find that other high-resource languages such as German and Russian often transfer more effectively.
arXiv Detail & Related papers (2021-06-30T16:05:57Z)
- Pivot Through English: Reliably Answering Multilingual Questions without Document Retrieval [4.4973334555746]
Existing methods for open-retrieval question answering in lower resource languages (LRLs) lag significantly behind English.
We formulate a task setup more realistic given available resources, one that circumvents document retrieval to reliably transfer knowledge from English to lower resource languages.
Within this task setup we propose Reranked Maximal Inner Product Search (RM-MIPS), akin to semantic similarity retrieval over the English training set with reranking.
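A hedged sketch of the two-stage idea behind RM-MIPS: maximal inner product search over encoded English training questions, followed by reranking of the shortlist, after which the stored answer of the best match is reused. The encoder/reranker stubs and `top_k` value are illustrative assumptions.
```python
import numpy as np

def rm_mips(query_vec, train_vecs, train_answers, rerank_fn, top_k=10):
    # Stage 1: maximal inner product search over the English training questions
    scores = train_vecs @ query_vec           # [n_train]
    shortlist = np.argsort(-scores)[:top_k]   # indices of the top-k candidates
    # Stage 2: rerank the shortlist with a stronger (slower) scorer
    best = max(shortlist, key=lambda i: rerank_fn(query_vec, train_vecs[i]))
    return train_answers[best]                # reuse the stored English answer
```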
arXiv Detail & Related papers (2020-12-28T04:38:45Z)
- A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT [60.9051207862378]
Multilingual BERT works remarkably well on cross-lingual transfer tasks.
Data size and context window size are crucial factors for transferability.
There is a computationally cheap but effective approach to improve the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z)
- Teaching a New Dog Old Tricks: Resurrecting Multilingual Retrieval Using Zero-shot Learning [30.868309879441615]
We tackle the lack of data by leveraging pre-trained multilingual language models to transfer a retrieval system trained on English collections to non-English queries and documents.
Our results show that the proposed approach can significantly outperform unsupervised retrieval techniques for Arabic, Mandarin Chinese, and Spanish.
arXiv Detail & Related papers (2019-12-30T20:46:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.