Cross-Lingual Document Retrieval with Smooth Learning
- URL: http://arxiv.org/abs/2011.00701v1
- Date: Mon, 2 Nov 2020 03:17:39 GMT
- Title: Cross-Lingual Document Retrieval with Smooth Learning
- Authors: Jiapeng Liu, Xiao Zhang, Dan Goldwasser, Xiao Wang
- Abstract summary: Cross-lingual document search is an information retrieval task in which the queries' language differs from the documents' language.
We propose a novel end-to-end robust framework that achieves improved performance in cross-lingual search with different documents' languages.
- Score: 31.638708227607214
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-lingual document search is an information retrieval task in which the
queries' language differs from the documents' language. In this paper, we study
the instability of neural document search models and propose a novel end-to-end
robust framework that achieves improved performance in cross-lingual search
with different documents' languages. This framework includes a novel measure of
the relevance, smooth cosine similarity, between queries and documents, and a
novel loss function, Smooth Ordinal Search Loss, as the objective. We further
provide theoretical guarantee on the generalization error bound for the
proposed framework. We conduct experiments to compare our approach with other
document search models, and observe significant gains under commonly used
ranking metrics on the cross-lingual document retrieval task in a variety of
languages.
Related papers
- Examining Multilingual Embedding Models Cross-Lingually Through LLM-Generated Adversarial Examples [38.18495961129682]
This paper introduces a novel cross-lingual search task that does not require a large semantic corpus.
It focuses on the ability of a model to cross-lingually rank the true parallel sentence higher than challenging distractors generated by a large language model.
We create a case study of our introduced CLSD task for the language pair German-French in the news domain.
arXiv Detail & Related papers (2025-02-12T18:54:37Z) - DOGR: Leveraging Document-Oriented Contrastive Learning in Generative Retrieval [10.770281363775148]
We propose a novel and general generative retrieval framework, namely Leveraging Document-Oriented Contrastive Learning in Generative Retrieval (DOGR)
It adopts a two-stage learning strategy that captures the relationship between queries and documents comprehensively through direct interactions.
Negative sampling methods and corresponding contrastive learning objectives are implemented to enhance the learning of semantic representations.
arXiv Detail & Related papers (2025-02-11T03:25:42Z) - Optimizing Multi-Stage Language Models for Effective Text Retrieval [0.0]
We introduce a novel two-phase text retrieval pipeline optimized for Japanese legal datasets.
Our method leverages advanced language models to achieve state-of-the-art performance.
To further enhance robustness and adaptability, we incorporate an ensemble model that integrates multiple retrieval strategies.
arXiv Detail & Related papers (2024-12-26T16:05:19Z) - Unified Multimodal Interleaved Document Representation for Retrieval [57.65409208879344]
We propose a method that holistically embeds documents interleaved with multiple modalities.
We merge the representations of segmented passages into one single document representation.
We show that our approach substantially outperforms relevant baselines.
arXiv Detail & Related papers (2024-10-03T17:49:09Z) - Query-oriented Data Augmentation for Session Search [71.84678750612754]
We propose query-oriented data augmentation to enrich search logs and empower the modeling.
We generate supplemental training pairs by altering the most important part of a search context.
We develop several strategies to alter the current query, resulting in new training data with varying degrees of difficulty.
arXiv Detail & Related papers (2024-07-04T08:08:33Z) - Beyond Contrastive Learning: A Variational Generative Model for
Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z) - From Easy to Hard: A Dual Curriculum Learning Framework for
Context-Aware Document Ranking [41.8396866002968]
We propose a curriculum learning framework for context-aware document ranking.
We aim to guide the model gradually toward a global optimum.
Experiments on two real query log datasets show that our proposed framework can improve the performance of several existing methods significantly.
arXiv Detail & Related papers (2022-08-22T12:09:12Z) - Learning Diverse Document Representations with Deep Query Interactions
for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions.
Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
arXiv Detail & Related papers (2022-08-08T16:00:55Z) - A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embedding for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z) - A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code--text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z) - Explaining Relationships Between Scientific Documents [55.23390424044378]
We address the task of explaining relationships between two scientific documents using natural language text.
In this paper we establish a dataset of 622K examples from 154K documents.
arXiv Detail & Related papers (2020-02-02T03:54:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.