Cross-Lingual Document Retrieval with Smooth Learning
- URL: http://arxiv.org/abs/2011.00701v1
- Date: Mon, 2 Nov 2020 03:17:39 GMT
- Title: Cross-Lingual Document Retrieval with Smooth Learning
- Authors: Jiapeng Liu, Xiao Zhang, Dan Goldwasser, Xiao Wang
- Abstract summary: Cross-lingual document search is an information retrieval task in which the queries' language differs from the documents' language.
We propose a novel end-to-end robust framework that achieves improved performance in cross-lingual search across different document languages.
- Score: 31.638708227607214
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-lingual document search is an information retrieval task in which the
queries' language differs from the documents' language. In this paper, we study
the instability of neural document search models and propose a novel end-to-end
robust framework that achieves improved performance in cross-lingual search
across different document languages. This framework includes a novel measure of
relevance between queries and documents, smooth cosine similarity, and a
novel loss function, Smooth Ordinal Search Loss, as the objective. We further
provide a theoretical guarantee on the generalization error bound for the
proposed framework. We conduct experiments to compare our approach with other
document search models, and observe significant gains under commonly used
ranking metrics on the cross-lingual document retrieval task in a variety of
languages.
Related papers
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z)
- Query-oriented Data Augmentation for Session Search [71.84678750612754]
We propose query-oriented data augmentation to enrich search logs and empower the modeling.
We generate supplemental training pairs by altering the most important part of a search context.
We develop several strategies to alter the current query, resulting in new training data with varying degrees of difficulty.
arXiv Detail & Related papers (2024-07-04T08:08:33Z)
- Detecting Structured Language Alternations in Historical Documents by Combining Language Identification with Fourier Analysis [0.0]
We introduce the task of detecting distinct patterns of multilinguality based on the frequency of structured language alternations within a document.
We present a generalizable workflow to identify documents in a historic language with a nonstandard language and script combination, Armeno-Turkish.
arXiv Detail & Related papers (2024-01-25T23:54:34Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- From Easy to Hard: A Dual Curriculum Learning Framework for Context-Aware Document Ranking [41.8396866002968]
We propose a curriculum learning framework for context-aware document ranking.
We aim to guide the model gradually toward a global optimum.
Experiments on two real query log datasets show that our proposed framework can improve the performance of several existing methods significantly.
arXiv Detail & Related papers (2022-08-22T12:09:12Z)
- Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions.
Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
arXiv Detail & Related papers (2022-08-08T16:00:55Z)
- Bilingual Topic Models for Comparable Corpora [9.509416095106491]
We propose a binding mechanism between the distributions of the paired documents.
To estimate the similarity of documents written in different languages, we use cross-lingual word embeddings learned with shallow neural networks.
We evaluate the proposed binding mechanism by extending two topic models: a bilingual adaptation of LDA that assumes bag-of-words inputs and a model that incorporates part of the text structure in the form of boundaries of semantically coherent segments.
arXiv Detail & Related papers (2021-11-30T10:53:41Z)
- A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embeddings for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z)
- Unbiased Sentence Encoder For Large-Scale Multi-lingual Search Engines [0.0]
We present a multi-lingual sentence encoder that can be used in search engines as a query and document encoder.
This embedding enables a semantic similarity score between queries and documents that can be an important feature in document ranking and relevancy.
arXiv Detail & Related papers (2021-03-01T07:19:16Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code-text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)
- Explaining Relationships Between Scientific Documents [55.23390424044378]
We address the task of explaining relationships between two scientific documents using natural language text.
In this paper we establish a dataset of 622K examples from 154K documents.
arXiv Detail & Related papers (2020-02-02T03:54:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.