Related papers: Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval

URL: http://arxiv.org/abs/2203.03367v1
Date: Mon, 7 Mar 2022 13:20:46 GMT
Title: Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval
Authors: Dingkun Long, Qiong Gao, Kuan Zou, Guangwei Xu, Pengjun Xie, Ruijie Guo, Jian Xu, Guanjun Jiang, Luxi Xing, Ping Yang
Abstract summary: We present a novel multi-domain Chinese dataset for passage retrieval (Multi-CPR). The dataset is collected from three different domains: E-commerce, Entertainment video, and Medical. We find that the performance of retrieval models trained on a general-domain dataset inevitably decreases on specific domains.
Score: 19.000263567641817
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Passage retrieval is a fundamental task in information retrieval (IR) research, which has drawn much attention recently. In English, the availability of large-scale annotated datasets (e.g., MS MARCO) and the emergence of deep pre-trained language models (e.g., BERT) have resulted in a substantial improvement of existing passage retrieval systems. However, in Chinese, especially for specific domains, passage retrieval systems are still immature because high-quality annotated datasets are limited in scale. Therefore, in this paper, we present a novel multi-domain Chinese dataset for passage retrieval (Multi-CPR). The dataset is collected from three different domains: E-commerce, Entertainment video, and Medical. Each dataset contains millions of passages and a certain amount of human-annotated query-passage related pairs. We implement various representative passage retrieval methods as baselines. We find that the performance of retrieval models trained on a general-domain dataset inevitably decreases on specific domains. Nevertheless, a passage retrieval system built on an in-domain annotated dataset can achieve significant improvement, which demonstrates the necessity of domain-labeled data for further optimization. We hope the release of the Multi-CPR dataset can serve as a benchmark for the Chinese passage retrieval task in specific domains and advance future studies.

Related papers

The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora [6.594531626178451]
Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. We study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. We propose a simple retrieval strategy that addresses this source of failure by enforcing equal retrieval from both languages.
arXiv Detail & Related papers (2025-07-10T08:38:31Z)
Automated Collection of Evaluation Dataset for Semantic Search in Low-Resource Domain Language [4.5224851085910585]
Domain-specific languages that use a lot of specific terminology often fall into the category of low-resource languages. This study addresses the challenge of automatically collecting test datasets to evaluate semantic search in a low-resource, domain-specific German language.
arXiv Detail & Related papers (2024-12-13T09:47:26Z)
MultiADE: A Multi-domain Benchmark for Adverse Drug Event Extraction [11.458594744457521]
Active adverse event surveillance monitors Adverse Drug Events (ADE) from different data sources. Most datasets or shared tasks focus on extracting ADEs from a particular type of text. Domain generalisation - the ability of a machine learning model to perform well on new, unseen domains (text types) - is under-explored. We build a benchmark for adverse drug event extraction, which we named MultiADE.
arXiv Detail & Related papers (2024-05-28T09:57:28Z)
A Dataset of Open-Domain Question Answering with Multiple-Span Answers [11.291635421662338]
Multi-span answer extraction, also known as the task of multi-span question answering (MSQA), is critical for real-world applications. There is a notable lack of publicly available MSQA benchmarks in Chinese. We present CLEAN, a comprehensive Chinese multi-span question answering dataset.
arXiv Detail & Related papers (2024-02-15T13:03:57Z)
Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [103.0865116794534]
We introduce large models into the data collection pipeline to guide the generation of domain-specific information. We refer to this approach as Retrieve-from-CC. It not only collects data related to domain-specific knowledge but also mines the data containing potential reasoning procedures from the public corpus.
arXiv Detail & Related papers (2024-01-26T03:38:23Z)
Bridging the Domain Gaps in Context Representations for k-Nearest Neighbor Neural Machine Translation [57.49095610777317]
$k$-Nearest neighbor machine translation ($k$NN-MT) has attracted increasing attention due to its ability to non-parametrically adapt to new translation domains. We propose a novel approach to boost the datastore retrieval of $k$NN-MT by reconstructing the original datastore. Our method can effectively boost the datastore retrieval and translation quality of $k$NN-MT.
arXiv Detail & Related papers (2023-05-26T03:04:42Z)
NaSGEC: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts [51.64770549988806]
We introduce NaSGEC, a new dataset to facilitate research on Chinese grammatical error correction (CGEC) for native speaker texts from multiple domains. To broaden the target domain, we annotate multiple references for 12,500 sentences from three native domains, i.e., social media, scientific writing, and examination. We provide solid benchmark results for NaSGEC by employing cutting-edge CGEC models and different training data.
arXiv Detail & Related papers (2023-05-25T13:05:52Z)
Combining Data Generation and Active Learning for Low-Resource Question Answering [23.755283239897132]
We propose a novel approach that combines data augmentation via question-answer generation with Active Learning to improve performance in low-resource settings. Our findings show that our novel approach, where humans are incorporated in a data generation approach, boosts performance in the low-resource, domain-specific setting.
arXiv Detail & Related papers (2022-11-27T16:31:33Z)
Addressing Issues of Cross-Linguality in Open-Retrieval Question Answering Systems For Emergent Domains [67.99403521976058]
We demonstrate a cross-lingual open-retrieval question answering system for the emergent domain of COVID-19. Our system adopts a corpus of scientific articles to ensure that retrieved documents are reliable. We show that a deep semantic retriever greatly benefits from training on our English-to-all data and significantly outperforms a BM25 baseline in the cross-lingual setting.
arXiv Detail & Related papers (2022-01-26T19:27:32Z)
Variational Attention: Propagating Domain-Specific Knowledge for Multi-Domain Learning in Crowd Counting [75.80116276369694]
In crowd counting, because labelling is laborious, collecting a new large-scale dataset is perceived as intractable. We resort to multi-domain joint learning and propose a simple but effective Domain-specific Knowledge Propagating Network (DKPNet). It is mainly achieved by proposing the novel Variational Attention (VA) technique for explicitly modeling the attention distributions for different domains.
arXiv Detail & Related papers (2021-08-18T08:06:37Z)
FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine Translation [53.87731008029645]
We present a real-world fine-grained domain adaptation task in machine translation (FDMT). The FDMT dataset consists of four sub-domains of information technology: autonomous vehicles, AI education, real-time networks and smart phone. We conduct quantitative experiments and deep analyses in this new setting, which benchmarks the fine-grained domain adaptation task.
arXiv Detail & Related papers (2020-12-31T17:15:09Z)
WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization. Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation. Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.