Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval
- URL: http://arxiv.org/abs/2203.03367v1
- Date: Mon, 7 Mar 2022 13:20:46 GMT
- Title: Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval
- Authors: Dingkun Long, Qiong Gao, Kuan Zou, Guangwei Xu, Pengjun Xie, Ruijie Guo, Jian Xu, Guanjun Jiang, Luxi Xing, Ping Yang
- Abstract summary: We present a novel multi-domain Chinese dataset for passage retrieval (Multi-CPR).
The dataset is collected from three domains: E-commerce, Entertainment video, and Medical.
We find that the performance of retrieval models trained on a general-domain dataset inevitably decreases on specific domains.
- Score: 19.000263567641817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Passage retrieval is a fundamental task in information retrieval (IR)
research and has drawn much attention recently. For English, the
availability of large-scale annotated datasets (e.g., MS MARCO) and the emergence
of deep pre-trained language models (e.g., BERT) have substantially
improved existing passage retrieval systems. For Chinese, however,
and especially in specific domains, passage retrieval systems remain immature
because high-quality annotated datasets are limited in scale. In this
paper, we therefore present a novel multi-domain Chinese dataset for passage
retrieval (Multi-CPR). The dataset is collected from three domains:
E-commerce, Entertainment video, and Medical. Each domain's dataset contains
millions of passages and a certain number of human-annotated relevant
query-passage pairs.
We implement various representative passage retrieval methods as baselines. We
find that the performance of retrieval models trained on a general-domain
dataset inevitably decreases on specific domains. Nevertheless, a passage
retrieval system built on an in-domain annotated dataset can achieve significant
improvement, which demonstrates the necessity of domain-labeled data for
further optimization. We hope the release of the Multi-CPR dataset will
serve as a benchmark for the Chinese passage retrieval task in specific domains
and advance future studies.
Related papers
- MultiADE: A Multi-domain Benchmark for Adverse Drug Event Extraction [11.458594744457521]
Active adverse event surveillance monitors Adverse Drug Events (ADE) from different data sources.
One unanswered question is how far we are from having a single ADE extraction model that is effective on various types of text.
We contribute to answering this question by building a multi-domain benchmark for adverse drug event extraction, which we named MultiADE.
arXiv Detail & Related papers (2024-05-28T09:57:28Z)
- A Dataset of Open-Domain Question Answering with Multiple-Span Answers [11.291635421662338]
Multi-span answer extraction, also known as the task of multi-span question answering (MSQA), is critical for real-world applications.
There is a notable lack of publicly available MSQA benchmarks in Chinese.
We present CLEAN, a comprehensive Chinese multi-span question answering dataset.
arXiv Detail & Related papers (2024-02-15T13:03:57Z)
- Bridging the Domain Gaps in Context Representations for k-Nearest Neighbor Neural Machine Translation [57.49095610777317]
$k$-Nearest neighbor machine translation ($k$NN-MT) has attracted increasing attention due to its ability to non-parametrically adapt to new translation domains.
We propose a novel approach to boost the datastore retrieval of $k$NN-MT by reconstructing the original datastore.
Our method can effectively boost the datastore retrieval and translation quality of $k$NN-MT.
arXiv Detail & Related papers (2023-05-26T03:04:42Z)
- NaSGEC: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts [51.64770549988806]
We introduce NaSGEC, a new dataset to facilitate research on Chinese grammatical error correction (CGEC) for native speaker texts from multiple domains.
To broaden the target domain, we annotate multiple references for 12,500 sentences from three native domains, i.e., social media, scientific writing, and examination.
We provide solid benchmark results for NaSGEC by employing cutting-edge CGEC models and different training data.
arXiv Detail & Related papers (2023-05-25T13:05:52Z)
- Combining Data Generation and Active Learning for Low-Resource Question Answering [23.755283239897132]
We propose a novel approach that combines data augmentation via question-answer generation with Active Learning to improve performance in low-resource settings.
Our findings show that our novel approach, where humans are incorporated in a data generation approach, boosts performance in the low-resource, domain-specific setting.
arXiv Detail & Related papers (2022-11-27T16:31:33Z)
- Addressing Issues of Cross-Linguality in Open-Retrieval Question Answering Systems For Emergent Domains [67.99403521976058]
We demonstrate a cross-lingual open-retrieval question answering system for the emergent domain of COVID-19.
Our system adopts a corpus of scientific articles to ensure that retrieved documents are reliable.
We show that a deep semantic retriever greatly benefits from training on our English-to-all data and significantly outperforms a BM25 baseline in the cross-lingual setting.
arXiv Detail & Related papers (2022-01-26T19:27:32Z)
- Variational Attention: Propagating Domain-Specific Knowledge for Multi-Domain Learning in Crowd Counting [75.80116276369694]
In crowd counting, laborious labelling makes collecting a new large-scale dataset seem intractable.
We resort to multi-domain joint learning and propose a simple but effective Domain-specific Knowledge Propagating Network (DKPNet).
This is mainly achieved through the novel Variational Attention (VA) technique, which explicitly models the attention distributions for different domains.
arXiv Detail & Related papers (2021-08-18T08:06:37Z)
- FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine Translation [53.87731008029645]
We present a real-world fine-grained domain adaptation task in machine translation (FDMT).
The FDMT dataset consists of four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smartphones.
We conduct quantitative experiments and in-depth analyses in this new setting, which benchmarks the fine-grained domain adaptation task.
arXiv Detail & Related papers (2020-12-31T17:15:09Z)
- WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization.
Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation.
Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z)