HC4: A New Suite of Test Collections for Ad Hoc CLIR
- URL: http://arxiv.org/abs/2201.09992v1
- Date: Mon, 24 Jan 2022 22:52:11 GMT
- Title: HC4: A New Suite of Test Collections for Ad Hoc CLIR
- Authors: Dawn Lawrie and James Mayfield and Douglas Oard and Eugene Yang
- Abstract summary: HC4 is a new suite of test collections for ad hoc Cross-Language Information Retrieval.
The HC4 collections contain 60 topics and about half a million documents for each of Chinese and Persian, and 54 topics and five million documents for Russian.
Documents were judged on a three-grade relevance scale.
- Score: 3.816529552690824
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: HC4 is a new suite of test collections for ad hoc Cross-Language Information
Retrieval (CLIR), with Common Crawl News documents in Chinese, Persian, and
Russian, topics in English and in the document languages, and graded relevance
judgments. New test collections are needed because existing CLIR test
collections built using pooling of traditional CLIR runs have systematic gaps
in their relevance judgments when used to evaluate neural CLIR methods. The HC4
collections contain 60 topics and about half a million documents for each of
Chinese and Persian, and 54 topics and five million documents for Russian.
Active learning was used to determine which documents to annotate after being
seeded using interactive search and judgment. Documents were judged on a
three-grade relevance scale. This paper describes the design and construction
of the new test collections and provides baseline results for demonstrating
their utility for evaluating systems.
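As an illustration of how graded judgments like these are commonly used, the sketch below computes nDCG@k for a single topic. It is not taken from the paper: the 0/1/2 gain values, the document ids, and the function names `dcg` and `ndcg_at_k` are assumptions made for this example, and the paper's own gain mapping and choice of metrics may differ.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of gains listed in rank order (rank 1 first)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranking, qrels, k):
    """nDCG@k for one topic.

    ranking: document ids in the order the system returned them.
    qrels:   mapping from document id to graded gain.
    Unjudged documents are treated as gain 0 (an assumption of this sketch).
    """
    gains = [qrels.get(doc, 0) for doc in ranking[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments for one topic on an assumed 0/1/2 gain scale.
qrels = {"d1": 2, "d2": 1, "d3": 0, "d4": 2}
perfect_run = ["d1", "d4", "d2", "d3"]   # best documents first
reversed_run = ["d3", "d2", "d4", "d1"]  # best documents last

print(ndcg_at_k(perfect_run, qrels, k=4))
print(ndcg_at_k(reversed_run, qrels, k=4))
```

A run that places the highest-graded documents first scores 1.0, and pushing them down the list lowers the score, which a binary relevant/non-relevant scale cannot distinguish as finely.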
Related papers
- NeuCLIRTech: Chinese Monolingual and Cross-Language Information Retrieval Evaluation in a Challenging Domain [49.3943974580576]
This paper presents NeuCLIRTech, an evaluation collection for cross-language retrieval over technical information.
The collection consists of technical documents written in Chinese and those same documents machine translated into English.
The collection supports two retrieval scenarios: monolingual retrieval in Chinese, and cross-language retrieval with English as the query language.
arXiv Detail & Related papers (2026-02-05T05:57:55Z) - NeuCLIRBench: A Modern Evaluation Collection for Monolingual, Cross-Language, and Multilingual Information Retrieval [39.153319100127845]
This paper presents NeuCLIRBench, an evaluation collection for cross-language and multilingual retrieval.
The collection consists of documents written in Chinese, Persian, and Russian, as well as those same documents machine translated into English.
The collection supports several retrieval scenarios, including monolingual retrieval in English, Chinese, Persian, or Russian.
arXiv Detail & Related papers (2025-11-18T18:58:19Z) - Beyond Ranked Lists: The SARAL Framework for Cross-Lingual Document Set Retrieval [5.199807441687141]
Machine Translation for English Retrieval of Information in Any Language (MATERIAL) is an IARPA initiative targeted to advance the state of cross-lingual information retrieval (CLIR).
This report provides a detailed description of Information Sciences Institute's (ISI's) Summarization and domain-Adaptive Retrieval Across Language (SARAL's) effort for evaluation.
We outline our team's novel approach to handle CLIR with an emphasis on developing an approach to retrieve a query-relevant document set, and not just a ranked document-list.
arXiv Detail & Related papers (2025-11-05T06:35:33Z) - Overview of the TREC 2024 NeuCLIR Track [43.84164712459855]
The principal goal of the TREC Neural Cross-Language Information Retrieval (NeuCLIR) track is to study the effect of neural approaches on cross-language information access.
NeuCLIR includes four task types: Cross-Language Information Retrieval (CLIR) from news, Multilingual Information Retrieval (MLIR) from news, Report Generation from news, and CLIR from technical documents.
arXiv Detail & Related papers (2025-09-17T18:36:38Z) - MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query [55.486895951981566]
This paper introduces MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval.
arXiv Detail & Related papers (2025-06-03T17:59:14Z) - CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents [2.0277446818410994]
This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search.
The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform.
arXiv Detail & Related papers (2025-04-22T20:55:08Z) - GenTREC: The First Test Collection Generated by Large Language Models for Evaluating Information Retrieval Systems [0.33748750222488655]
GenTREC is the first test collection constructed entirely from documents generated by a Large Language Model (LLM).
We consider a document relevant only to the prompt that generated it, while other document-topic pairs are treated as non-relevant.
The resulting GenTREC collection comprises 96,196 documents, 300 topics, and 18,964 relevance "judgments".
arXiv Detail & Related papers (2025-01-05T00:27:36Z) - Shared Heritage, Distinct Writing: Rethinking Resource Selection for East Asian Historical Documents [60.348103523743276]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z) - DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems [99.17123445211115]
We introduce DocBench, a benchmark to evaluate large language model (LLM)-based document reading systems.
Our benchmark involves the recruitment of human annotators and the generation of synthetic questions.
It includes 229 real documents and 1,102 questions, spanning across five different domains and four major types of questions.
arXiv Detail & Related papers (2024-07-15T13:17:42Z) - A Multi-Modal Multilingual Benchmark for Document Image Classification [21.7518357653137]
We introduce two newly curated multilingual datasets WIKI-DOC and MULTIEUR-DOCLEX.
We study popular visually-rich document understanding (Document AI) models in previously untested settings in document image classification.
Experimental results show limitations of multilingual Document AI models on cross-lingual transfer across typologically distant languages.
arXiv Detail & Related papers (2023-10-25T04:35:06Z) - Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval [50.882816288076725]
Cross-lingual information retrieval is the task of searching documents in one language with queries in another.
We provide a conceptual framework for organizing different approaches to cross-lingual retrieval using multi-stage architectures for mono-lingual retrieval as a scaffold.
We implement simple yet effective reproducible baselines in the Anserini and Pyserini IR toolkits for test collections from the TREC 2022 NeuCLIR Track, in Persian, Russian, and Chinese.
arXiv Detail & Related papers (2023-04-03T14:17:00Z) - Multilingual ColBERT-X [11.768656900939048]
ColBERT-X is a dense retrieval model for Cross Language Information Retrieval (CLIR).
In CLIR, documents are written in one natural language, while the queries are expressed in another.
A related task is multilingual IR (MLIR) where the system creates a single ranked list of documents written in many languages.
arXiv Detail & Related papers (2022-09-03T06:02:52Z) - Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z) - On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z) - Detecting Cross-Language Plagiarism using Open Knowledge Graphs [7.378348990383349]
We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis.
CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata.
It reliably disambiguates homonyms and scales to allow its application to Web-scale document collections.
arXiv Detail & Related papers (2021-11-18T15:23:27Z) - Cross-Lingual Training with Dense Retrieval for Document Retrieval [56.319511218754414]
We explore different transfer techniques for document ranking from English annotations to multiple non-English languages.
We run experiments on test collections in six languages (Chinese, Arabic, French, Hindi, Bengali, Spanish) from diverse language families.
We find that weakly-supervised target language transfer yields competitive performances against the generation-based target language transfer.
arXiv Detail & Related papers (2021-09-03T17:15:38Z) - Cross-Lingual Document Retrieval with Smooth Learning [31.638708227607214]
Cross-lingual document search is an information retrieval task in which the queries' language differs from the documents' language.
We propose a novel end-to-end robust framework that achieves improved performance in cross-lingual search with different documents' languages.
arXiv Detail & Related papers (2020-11-02T03:17:39Z) - XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)