MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels
- URL: http://arxiv.org/abs/2405.07526v1
- Date: Mon, 13 May 2024 07:46:44 GMT
- Title: MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels
- Authors: Qi Chen, Xiubo Geng, Corby Rosset, Carolyn Buractaon, Jingwen Lu, Tao Shen, Kun Zhou, Chenyan Xiong, Yeyun Gong, Paul Bennett, Nick Craswell, Xing Xie, Fan Yang, Bryan Tower, Nikhil Rao, Anlei Dong, Wenqi Jiang, Zheng Liu, Mingqin Li, Chuanjie Liu, Zengzhong Li, Rangan Majumder, Jennifer Neville, Andy Oakley, Knut Magne Risvik, Harsha Vardhan Simhadri, Manik Varma, Yujing Wang, Linjun Yang, Mao Yang, Ce Zhang
- Abstract summary: We introduce MS MARCO Web Search, the first large-scale information-rich web dataset.
This dataset mimics real-world web document and query distribution.
MS MARCO Web Search offers a retrieval benchmark with three web retrieval challenge tasks.
- Score: 95.48844474720798
- License:
- Abstract: Recent breakthroughs in large models have highlighted the critical significance of data scale, labels and modalities. In this paper, we introduce MS MARCO Web Search, the first large-scale information-rich web dataset, featuring millions of real clicked query-document labels. This dataset closely mimics real-world web document and query distributions, provides rich information for various kinds of downstream tasks, and encourages research in various areas, such as generic end-to-end neural indexer models, generic embedding models, and next-generation information access systems with large language models. MS MARCO Web Search offers a retrieval benchmark with three web retrieval challenge tasks that demand innovations in both machine learning and information retrieval system research domains. As the first dataset that meets large, real and rich data requirements, MS MARCO Web Search paves the way for future advancements in AI and system research. The MS MARCO Web Search dataset is available at: https://github.com/microsoft/MS-MARCO-Web-Search.
Related papers
- Infogent: An Agent-Based Framework for Web Information Aggregation [59.67710556177564]
We introduce Infogent, a novel framework for web information aggregation.
Experiments on different information access settings demonstrate Infogent beats an existing SOTA multi-agent search framework by 7%.
arXiv Detail & Related papers (2024-10-24T18:01:28Z)
- CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [56.691926887209895]
We present CoIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities.
CoIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains.
We evaluate nine widely used retrieval models using CoIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
arXiv Detail & Related papers (2024-07-03T07:58:20Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs [112.89665642941814]
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio.
Current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code.
We propose a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning.
arXiv Detail & Related papers (2024-06-28T17:59:46Z)
- AutoWebGLM: A Large Language Model-based Web Navigating Agent [33.55199326570078]
We develop the open AutoWebGLM based on ChatGLM3-6B.
Inspired by human browsing patterns, we first design an HTML simplification algorithm to represent webpages.
We then employ a hybrid human-AI method to build web browsing data for curriculum training.
arXiv Detail & Related papers (2024-04-04T17:58:40Z)
- A Responsive Framework for Research Portals Data using Semantic Web Technology [0.6798775532273751]
The research aims to address this issue by designing a framework for the semantic organization of research portal data.
The framework focuses on the extraction of information from two specific research portals, namely Microsoft Academic and IEEE Xplore.
arXiv Detail & Related papers (2023-06-20T16:12:33Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
- SnapMode: An Intelligent and Distributed Large-Scale Fashion Image Retrieval Platform Based On Big Data and Deep Generative Adversarial Network Technologies [2.280980014008583]
It is nearly impossible for humans to manually catch up with the ongoing evolution and the continuously variable content in this domain.
This paper first proposes a scalable focused Web engine based on the distributed computing platforms to extract and process fashion data on e-commerce websites.
For the real-life implementation of the proposed solution, a Web-based application is developed on Apache Storm, Kafka, Solr, and Milvus platforms to create a fashion search engine called SnapMode.
arXiv Detail & Related papers (2022-04-08T11:08:03Z)
- A Large Visual, Qualitative and Quantitative Dataset of Web Pages [4.5002924206836]
We have created a large dataset of 49,438 Web pages.
It consists of visual, textual and numerical data types, includes all countries worldwide, and considers a broad range of topics.
arXiv Detail & Related papers (2021-05-15T01:31:25Z)
- MIRA: Leveraging Multi-Intention Co-click Information in Web-scale Document Retrieval using Deep Neural Networks [5.963438927897287]
We study the problem of deep recall models in industrial web search.
We propose a web-scale Multi-Intention Co-click document Graph.
We also present MIRA, an encoding framework based on BERT and graph attention networks.
arXiv Detail & Related papers (2020-07-03T06:32:48Z)