MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels
- URL: http://arxiv.org/abs/2405.07526v1
- Date: Mon, 13 May 2024 07:46:44 GMT
- Title: MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels
- Authors: Qi Chen, Xiubo Geng, Corby Rosset, Carolyn Buractaon, Jingwen Lu, Tao Shen, Kun Zhou, Chenyan Xiong, Yeyun Gong, Paul Bennett, Nick Craswell, Xing Xie, Fan Yang, Bryan Tower, Nikhil Rao, Anlei Dong, Wenqi Jiang, Zheng Liu, Mingqin Li, Chuanjie Liu, Zengzhong Li, Rangan Majumder, Jennifer Neville, Andy Oakley, Knut Magne Risvik, Harsha Vardhan Simhadri, Manik Varma, Yujing Wang, Linjun Yang, Mao Yang, Ce Zhang
- Abstract summary: We introduce MS MARCO Web Search, the first large-scale information-rich web dataset.
This dataset mimics real-world web document and query distribution.
MS MARCO Web Search offers a retrieval benchmark with three web retrieval challenge tasks.
- Score: 95.48844474720798
- Abstract: Recent breakthroughs in large models have highlighted the critical significance of data scale, labels and modalities. In this paper, we introduce MS MARCO Web Search, the first large-scale information-rich web dataset, featuring millions of real clicked query-document labels. This dataset closely mimics real-world web document and query distribution, provides rich information for various kinds of downstream tasks and encourages research in various areas, such as generic end-to-end neural indexer models, generic embedding models, and next-generation information access systems with large language models. MS MARCO Web Search offers a retrieval benchmark with three web retrieval challenge tasks that demand innovations in both machine learning and information retrieval system research domains. As the first dataset that meets large, real and rich data requirements, MS MARCO Web Search paves the way for future advancements in AI and system research. The MS MARCO Web Search dataset is available at: https://github.com/microsoft/MS-MARCO-Web-Search.
Related papers
- Multi-Record Web Page Information Extraction From News Websites [83.88591755871734]
In this paper, we focus on the problem of extracting information from web pages containing many records.
To address this gap, we created a large-scale, open-access dataset specifically designed for list pages.
Our dataset contains 13,120 web pages with news lists, significantly exceeding existing datasets in both scale and complexity.
arXiv Detail & Related papers (2025-02-20T15:05:00Z)
- Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents [16.161877699225986]
We develop a scalable recipe to synthesize the largest and most diverse trajectory-level dataset to date.
This dataset contains over 94K successful multimodal web trajectories, spanning 49K unique URLs, 720K screenshots, and 33M web elements.
We demonstrate strong performance on both offline and online web agent benchmarks such as Mind2Web-Live, Multimodal-Mind2Web, and MiniWob++.
arXiv Detail & Related papers (2025-02-17T02:13:48Z)
- CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [56.691926887209895]
We present CoIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities.
CoIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains.
We evaluate nine widely used retrieval models using CoIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
arXiv Detail & Related papers (2024-07-03T07:58:20Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs [112.89665642941814]
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio.
Current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code.
We propose a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning.
arXiv Detail & Related papers (2024-06-28T17:59:46Z)
- A Responsive Framework for Research Portals Data using Semantic Web Technology [0.6798775532273751]
The research aims to address this issue by designing a framework for the semantic organization of research portal data.
The framework focuses on the extraction of information from two specific research portals, namely Microsoft Academic and IEEE Xplore.
arXiv Detail & Related papers (2023-06-20T16:12:33Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
- SnapMode: An Intelligent and Distributed Large-Scale Fashion Image Retrieval Platform Based On Big Data and Deep Generative Adversarial Network Technologies [2.280980014008583]
It is nearly impossible for humans to manually catch up with the ongoing evolution and the continuously variable content in this domain.
This paper first proposes a scalable focused Web engine based on the distributed computing platforms to extract and process fashion data on e-commerce websites.
For the real-life implementation of the proposed solution, a Web-based application is developed on Apache Storm, Kafka, Solr, and Milvus platforms to create a fashion search engine called SnapMode.
arXiv Detail & Related papers (2022-04-08T11:08:03Z)
- MIRA: Leveraging Multi-Intention Co-click Information in Web-scale Document Retrieval using Deep Neural Networks [5.963438927897287]
We study the problem of deep recall model in industrial web search.
We propose a web-scale Multi-Intention Co-click document Graph.
We also present an encoding framework MIRA based on Bert and graph attention networks.
arXiv Detail & Related papers (2020-07-03T06:32:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.