Related papers: MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels

URL: http://arxiv.org/abs/2405.07526v1
Date: Mon, 13 May 2024 07:46:44 GMT
Title: MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels
Authors: Qi Chen, Xiubo Geng, Corby Rosset, Carolyn Buractaon, Jingwen Lu, Tao Shen, Kun Zhou, Chenyan Xiong, Yeyun Gong, Paul Bennett, Nick Craswell, Xing Xie, Fan Yang, Bryan Tower, Nikhil Rao, Anlei Dong, Wenqi Jiang, Zheng Liu, Mingqin Li, Chuanjie Liu, Zengzhong Li, Rangan Majumder, Jennifer Neville, Andy Oakley, Knut Magne Risvik, Harsha Vardhan Simhadri, Manik Varma, Yujing Wang, Linjun Yang, Mao Yang, Ce Zhang
Abstract summary: We introduce MS MARCO Web Search, the first large-scale information-rich web dataset. This dataset mimics real-world web document and query distribution. MS MARCO Web Search offers a retrieval benchmark with three web retrieval challenge tasks.
Score: 95.48844474720798
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent breakthroughs in large models have highlighted the critical significance of data scale, labels and modals. In this paper, we introduce MS MARCO Web Search, the first large-scale information-rich web dataset, featuring millions of real clicked query-document labels. This dataset closely mimics real-world web document and query distribution, provides rich information for various kinds of downstream tasks and encourages research in various areas, such as generic end-to-end neural indexer models, generic embedding models, and next generation information access system with large language models. MS MARCO Web Search offers a retrieval benchmark with three web retrieval challenge tasks that demand innovations in both machine learning and information retrieval system research domains. As the first dataset that meets large, real and rich data requirements, MS MARCO Web Search paves the way for future advancements in AI and system research. MS MARCO Web Search dataset is available at: https://github.com/microsoft/MS-MARCO-Web-Search.

Related papers

WebDS: An End-to-End Benchmark for Web-based Data Science [59.270670758607494]
WebDS is the first end-to-end web-based data science benchmark. It comprises 870 web-based data science tasks across 29 diverse websites. WebDS sets the stage for significant advances in the development of practically useful LLM-based data science.
arXiv Detail & Related papers (2025-08-02T06:39:59Z)
From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents [96.65646344634524]
Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking.
arXiv Detail & Related papers (2025-06-23T17:27:19Z)
AutoData: A Multi-Agent System for Open Web Data Collection [37.832257245199365]
AutoData is a novel multi-agent system for Automated web Data collection that requires minimal human intervention. Instruct2DS is a new benchmark dataset supporting live data collection from web sources across three domains: academic, finance, and sports.
arXiv Detail & Related papers (2025-05-21T04:32:35Z)
Multi-Record Web Page Information Extraction From News Websites [83.88591755871734]
In this paper, we focus on the problem of extracting information from web pages containing many records. To address this gap, we created a large-scale, open-access dataset specifically designed for list pages. Our dataset contains 13,120 web pages with news lists, significantly exceeding existing datasets in both scale and complexity.
arXiv Detail & Related papers (2025-02-20T15:05:00Z)
Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents [16.161877699225986]
We develop a scalable recipe to synthesize the largest and most diverse trajectory-level dataset to date. This dataset contains over 94K successful multimodal web trajectories, spanning 49K unique URLs, 720K screenshots, and 33M web elements. We demonstrate strong performance on both offline and online web agent benchmarks such as Mind2Web-Live, Multimodal-Mind2Web, and MiniWob++.
arXiv Detail & Related papers (2025-02-17T02:13:48Z)
Infogent: An Agent-Based Framework for Web Information Aggregation [59.67710556177564]
We introduce Infogent, a novel framework for web information aggregation. Experiments on different information access settings demonstrate Infogent beats an existing SOTA multi-agent search framework by 7%.
arXiv Detail & Related papers (2024-10-24T18:01:28Z)
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [56.691926887209895]
We present CoIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities. CoIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains. We evaluate nine widely used retrieval models using CoIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
arXiv Detail & Related papers (2024-07-03T07:58:20Z)
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs [112.89665642941814]
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio. Current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. We propose a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning.
arXiv Detail & Related papers (2024-06-28T17:59:46Z)
WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs [49.91550773480978]
This paper introduces WebCode2M, a new dataset comprising 2.56 million instances, each containing a design image along with the corresponding webpage code and layout details. To validate the effectiveness of WebCode2M, we introduce a baseline model based on the Vision Transformer (ViT), named WebCoder, and establish a benchmark for fair comparison. The benchmarking results demonstrate that our dataset significantly improves the ability of MLLMs to generate code from webpage designs.
arXiv Detail & Related papers (2024-04-09T15:05:48Z)
AutoWebGLM: A Large Language Model-based Web Navigating Agent [33.55199326570078]
We develop the open AutoWebGLM based on ChatGLM3-6B. Inspired by human browsing patterns, we first design an HTML simplification algorithm to represent webpages. We then employ a hybrid human-AI method to build web browsing data for curriculum training.
arXiv Detail & Related papers (2024-04-04T17:58:40Z)
A Responsive Framework for Research Portals Data using Semantic Web Technology [0.6798775532273751]
The research aims to address this issue by designing a framework for the semantic organization of research portal data. The framework focuses on the extraction of information from two specific research portals, namely Microsoft Academic and IEEE Xplore.
arXiv Detail & Related papers (2023-06-20T16:12:33Z)
DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We provide an open, online platform with multiple rounds of challenges to support this iterative development. The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
SnapMode: An Intelligent and Distributed Large-Scale Fashion Image Retrieval Platform Based On Big Data and Deep Generative Adversarial Network Technologies [2.280980014008583]
It is nearly impossible for humans to manually catch up with the ongoing evolution and the continuously variable content in this domain. This paper first proposes a scalable focused Web engine based on the distributed computing platforms to extract and process fashion data on e-commerce websites. For the real-life implementation of the proposed solution, a Web-based application is developed on Apache Storm, Kafka, Solr, and Milvus platforms to create a fashion search engine called SnapMode.
arXiv Detail & Related papers (2022-04-08T11:08:03Z)
A Large Visual, Qualitative and Quantitative Dataset of Web Pages [4.5002924206836]
We have created a large dataset of 49,438 Web pages. It consists of visual, textual and numerical data types, includes all countries worldwide, and considers a broad range of topics.
arXiv Detail & Related papers (2021-05-15T01:31:25Z)
MIRA: Leveraging Multi-Intention Co-click Information in Web-scale Document Retrieval using Deep Neural Networks [5.963438927897287]
We study the problem of deep recall model in industrial web search. We propose a web-scale Multi-Intention Co-click document Graph. We also present an encoding framework MIRA based on Bert and graph attention networks.
arXiv Detail & Related papers (2020-07-03T06:32:48Z)

This list is automatically generated from the titles and abstracts of the papers on this site.