Related papers: Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval

Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval

URL: http://arxiv.org/abs/2508.21788v1
Date: Fri, 29 Aug 2025 17:04:20 GMT
Title: Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval
Authors: Inés Altemir Marinas, Anastasiia Kucherenko, Andrei Kucharavy,
Abstract summary: This project presents a framework for indexing and analyzing large language training datasets using an ElasticSearch-based pipeline.<n>We apply it to SwissAI's FineWeb-2 corpus, achieving fast query performance--most searches in milliseconds, all under 2 seconds.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) rely heavily on web-scale datasets like Common Crawl, which provides over 80\% of training data for some modern models. However, the indiscriminate nature of web crawling raises challenges in data quality, safety, and ethics. Despite the critical importance of training data quality, prior research on harmful content has been limited to small samples due to computational constraints. This project presents a framework for indexing and analyzing LLM training datasets using an ElasticSearch-based pipeline. We apply it to SwissAI's FineWeb-2 corpus (1.5TB, four languages), achieving fast query performance--most searches in milliseconds, all under 2 seconds. Our work demonstrates real-time dataset analysis, offering practical tools for safer, more accountable AI systems.

Related papers

Efficient Crawling for Scalable Web Data Acquisition (Extended Version) [4.64103400183613]
SB-CLASSIFIER is a crawler that efficiently learns which hyperlinks lead to pages that link to many targets.<n>We show that our crawler is highly efficient, delivering high fractions of a site's targets while crawling only a small part.
arXiv Detail & Related papers (2026-02-12T12:23:53Z)
Leveraging Generative Models for Real-Time Query-Driven Text Summarization in Large-Scale Web Search [54.987957691350665]
Query-Driven Text Summarization (QDTS) aims to generate concise and informative summaries from textual documents based on a given query.<n>Traditional extractive summarization models, based primarily on ranking candidate summary segments, have been the dominant approach in industrial applications.<n>We propose a novel framework to pioneer the application of generative models to address real-time QDTS in industrial web search.
arXiv Detail & Related papers (2025-08-28T08:51:51Z)
WebDS: An End-to-End Benchmark for Web-based Data Science [59.270670758607494]
WebDS is the first end-to-end web-based data science benchmark.<n>It comprises 870 web-based data science tasks across 29 diverse websites.<n>WebDS sets the stage for significant advances in the development of practically useful LLM-based data science.
arXiv Detail & Related papers (2025-08-02T06:39:59Z)
DS@GT at LongEval: Evaluating Temporal Performance in Web Search Systems and Topics with Two-Stage Retrieval [44.99833362998488]
The DS@GT competition team participated in the Longitudinal Evaluation of Model Performance (LongEval) lab at CLEF 2025.<n>Our analysis of the Qwant web dataset includes exploratory data analysis with topic modeling over time.<n>Our best system achieves an average NDCG@10 of 0.296 across the entire training and test dataset, with an overall best score of 0.395 on 2023-05.
arXiv Detail & Related papers (2025-07-11T07:23:08Z)
GneissWeb: Preparing High Quality Data for LLMs at Scale [15.90638423128482]
We introduce GneissWeb, a large dataset yielding around 10 trillion tokens.<n>GneissWeb achieves a favorable trade-off between data quality and quantity.<n>We show that models trained using GneissWeb dataset outperform those trained on FineWeb-V1.1.0.
arXiv Detail & Related papers (2025-02-19T00:14:29Z)
Generative Pre-trained Ranking Model with Over-parameterization at Web-Scale (Extended Abstract) [73.57710917145212]
Learning to rank is widely employed in web searches to prioritize pertinent webpages based on input queries. We propose a emphulineGenerative ulineSemi-ulineSupervised ulinePre-trained (GS2P) model to address these challenges. We conduct extensive offline experiments on both a publicly available dataset and a real-world dataset collected from a large-scale search engine.
arXiv Detail & Related papers (2024-09-25T03:39:14Z)
Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box. This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
Utilising a Large Language Model to Annotate Subject Metadata: A Case Study in an Australian National Research Data Catalogue [18.325675189960833]
In support of open and reproducible research, there has been a rapidly increasing number of datasets made available for research. As the availability of datasets increases, it becomes more important to have quality metadata for discovering and reusing them. This paper proposes to leverage large language models (LLMs) for cost-effective annotation of subject metadata through the LLM-based in-context learning.
arXiv Detail & Related papers (2023-10-17T14:52:33Z)
A Large Scale Search Dataset for Unbiased Learning to Rank [51.97967284268577]
We introduce the Baidu-ULTR dataset for unbiased learning to rank. It involves randomly sampled 1.2 billion searching sessions and 7,008 expert annotated queries. It provides: (1) the original semantic feature and a pre-trained language model for easy usage; (2) sufficient display information such as position, displayed height, and displayed abstract; and (3) rich user feedback on search result pages (SERPs) like dwelling time.
arXiv Detail & Related papers (2022-07-07T02:37:25Z)
Text-Based Person Search with Limited Data [66.26504077270356]
Text-based person search (TBPS) aims at retrieving a target person from an image gallery with a descriptive text query. We present a framework with two novel components to handle the problems brought by limited data.
arXiv Detail & Related papers (2021-10-20T22:20:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.