CrediBench: Building Web-Scale Network Datasets for Information Integrity
- URL: http://arxiv.org/abs/2509.23340v3
- Date: Thu, 02 Oct 2025 14:03:57 GMT
- Title: CrediBench: Building Web-Scale Network Datasets for Information Integrity
- Authors: Emma Kondrup, Sebastian Sabry, Hussein Abdallah, Zachary Yang, James Zhou, Kellin Pelrine, Jean-François Godbout, Michael M. Bronstein, Reihaneh Rabbany, Shenyang Huang
- Abstract summary: CrediBench is a large-scale data processing pipeline for constructing temporal web graphs. Our approach captures the dynamic evolution of general misinformation domains. From our experiments on this graph snapshot, we demonstrate the strength of both structural and webpage content signals for learning credibility scores.
- Score: 27.562742270396086
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online misinformation poses an escalating threat, amplified by the Internet's open nature and increasingly capable LLMs that generate persuasive yet deceptive content. Existing misinformation detection methods typically focus on either textual content or network structure in isolation, failing to leverage the rich, dynamic interplay between website content and hyperlink relationships that characterizes real-world misinformation ecosystems. We introduce CrediBench: a large-scale data processing pipeline for constructing temporal web graphs that jointly model textual content and hyperlink structure for misinformation detection. Unlike prior work, our approach captures the dynamic evolution of general misinformation domains, including changes in both content and inter-site references over time. Our processed one-month snapshot extracted from the Common Crawl archive in December 2024 contains 45 million nodes and 1 billion edges, representing the largest web graph dataset made publicly available for misinformation research to date. From our experiments on this graph snapshot, we demonstrate the strength of both structural and webpage content signals for learning credibility scores, which measure source reliability. The pipeline and experimentation code are all available here, and the dataset is in this folder.
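The abstract argues that hyperlink structure alone carries a credibility signal. As a minimal, hypothetical sketch of that idea (not the paper's actual model, which also uses webpage text and operates on 45M nodes), the toy example below propagates credibility scores from a few labeled domains across an invented hyperlink graph:

```python
# Hypothetical sketch: propagating credibility scores over a toy hyperlink
# graph, illustrating how structural signals alone can inform credibility.
# Domain names, edges, and seed scores are invented for illustration.

def propagate_credibility(edges, seed_scores, iterations=20):
    """Iteratively set each unlabeled node's score to the mean score of
    the domains it links to or is linked from; labeled nodes stay fixed."""
    # Build an undirected adjacency list from directed hyperlink edges.
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

    scores = dict(seed_scores)
    for node in neighbors:
        scores.setdefault(node, 0.5)  # neutral prior for unlabeled domains

    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in seed_scores:          # keep ground-truth labels fixed
                updated[node] = seed_scores[node]
            else:
                updated[node] = sum(scores[n] for n in nbrs) / len(nbrs)
        scores = updated
    return scores

# Toy hyperlink graph: edges are (linking_domain, linked_domain).
edges = [
    ("news-a.example", "wire.example"),
    ("wire.example", "news-a.example"),
    ("hoax-1.example", "hoax-2.example"),
    ("hoax-2.example", "hoax-1.example"),
    ("unknown.example", "news-a.example"),
]
seeds = {"news-a.example": 0.9, "hoax-1.example": 0.1}
scores = propagate_credibility(edges, seeds)
print(round(scores["unknown.example"], 2))  # pulled toward credible sources
print(round(scores["hoax-2.example"], 2))   # pulled toward the hoax cluster
```

The unlabeled domain that links into the credible cluster inherits a high score, while the one embedded in the hoax cluster inherits a low one, which is the intuition behind learning credibility from webgraph structure.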
Related papers
- WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces [5.150606279179606]
WebChain is the largest open-source dataset of human-annotated trajectories on real-world websites. Our work provides the data and insights necessary to build and rigorously evaluate the next generation of scalable web agents.
arXiv Detail & Related papers (2026-03-05T15:37:34Z) - ScrapeGraphAI-100k: A Large-Scale Dataset for LLM-Based Web Information Extraction [0.0]
We introduce ScrapeGraphAI-100k, a large-scale dataset of real-world LLM extraction events. Starting from 9M events, we deduplicate and balance by schema to produce 93,695 examples spanning diverse domains. We characterize the dataset's structural diversity and its failure modes as schema complexity increases.
arXiv Detail & Related papers (2026-02-16T20:56:59Z) - SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning [48.376164461507244]
We introduce SCRIBES (SCRIpt-Based Semi-Structured Content Extraction at Web-Scale), a novel reinforcement learning framework. Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages. Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o.
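To make the "reusable script" idea concrete, here is a hand-written stand-in (SCRIBES itself learns such scripts with reinforcement learning; the page markup and field names below are invented): one extraction routine applied unchanged to several structurally similar pages.

```python
# Hypothetical sketch: one reusable extraction "script" applied to a group of
# structurally similar pages, rather than processing each page individually.
from html.parser import HTMLParser

class ProductExtractor(HTMLParser):
    """Collects the <h1> text as 'title' and the text of any element
    whose class attribute is 'price' as 'price'."""
    def __init__(self):
        super().__init__()
        self.fields = {}
        self._capture = None  # field name to assign the next text node to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self._capture = "title"
        elif attrs.get("class") == "price":
            self._capture = "price"

    def handle_data(self, data):
        if self._capture:
            self.fields[self._capture] = data.strip()
            self._capture = None

def extract(page_html):
    """The reusable script: returns structured fields for one page."""
    parser = ProductExtractor()
    parser.feed(page_html)
    return parser.fields

# The same script works on every page sharing this structure.
pages = [
    "<html><h1>Widget</h1><span class='price'>$9.99</span></html>",
    "<html><h1>Gadget</h1><span class='price'>$24.00</span></html>",
]
for page in pages:
    print(extract(page))
```

Amortizing one script over many pages is what makes the approach cheaper than running an LLM per page.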
arXiv Detail & Related papers (2025-10-02T09:27:15Z) - WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research [73.58638285105971]
This paper tackles open-ended deep research (OEDR), a complex challenge where AI agents must synthesize vast web-scale information into insightful reports. We introduce WebWeaver, a novel dual-agent framework that emulates the human research process.
arXiv Detail & Related papers (2025-09-16T17:57:21Z) - Organize the Web: Constructing Domains Enhances Pre-Training Data Curation [129.27104172458363]
We develop a framework for organizing web pages in terms of both their topic and format. We automatically annotate pre-training data by distilling annotations from a large language model into efficient curations. Our work demonstrates that constructing and mixing domains provides a valuable complement to quality-based data curation methods.
arXiv Detail & Related papers (2025-02-14T18:02:37Z) - Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - Exploiting the Semantic Knowledge of Pre-trained Text-Encoders for Continual Learning [63.48785461956983]
Continual learning allows models to learn from new data while retaining previously learned knowledge. The label information of the images offers important semantic information that can be related to previously acquired knowledge of semantic classes. We propose integrating semantic guidance within and across tasks by capturing semantic similarity using text embeddings.
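As a toy, hypothetical illustration of "capturing semantic similarity using text embeddings" (the class names and 3-d vectors below are invented stand-ins for real label embeddings from a pre-trained text encoder), one can rank previously learned classes by cosine similarity to a new class's label embedding:

```python
# Hypothetical sketch: relating a new class to previously learned classes by
# cosine similarity of their label text embeddings. Vectors are toy stand-ins.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Embeddings of classes learned in earlier tasks (invented values).
label_embeddings = {
    "tabby cat": [0.9, 0.1, 0.0],
    "tiger":     [0.7, 0.3, 0.2],
    "airliner":  [0.0, 0.1, 0.9],
}

new_class = [0.85, 0.15, 0.05]  # e.g. the embedding of a new feline class

# Rank old classes by semantic similarity to guide knowledge transfer.
ranked = sorted(label_embeddings,
                key=lambda k: cosine(new_class, label_embeddings[k]),
                reverse=True)
print(ranked[0])  # the most semantically related previous class
```

Semantically close old classes (the felines) rank above unrelated ones, which is the signal such methods use to decide what prior knowledge to transfer.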
arXiv Detail & Related papers (2024-08-02T07:51:44Z) - Bridging Local Details and Global Context in Text-Attributed Graphs [62.522550655068336]
GraphBridge is a framework that bridges local and global perspectives by leveraging contextual textual information.
Our method achieves state-of-the-art performance, while our graph-aware token reduction module significantly enhances efficiency and solves scalability issues.
arXiv Detail & Related papers (2024-06-18T13:35:25Z) - Bridging Social Media and Search Engines: Dredge Words and the Detection of Unreliable Domains [3.659498819753633]
We develop a website credibility classification and discovery system that integrates webgraph and social media contexts. We introduce the concept of dredge words: terms or phrases for which unreliable domains rank highly on search engines. We release a novel dataset of dredge words, highlighting their strong connections to both social media and online commerce platforms.
arXiv Detail & Related papers (2024-06-17T11:22:04Z) - TIE: Topological Information Enhanced Structural Reading Comprehension on Web Pages [31.291568831285442]
We propose a Topological Information Enhanced model (TIE) to transform the token-level task into a tag-level task.
TIE integrates Graph Attention Network (GAT) and Pre-trained Language Model (PLM) to leverage the information.
Experimental results demonstrate that our model outperforms strong baselines on both logical and spatial structures.
arXiv Detail & Related papers (2022-05-13T03:21:09Z) - Twitter Referral Behaviours on News Consumption with Ensemble Clustering of Click-Stream Data in Turkish Media [2.9005223064604078]
This study investigates readers' click activities on the organizations' websites to identify news consumption patterns following referrals from Twitter.
The investigation is widened to a broad perspective by linking the log data with news content to enrich the insights.
arXiv Detail & Related papers (2022-02-04T09:57:13Z) - The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models [51.39011092347136]
We introduce the Klarna Product Page dataset, a collection of webpages that surpasses existing datasets in richness and variety.
First, we empirically benchmark a range of Graph Neural Networks (GNNs) on the web element nomination task.
Second, we introduce a training refinement procedure that involves identifying a small number of relevant elements from each page.
Third, we introduce the Challenge Nomination Training Procedure, a novel training approach that further boosts nomination accuracy.
arXiv Detail & Related papers (2021-11-03T12:13:52Z) - MIRA: Leveraging Multi-Intention Co-click Information in Web-scale Document Retrieval using Deep Neural Networks [5.963438927897287]
We study the problem of deep recall models in industrial web search.
We propose a web-scale Multi-Intention Co-click document Graph.
We also present an encoding framework, MIRA, based on BERT and graph attention networks.
arXiv Detail & Related papers (2020-07-03T06:32:48Z) - SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z) - Siamese Graph Neural Networks for Data Integration [11.41207739004894]
We propose a general approach to modeling and integrating entities from structured data, such as relational databases, as well as unstructured sources, such as free text from news articles.
Our approach is designed to explicitly model and leverage relations between entities, thereby using all available information and preserving as much context as possible.
We evaluate our method on the task of integrating data about business entities, and we demonstrate that it outperforms standard rule-based systems, as well as other deep learning approaches that do not use graph-based representations.
arXiv Detail & Related papers (2020-01-17T21:51:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.